cancerit / ascatNgs

Somatic copy number analysis using WGS paired end wholegenome sequencing
http://cancerit.github.io/ascatNgs/
GNU Affero General Public License v3.0
68 stars 17 forks source link

Issue of Human reference files from 1000 genomes VCFs #100

Closed MimoriK closed 3 years ago

MimoriK commented 3 years ago

I used the method on the wiki to generate the file : SnpPositions_GRCh37_1000g.tsv (https://github.com/cancerit/ascatNgs/wiki/Human-reference-files-from-1000-genomes-VCFs). I have downloaded the file : 1000GENOMES-phase_3.vcf.gz (ftp://ftp.ensembl.org/pub/grch37/release-83/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz) and tried these two codes, but no matter which one it is, an empty output file is generated.

zgrep -F 'E_Multiple_observations' 1000GENOMES-phase_3.vcf.gz | grep -F 'TSA=SNV' |\
 perl -ane '\
 next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); \
 next if($F[0] eq $l_c && $F[1]-1000 < $l_p); \
 $F[7]=~m/MAF=([^;]+)/; next if($1 < 0.05); \
 printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];' \
 > SnpPositions_GRCh37_1000g.tsv
zgrep -F 'E_Multiple_observations' 1000GENOMES-phase_3.vcf.gz | grep -F 'TSA=SNV' |\
 perl -ane '\
 next if($F[0] !~ m/^\d+$/ && $F[0] !~ m/^[XY]$/); \
 $F[7]=~m/MAF=([^;]+)/; next if($1 < 0.05); \
 next if($F[0] eq $l_c && $F[1]-1000 < $l_p); \
 printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];' \
 > SnpPositions_GRCh37_1000g.tsv

I don't know where the problem is. When the code has not run to this line:

printf "%s\t%s\t%d\n", $F[2],$F[0],$F[1]; $l_c=$F[0]; $l_p=$F[1];

, what are $l_c and $l_p? The issue that was closed before answered incorrectly (https://github.com/cancerit/ascatNgs/issues/50).

keiranmraine commented 3 years ago

I've updated the previous issue (#50) as you've commented on that.

$l_c and $l_p are last chromosome, and last position, this is used to prevent SNPs that are very close together being generated. The code enforces a minimum distance of 1kb between output..