Closed countdigi closed 4 years ago
Sorting the VCF file by rs #:
$ wc -l tmp/grch37_duplicates.txt
19376812 tmp/grch37_duplicates.txt
19 million duplicate entries inside of the VCF file.
$ head tmp/grch37_duplicates.txt
rs1000
rs1000000008
rs1000000109
rs1000000370
rs1000000381
rs10000004
rs1000000503
rs1000000566
A few examples that the correct rs # to use appears first inside of the VCF
$ zgrep -w rs1000431655 tmp/data/grch37_vcf-03-11-2020/GCF_000001405.25.gz
NC_000005.9 70906025 rs1000431655 T C . . RS=1000431655;dbSNPBuildID=150;SSR=0;GENEINFO=MCCC2:64087;VC=SNV;INT;GNO;FREQ=GnomAD:1,3.187e-05|TOPMED:0.9999,8.76e-05
NW_003315917.2 1608683 rs1000431655 T C . . RS=1000431655;dbSNPBuildID=150;SSR=0;GENEINFO=MCCC2:64087;VC=SNV;U3;GNO;FREQ=GnomAD:1,3.187e-05|TOPMED:0.9999,8.76e-05
(base) jmoreno8@svc-3024-5-10.rc.usf.edu:~/dev/usf-hii/snptk (master u=)
$ zgrep -w rs1000 tmp/data/grch37_vcf-03-11-2020/GCF_000001405.25.gz
NC_000006.11 32153895 rs1000 T C . . RS=1000;dbSNPBuildID=36;SSR=1;GENEINFO=AGER:177|PBX2:5089;VC=SNV;PUB;U3
NT_113891.2 3624573 rs1000 T C . . RS=1000;dbSNPBuildID=36;SSR=1;GENEINFO=AGER:177|PBX2:5089;VC=SNV;PUB;U3
NT_167244.1 3468633 rs1000 T C . . RS=1000;dbSNPBuildID=36;SSR=1;GENEINFO=AGER:177|PBX2:5089;VC=SNV;PUB;U3
NT_167246.1 3496716 rs1000 T C . . RS=1000;dbSNPBuildID=36;SSR=1;GENEINFO=AGER:177|PBX2:5089;VC=SNV;PUB;U3
NT_167247.1 3533719 rs1000 T C . . RS=1000;dbSNPBuildID=36;SSR=1;GENEINFO=AGER:177|PBX2:5089;VC=SNV;PUB;U3
NT_167248.1 3414912 rs1000 T C . . RS=1000;dbSNPBuildID=36;SSR=1;GENEINFO=AGER:177|PBX2:5089;VC=SNV;PUB;U3
NT_167249.1 3501621 rs1000 T C . . RS=1000;dbSNPBuildID=36;SSR=1;GENEINFO=AGER:177|PBX2:5089;VC=SNV;PUB;U3
(base) jmoreno8@svc-3024-5-10.rc.usf.edu:~/dev/usf-hii/snptk (master u=)
$ zgrep -w rs12191629 tmp/data/grch37_vcf-03-11-2020/GCF_000001405.25.gz
NC_000006.11 26588461 rs12191629 G A . . RS=12191629;dbSNPBuildID=120;SSR=0;VC=SNV;GNO;FREQ=1000Genomes:0.9193,0.08067|ALSPAC:0.8695,0.1305|Estonian:0.8529,0.1471|GnomAD:0.8874,0.1126|NorthernSweden:0.845,0.155|TOPMED:0.8946,0.1054|TWINSUK:0.8573,0.1427|Vietnamese:0.9815,0.01852;COMMON
NW_004070866.1 2619 rs12191629 G A . . RS=12191629;dbSNPBuildID=120;SSR=0;VC=SNV;GNO;FREQ=1000Genomes:0.9193,0.08067|ALSPAC:0.8695,0.1305|Estonian:0.8529,0.1471|GnomAD:0.8874,0.1126|NorthernSweden:0.845,0.155|TOPMED:0.8946,0.1054|TWINSUK:0.8573,0.1427|Vietnamese:0.9815,0.01852;COMMON
(base) jmoreno8@svc-3024-5-10.rc.usf.edu:~/dev/usf-hii/snptk (master u=)
$ zgrep -w rs6925087 tmp/data/grch37_vcf-03-11-2020/GCF_000001405.25.gz
NC_000006.11 26592853 rs6925087 T A . . RS=6925087;dbSNPBuildID=116;SSR=0;VC=SNV;GNO;FREQ=1000Genomes:0.4834,0.5166|ALSPAC:0.5739,0.4261|Estonian:0.5701,0.4299|GnomAD:0.5543,0.4457|NorthernSweden:0.56,0.44|TOPMED:0.5418,0.4582|TWINSUK:0.5957,0.4043|Vietnamese:0.3009,0.6991;COMMON
NW_004070866.1 7011 rs6925087 T A . . RS=6925087;dbSNPBuildID=116;SSR=0;VC=SNV;GNO;FREQ=1000Genomes:0.4834,0.5166|ALSPAC:0.5739,0.4261|Estonian:0.5701,0.4299|GnomAD:0.5543,0.4457|NorthernSweden:0.56,0.44|TOPMED:0.5418,0.4582|TWINSUK:0.5957,0.4043|Vietnamese:0.3009,0.6991;COMMON
After discussion, we concluded that we should grep on NC for correct position in VCF.
After grepping for NC
, I found that there could be snps with multiple mappings that are valid but map to multiple positions for example:
$ zgrep -w 1048203695 tmp/work/dbsnp-GRCh37.gz
1048203695 Y 521433 -
1048203695 Y 471433 -
dbsnp (online):
GRCh37.p13 chr X | NC_000023.10:g.521433T>C
-- | --
GRCh37.p13 chr Y | NC_000024.9:g.471433T>C
Looking at the duplicates left inside of GRCh37, we see that they all seem to be from the PAR region
$ head tmp/grch37_duplicates.txt
1000000852
1000002895
1000004843
1000011319
1000011409
1000011813
1000012343
1000025549
1000027677
1000028487
(base) jmoreno8@hii2.rc.usf.edu:~/dev/usf-hii/snptk (master * u=)
$ wc -l !$
wc -l tmp/grch37_duplicates.txt
776457 tmp/grch37_duplicates.txt
This issue was handled by outputting 2 separate files using snptk-map-grch37-chromosomes.py
:
<outfile>.gz
<outfile>_multi_entries.gz
Added functionality in snptk map-using-rs-id
--include-file
to be able to handle these multi entry snps and not update due to ambiguity of update. Multi entires snps are all valid snps.