WGLab / doc-ANNOVAR

Documentation for the ANNOVAR software
http://annovar.openbioinformatics.org

A problem in using hg19_snp138.txt #199

Open SYSUMSD opened 1 year ago

SYSUMSD commented 1 year ago

I have used ANNOVAR to annotate human SNPs.

After using convert2annovar.pl to convert a list of SNP names into the format that table_annovar.pl requires, I found that rs12990866 has 9 locations on different chromosomes.

So I checked hg19_snp138.txt and found something strange.

rs12990866 corresponds to 8 lines in hg19_snp138.txt, so this SNP ID appears to map to 8 locations in snp138:

1239 chr14 85836141 85836142 rs12990866 0 - A A C/T genomic single unknown 0 0 unknown exact 3 MultipleAlignmentsABI,BCMHGSC_JDW,SSAHASNP, 0
645 chr18 7866051 7866052 rs12990866 0 - T T C/T genomic single unknown 0 0 unknown exact 3 ObservedMismatch,MultipleAlignmentABI,BCMHGSC_JDW,SSAHASNP, 0
1738 chr3 151204911 151204912 rs12990866 0 + T T C/T genomic single unknown 0 0 unknown exact 3 MultipleAlignmentsABI,BCMHGSC_JDW,SSAHASNP, 0
998 chr6 54195546 54195547 rs12990866 0 + C C C/T genomic single unknown 0 0 unknown exact 3 MultipleAlignmentsABI,BCMHGSC_JDW,SSAHASNP, 0
1490 chr6 118745488 118745489 rs12990866 0 + T T C/T genomic single unknown 0 0 unknown exact 3 MultipleAlignmentsABI,BCMHGSC_JDW,SSAHASNP, 0
947 chr8 47522774 47522775 rs12990866 0 + T T C/T genomic single unknown 0 0 unknown exact 3 MultipleAlignmentsABI,BCMHGSC_JDW,SSAHASNP, 0
1510 chr9 121260627 121260628 rs12990866 0 - A A C/T genomic single unknown 0 0 unknown exact 3 MultipleAlignmentsABI,BCMHGSC_JDW,SSAHASNP, 0
1639 chrX 138258432 138258433 rs12990866 0 - G G C/T genomic single unknown 0 0 unknown exact 3 MultipleAlignmentsABI,BCMHGSC_JDW,SSAHASNP, 0

But in NCBI, rs12990866 is located at chr2:209372299 (https://www.ncbi.nlm.nih.gov/snp/rs12990866).

Are there errors in hg19_snp138.txt?

kaichop commented 1 year ago

rs identifiers are compiled by dbSNP. If the same sequence context occurs multiple times in the reference genome, UCSC will map the rs ID to multiple positions. Nothing is wrong with the file; this is just a bad SNP. Note that hg19_snp138 is generated by UCSC, not ANNOVAR. I suggest not using rs IDs or dbSNP in any genomic data analysis. Use chr:start-end coordinates whenever possible.
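To see how many rs IDs are affected, the multi-mapping described above can be checked directly. The following is a minimal sketch, not part of ANNOVAR: it assumes the UCSC snp138 column order seen in the lines quoted in this thread (bin, chrom, chromStart, chromEnd, name, ...), and the function name `multi_mapped_ids` and the third sample record are hypothetical.

```python
from collections import defaultdict


def multi_mapped_ids(lines):
    """Return {rs_id: [(chrom, start, end), ...]} for IDs found at 2+ positions.

    Assumes the UCSC snp138 column order: bin, chrom, chromStart, chromEnd, name, ...
    """
    positions = defaultdict(list)
    for line in lines:
        fields = line.split()  # handles tab- or space-separated input
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        chrom, start, end, name = fields[1], int(fields[2]), int(fields[3]), fields[4]
        positions[name].append((chrom, start, end))
    return {name: locs for name, locs in positions.items() if len(locs) > 1}


# Two records quoted in this thread (truncated), plus one hypothetical
# single-position record added purely for contrast.
sample = [
    "1239 chr14 85836141 85836142 rs12990866 0 - A A C/T",
    "645 chr18 7866051 7866052 rs12990866 0 - T T C/T",
    "585 chr1 10000 10001 rs0000001 0 + A A A/G",  # hypothetical record
]
dups = multi_mapped_ids(sample)
for rs_id, locs in dups.items():
    print(rs_id, locs)
```

For the full hg19_snp138.txt, pass an open file handle instead of `sample`. The file is large, so it may be more practical to keep only the rs IDs of interest rather than every ID in the file.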

SYSUMSD commented 1 year ago

@kaichop I think that's a good idea, but I ran into some problems.

  1. The avinput file seems to need the start and end positions of the reference allele. But in my data I can't tell which allele is the reference allele, so I can't calculate the start and end positions. Can I use allele 1 as the reference allele?
  2. The rs2066847 entry in your ex1.avinput looks like this:

       50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2

     But in my data this SNP looks like this:

       16 rs2066847 50763778 G GC

     For indels, my data uses a different format for allele1 and allele2.

What effect would these differences have?
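The difference between the two indel notations can be illustrated with a small conversion. This is a sketch under stated assumptions, not ANNOVAR's own code: it assumes allele 1 really is the reference allele (which this thread leaves open and which should be checked against the reference genome), and that indels are written with a shared leading base, as in `G GC`. The function name `to_annovar` is hypothetical.

```python
def to_annovar(chrom, pos, ref, alt):
    """Convert a left-anchored allele pair to (chrom, start, end, ref, alt).

    Assumes `ref` is the reference allele and `pos` is the 1-based position
    of its first base. Insertions and deletions are written with "-".
    """
    if len(alt) > len(ref) and alt.startswith(ref):
        # Insertion: drop the shared leading bases; start == end is the
        # position of the last shared base.
        start = pos + len(ref) - 1
        return chrom, start, start, "-", alt[len(ref):]
    if len(ref) > len(alt) and ref.startswith(alt):
        # Deletion: drop the shared leading bases; the span covers only
        # the deleted bases.
        return chrom, pos + len(alt), pos + len(ref) - 1, ref[len(alt):], "-"
    # SNV or block substitution: keep alleles, span the reference allele.
    return chrom, pos, pos + len(ref) - 1, ref, alt


# The rs2066847 record from this thread: 16 rs2066847 50763778 G GC
print(to_annovar("16", 50763778, "G", "GC"))
```

With the rs2066847 record, the result has start and end 50763778, reference `-` and alternative `C`, matching the coordinates and alleles in the ex1.avinput line quoted above. If allele 1 is not actually the reference allele, the start, end, and allele columns would all be wrong, so that assumption matters.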