WGLab / doc-ANNOVAR

Documentation for the ANNOVAR software
http://annovar.openbioinformatics.org

Versioning #129

Closed roselucia closed 3 years ago

roselucia commented 3 years ago

Dear Dr. Wang,

I used ANNOVAR in my research and have a question in regard to the versioning of some databases used.

1) What is the exact name of the 1000 Genomes data release you used to generate the dataset 1000g2015aug? I found the following information on your website: 201508 collection v5b (based on 201305 alignment). Is this the exact name of the data release at 1000 Genomes? (Are you referring to these files: ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz, released 18-Aug-2015 15:28, and ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz.tbi, released 18-Aug-2015 15:08?)

2) Which release version was used for refGene 20190929 and ensGene 20190929? Is this RefSeq release 95 and Ensembl release 98?

3) What is the original source of the cytoband information, and if release versions exist, which release was available on 07 Nov 2019?

Thanks a lot!

All the best, Rose Fröhlich

kaichop commented 3 years ago
  1. The original files were downloaded directly from the 1000G website as VCF files and then reformatted, so "201508 collection v5b" should be the annotation on the 1000G website.
  2. No, this is the date when the files were downloaded. It does not correspond to a specific release version, because the UCSC Genome Browser does not follow the same versioning as the original data releases.
  3. This is from the UCSC Genome Browser, and I do not think this information changes. It differs between genome builds, but it is unlikely to change for the same build over time.
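
Since the 1000G file names carry the release details, a short sketch can pull them out for citation purposes. It assumes the naming pattern of the file quoted in the question; the field labels (`phase`, `callset_version`, `date_stamp`) are names chosen for this illustration, not official 1000 Genomes terms.

```python
import re

# Pattern modelled on the single file name quoted in the question.
# Other 1000 Genomes files may not follow it (assumption of this sketch).
RELEASE_PATTERN = re.compile(
    r"^ALL\.wgs\.(?P<phase>phase\d+)_.*_(?P<callset_version>v\d+[a-z]?)"
    r"\.(?P<date_stamp>\d{8})\.sites\.vcf\.gz$"
)


def describe_release(filename):
    """Return the phase, call-set version and date stamp found in a file name."""
    match = RELEASE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognised file name: {filename}")
    return match.groupdict()


name = "ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz"
print(describe_release(name))
```

The date stamp in the file name (20130502) is distinct from the upload date of the file (18-Aug-2015), which is why the ANNOVAR description mentions both 201508 and 201305.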

roselucia commented 3 years ago

Dear Dr. Wang,

Thanks for your fast response. I am trying to specify the data that was used for annotation with ANNOVAR in my thesis.

1) Thanks! Then I will specify the source of the database 1000g2015aug in my thesis as follows: 1000G phase 3 201508 collection v5b.

2) In the release notes on your website I found this information: "new2019Sep29: All ANNOVAR databases are transferred to AWS S3, including large files. The refGene, refGeneWithVer, knownGene and ensGene (same as GencodeBasicV31) for hg18/hg19/hg38 are updated to the latest version." (https://annovar.openbioinformatics.org/en/latest/) With this information I looked up the latest versions of the RefSeq and Ensembl databases at that time, which were RefSeq release 95 and Ensembl release 98. So is it correct that your ANNOVAR databases refGene 20190929 and ensGene 20190929 are based on RefSeq release 95 and Ensembl release 98? If not, how can I specify the data version?

3) Thanks. Do you know which paper should be cited when using cytoband data?

Thank you ever so much!

All the best!

roselucia commented 3 years ago

Dear Dr. Wang, I would be very grateful for a response! All the best, Rose

kaichop commented 3 years ago

For question 2, these datasets were updated 20190919 (https://annovar.openbioinformatics.org/en/latest/user-guide/download/), so I think your estimation is likely correct. However, I should mention that UCSC creates its own gene definitions from RefSeq (which is just a list of transcripts) by mapping them to a specific genome assembly, so I do not think these release numbers mean much.

For question 3, this is basically determined by the genome assembly (a cytoband is something observed under a microscope, so it cannot have single-base resolution); once an assembly is made, the cytoband locations are estimated from the assembly with a resolution of around 100 kb. For the exact methods, see the following description:

A full description of the method by which the chromosome band locations are estimated can be found in Furey and Haussler, 2003.

Barbara Trask, Vivian Cheung, Norma Nowak and others in the BAC Resource Consortium used fluorescent in-situ hybridization (FISH) to determine a cytogenetic location for large genomic clones on the chromosomes. The results from these experiments are the primary source of information used in estimating the chromosome band locations. For more information about the process, see the paper Cheung et al., 2001, and the accompanying web site, Human BAC Resource (https://www.ncbi.nlm.nih.gov/genome/cyto/hbrc.shtml).
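
Because cytobands are stored as coordinate intervals per assembly, looking up the band for a position is a simple interval search. The sketch below assumes the five-column UCSC cytoBand layout (chrom, chromStart, chromEnd, name, gieStain) with 0-based, half-open intervals; the three sample rows are illustrative and have not been checked against any particular build.

```python
import csv
import io

# Illustrative rows in the UCSC cytoBand column layout (tab-separated).
# Coordinates are placeholders for this sketch, not verified build data.
SAMPLE_ROWS = (
    "chr1\t0\t2300000\tp36.33\tgneg\n"
    "chr1\t2300000\t5400000\tp36.32\tgpos25\n"
    "chr1\t5400000\t7200000\tp36.31\tgneg\n"
)


def load_bands(handle):
    """Read cytoBand rows into {chrom: [(start, end, band_name), ...]}."""
    bands = {}
    for chrom, start, end, name, _stain in csv.reader(handle, delimiter="\t"):
        bands.setdefault(chrom, []).append((int(start), int(end), name))
    return bands


def band_at(bands, chrom, pos):
    """Return the band containing a 0-based position, or None if not covered."""
    for start, end, name in bands.get(chrom, []):
        if start <= pos < end:  # half-open interval [start, end)
            return name
    return None


bands = load_bands(io.StringIO(SAMPLE_ROWS))
print(band_at(bands, "chr1", 3000000))
```

The same position can fall in a different band interval on another assembly, which is why the genome build, rather than a release number, is the detail worth reporting.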

roselucia commented 3 years ago

Dear Dr. Wang, thanks a lot for your help. I will then just specify the date of the last update of your refGene and ensGene databases.

All the best, Rose
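
The approach settled on above, citing each database by name and update date, can be kept consistent with a small helper. This is only a sketch: the function name and output wording are hypothetical, and the build `hg19` in the usage lines is an assumption, since the thread does not say which build was used.

```python
from datetime import datetime


def describe_database(name, build, date_stamp):
    """Format a provenance string from a database name, build and YYYYMMDD date."""
    date = datetime.strptime(date_stamp, "%Y%m%d").date()
    return f"{name} ({build}, ANNOVAR database dated {date.isoformat()})"


# "hg19" is a placeholder build chosen for this example only.
for db in ("refGene", "ensGene"):
    print(describe_database(db, "hg19", "20190929"))
```

Parsing the date with `strptime` also catches mistyped date stamps, since an invalid value raises a `ValueError`.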