RefSeq Database Version

malnirav commented 1 year ago

What version of RefSeq database is downloaded/used in the annotation cache when executing: dotnet Downloader.dll --ga GRCh37 -o Data? Is there a way to control/leverage a specific transcript database version?

naserelmi commented 1 year ago

I think you can change them by editing the ftp address of your file in the following cs code. /CacheUtils/Commands/Download/ExternalFiles.cs then rebuild it. GCF_000001405.25_GRCh37.p13 has been used.

GH-MStamboulian commented 10 months ago

@naserelmi @malnirav did you guys figure this out? I dont think those FTP files located here are being used at all

https://github.com/Illumina/Nirvana/blob/main/CacheUtils/Commands/Download/ExternalFiles.cs

the Downloader.dll command is just using a pre-compiled database chached thats put in S3 buckets and it just downloads those every time.

@MichaelStromberg can you please explain how this script found here: https://github.com/Illumina/Nirvana/blob/main/CacheUtils/Commands/Download/ExternalFiles.cs is used and whats its purpose? how can we feed in the transcripts that we want? Nirvana is making alot of mistakes in annotating some variants and its not obvious how you guys pre-compiled these transcripts for some of the genes...

GH-MStamboulian commented 10 months ago

@rajatshuvro please be advised ... your inputs appreciated!

Cimorgh-IT commented 10 months ago

@GH-MStamboulian, You are right. The ExternalFiles.cs was not used. The Nirvana team changed the reference gene model to the newest version of RefGene and Ensemble Gene in their tool's last version, which needs an academic license.

GH-MStamboulian commented 10 months ago

@Cimorgh-IT which version is this? is there something newer than 3.18.1?

rajatshuvro commented 10 months ago

@Cimorgh-IT is correct. The current latest version is 3.22.0 and is available here: https://developer.illumina.com/illumina-connected-annotations. It is free for academic use.

3.22.0 uses updated gene models from RefSeq and Ensembl. Once you run the downloader, you will get all the files needed to run Nirvana. The new documentation is available here: https://illumina.github.io/IlluminaConnectedAnnotationsDocumentation/

Thanks Rajat

GH-MStamboulian commented 10 months ago

have any of you installed this before? im not able to follow the instructions, I downloaded the zip file and then the shell script thats supposed to unzip it and build it but it keeps erroring out:

making directory /home/mstamboulian/IlluminaConnectedAnnotationsTest/build
unzip:  cannot find or open IlluminaConnectedAnnotations-3.22.0-0-gc13dcb61-net6.0.zip, IlluminaConnectedAnnotations-3.22.0-0-gc13dcb61-net6.0.zip.zip or IlluminaConnectedAnnotations-3.22.0-0-gc13dcb61-net6.0.zip.ZIP.
Download all data files
making directory /home/mstamboulian/IlluminaConnectedAnnotationsTest/Data
Could not execute because the specified command or file was not found.
Possible reasons for this include:
  * You misspelled a built-in dotnet command.
  * You intended to execute a .NET program, but dotnet-/home/mstamboulian/IlluminaConnectedAnnotationsTest/build/Downloader.dll does not exist.
  * You intended to run a global tool, but a dotnet-prefixed executable with this name could not be found on the PATH.
run Illumina Connected Annotations on a test VCF file
Could not execute because the specified command or file was not found.
Possible reasons for this include:
  * You misspelled a built-in dotnet command.
  * You intended to execute a .NET program, but dotnet-/home/mstamboulian/IlluminaConnectedAnnotationsTest/build/Annotator.dll does not exist.
  * You intended to run a global tool, but a dotnet-prefixed executable with this name could not be found on the PATH.

The zip file I got is IlluminaConnectedAnnotations-3.22.0-0-gc13dcb61-net6.0.zip

and the shell script to unpack it is https://illumina.github.io/IlluminaConnectedAnnotationsDocumentation/assets/files/TestIlluminaConnectedAnnotations-e26785a7184802763e147e22e2a39eb6.sh

@rajatshuvro any advice?

GH-MStamboulian commented 10 months ago

I guess I need the IlluminaConnectedAnnotationsBuild.zip file instead ... can someone point me to this file please?

rajatshuvro commented 10 months ago

Hi @GH-MStamboulian , Sorry about that. Please unzip the file using unzip:

unzip IlluminaConnectedAnnotations-3.22.0-0-gc13dcb61-net6.0.zip

This will produce the necessary DLLs (Downloader.dll, Nirvana.dll, Annotator.dll). You can then follow the instructions at https://illumina.github.io/IlluminaConnectedAnnotationsDocumentation/introduction/getting-started#downloading-the-data-files to run the Downloader and obtain the files.

rajatshuvro commented 10 months ago

@GH-MStamboulian , it works if you provide the full path to the IlluminaConnectedAnnotations-3.22.0-0-gc13dcb61-net6.0.zip file.

GH-MStamboulian commented 10 months ago

@rajatshuvro thank you for this I actually was able to unzip that file and use the Downloaded.dll to download the reference database files and then use the Nirvana.dll to annotate a vcf file to test it, however Illumina annotator (the new software) is still making the same mistakes as Nirvana regardless. I can see that you guys updated the transcripts, (the transcript minor versions are now more up to date) however lets take an example here to explain the issue:

lets take the following variant: 1-98348885-G-A, bascally chromosome : 1, position : 98348885 and mut_nt G>A

then based on Nirvana and Illumina Annotator we get the following refseq annotation:

{'transcript': 'NM_000110.4',
   'source': 'RefSeq',
   'bioType': 'mRNA',
   'codons': 'Tgt/Tgt',
   'aminoAcids': 'C',
   'cdnaPos': '197',
   'cdsPos': '85',
   'exons': '2/23',
   'proteinPos': '29',
   'geneId': '1806',
   'hgnc': 'DPYD',
   'consequence': ['synonymous_variant'],
   'impact': 'low',
   'hgvsc': 'NM_000110.4:c.85=',
   'hgvsp': 'NP_000101.2:p.(Cys29=)',
   'isCanonical': True,
   'proteinId': 'NP_000101.2'}

as you notice it annotates this as as synonymous variant, and assigns it to codon Tgt to a Tgt (no transition) however in fact this should have been a missense mutation instead. I have used hg19 to call my variants and also RefSeq to interpret the transcript and in this case, which is the gene DPYD I also use the same RefSeq transcript as Nirvana for this gene which is 'NM_000110.4'

You can double check that this is in fact the wrong annotation on UCSC genome browser (make sure you pick up hg19 as your reference) and enter the genomic location and you can see that this genomic location maps to the codon ACG which codes for Arginine (R), and then a G>A transition would result the codon Cgt>Tgt transition on the reverse transcript which in turn is a missense variant thats Arg29Cys on the protein level.

Funny enough the annotation based on the ensembl transcripts are correct:

{'transcript': 'ENST00000370192.3', 'source': 'Ensembl', 'bioType': 'mRNA', 'codons': 'Cgt/Tgt', 'aminoAcids': 'R/C', 'cdnaPos': '186', 'cdsPos': '85', 'exons': '2/23', 'proteinPos': '29', 'geneId': 'ENSG00000188641', 'hgnc': 'DPYD', 'consequence': ['missense_variant'], 'impact': 'moderate', 'hgvsc': 'ENST00000370192.3:c.85C>T', 'hgvsp': 'ENSP00000359211.3:p.(Arg29Cys)', 'isCanonical': True, 'proteinId': 'ENSP00000359211.3'}

@rajatshuvro can you please let me know what the issue is here and if infact theres a bug in the annotator that needs to be addressed?

naserelmi commented 10 months ago

@GH-MStamboulian , @rajatshuvro Thanks for informing this point. I double checked your variant using ensemble VEP on GRCh37. The results where same as the new version of Nirvana. I think since these databases Ensembl and Refseq use different gene models, hence a little difference in some variants are possible. To solve this issue MANE Select term developed, which means two variants from Ensembl and Refseq are equivalent. Since these variants are just canonical, and not MANE select, I think the result is correct.

heseber commented 10 months ago

This is most likely one of those cases where the GRCh37 genome does not have the most common allele of a polymorphism, while the MANE transcript uses the most common variant. You can check Gnomad (https://gnomad.broadinstitute.org/variant/1-98348885-G-A?dataset=gnomad_r2_1), this variant has an allele frequency > 70% and a GrpMax Filtering Allele Frequency > 80%. I have seen such examples quite often. This is one of the consequences that the GRCh37 genome sequence is based on a relatively small number of individuals, and there are differences between GRCh37 and GRCh38 in terms of which variant of a polymorphism is in the genomic backbone. By the way, one might suspect that the MANE transcripts always have the most common variant of a SNP, but that's also not always true. And even worse, I have seen examples where a previous version of a transcript database entry of a MANE transcript had the most common SNP, but the current version of the MANE transcript has not. So it's kind of a mess, and leads to such issues as you described. It is always a good idea to check the allele frequency of a variant, and unless you are interested in SNPs, to filter out SNPs.

In your case, NM_000110.4 seems to contain the most common variant of this polymorphism, so no change is reported because the annotator compares against the transcript sequence and not against the genome sequence (it is still slightly confusing that it is reported as "synonymous" while it is really "no change", but I think there is no ontology term for this consequence), while ENST00000370192.3 has the minor variant of this polymorphism, which is why this is then reported as a missense variant.

rajatshuvro commented 10 months ago

You are exactly right @heseber . Thanks.

GH-MStamboulian commented 10 months ago

@rajatshuvro @naserelmi @heseber why is this going to be interpreted by allele frequencies? you can check that at chromosome 1 position 98348885 on GRCh37, the wildtype base is a G, where is the triplet TGT even coming from? It doesnt even make sense on the reverse transcript which is ACA, you can check this on the genome browser and that base is in fact a G as the reference and the surrounding triplet is acG , which would be Cgt on the reverse transcript. I didnt know that the reference sequence is up to interpretation by allele frequencies, it should be a fixed reference and what Nirvana is getting is the wrong reference base. Can you please elaborate?

GH-MStamboulian commented 10 months ago

@rajatshuvro so you guys aren't using the standard reference sequence hg19? and you have your own version of it?

heseber commented 10 months ago

Dear @GH-MStamboulian, I am sure as a bioinformatician you know what a polymorphism is. Just to recapitulate: there are locations in the genome where different individuals have a different sequence, there is variability of the genomic sequence across the human population. If at least two such different variants occur often enough in the population (a commonly used threshold is more than 1%), it is called a polymorphism, and if the change is only a single base (that is, no insertion or deletion), it is called a single nucleotide polymorphism (SNP). The GRCh37 genomic sequence does not reflect this variability because it defines a single unique nucleotide for each genomic location, whereas in reality the sequence at this location can have different nucleotides with different probabilities across the population. While GRCh37 always represents only a single variant of a polymorphism, which is not always the most common variant in the population, different transcripts from a transcript database can reflect the naturally occurring variability if they are derived from different individuals. So it may happen that, when sequencing transcripts or DNA, the sequence you measure corresponds to the most common variant in the population (which is even most likely), but it is different from the GRCh37 genomic sequence because the GRCh37 has a rare variant of this polymorphism in its sequence. This is the case for your example. If different transcripts from a transcript database represent different variants of a polymorphism, they have a slightly different sequence, for some of these transcripts a detected nucleotide actually means a change of sequence, while for some it means no change. It would be wrong to assign the same change (missense or synononymous or ...) to all transcripts, the correct solution is to compare the detected nucleotide with the sequence of each transcript, not with the sequence of the genome, because the genomic sequence of the GRCh37 genome is in disagreement with the sequence of a transcript if the genome sequence and the transcript sequence represent different variants of a polymorphism. The correct solution for annotating genomic variants is therefore to find all transcripts (isoforms, transcript database entries) that overlap with a particular genomic location, and then to compare the detected sequence (ALT from the VCF file) with the sequence of each of these transcripts, not with the sequence of the genome. In summary, the results provided by Illumina's annotator are correct.

rajatshuvro commented 10 months ago

Thanks again @heseber , I couldn't have said it any better. Hope that clarifies it @GH-MStamboulian .

GH-MStamboulian commented 10 months ago

@heseber thanks for the lesson :) but thats not quite what I was asking or looking for, @rajatshuvro I just think its misleading in Nirvana to name the reference database GRCh37, whereas in fact its not and deviates from the original reference thats published based on base frequencies obtained from a larger population. Can you please provide the reference sequence that Nirvana uses that overlays the suggested transcripts against and makes the annotation calls?

heseber commented 10 months ago

@GH-MStamboulian, there is nothing that is misleading, I have the feeling you just don't get it, although I really tried hard to help you to understand. I give up now, I have no time to elaborate further.

GH-MStamboulian commented 10 months ago

@heseber I never asked you to elaborate anything your just assigning that role to yourself, carry on with your day! @rajatshuvro I just want to understand how internally these annotations are passed? like do you have allele frequencies and compute what should be the representative on the fly or you have a pre-calculated custom reference that use it against? in any case Im wondering if you can provide more details concerning this and if theres a reference sequence and a list of transcripts used for each gene that you can share would be great because everything that gets downloaded are in binary format and I cannot unpack and see.

Illumina / Nirvana

RefSeq Database Version #101