charite / jannovar

Annotation of VCF variants with functional impact and from databases (executable+library)
http://jannovar.readthedocs.io/en/master/
Other
58 stars 35 forks source link

3' Shifting #551

Open lshimp opened 2 years ago

lshimp commented 2 years ago

Hello, I’m sorry if this is not the right place, but I have a question regarding Jannovar and 3’ shifting implementation. We have two variants that are very similar to each other but Jannovar 3’ shifts one and not the other. The variants in question:

Assembly: GRCh37 Assembly: GRCh37 Chromosome: 13 Chromosome: 11 Ref: AG Ref: TG Start: 328934621 Start: 108121798 Alt: A Alt: T Stop: 32893462 Stop: 108121799

Both variants delete a ‘G’ which is the last base of the exon. The first base of the intron is also a ‘G’, so we are expecting both variants to be 3’ shifted into the intron.

The first variant, Jannovar annotates as shifting into the intronic region with the following variant effects: { "splice_donor_variant", "coding_transcript_intron_variant"}

The second variant, Jannovar annotates as remaining in the exon with the following variant effects: { "frameshift_truncation", "splice_region_variant" }

Both variants are at the very end of the exon, but Jannovar only annotates one as 3’ shifting. We are wondering how Jannovar is annotating these variants and why only one is 3’ shifted?

Thanks!

holtgrewe commented 2 years ago

Hi, we implement the following algorithm based on the genome sequence and not the transcript sequence.

https://genome.sph.umich.edu/wiki/Variant_Normalization

I'd suggest writing down the genome sequence flanking left and right of this (e.g., via samtools faidx) and look at what is happening.

Does this help?

CarlosBorroto commented 2 years ago

Hi @holtgrewe, I'm looking into this issue with @lshimp. Thanks for the response.

We are familiar with the vt normalization, although I'm not sure this is playing a role here. At least I don't see anything about 3' shifting of HGVS nomenclature in the wiki you are linking.

Please, notice we are talking about this functionality as described in Jannovar docs

The HGVS Nomenclature for the description fo sequence variants requires that variants are to be shifted towards the 3’ end of transcripts in case of ambiguities. This is in partial conflict with the VCF standard which requires all variant calls to be shifted towards the 3’ end of the genome. In the case that Jannovar shifted the variants towards the 3’ end of the transcript, it will generate a INFO_REALIGN_3_PRIME information in the message field of the annotation (ANN field).

More exactly, we found that this variant returns this output from Jannovar 0.36 using the rest-server:

annotate-var/refseq/hg19/chr13/32893461/AG/A

[{
    "transcriptId": "NM_000059.4",
    "variantEffects": ["splice_donor_variant", "coding_transcript_intron_variant"],
    "isCoding": true,
    "hgvsProtein": "p.?",
    "hgvsNucleotides": "c.316+1del"
}]
hg19-chr13-32893461-AG-A

Notice how Jannovar is 3' shifting all the way into the intron.

However, we found this other very similar variant returns this output instead:

annotate-var/refseq/hg19/chr11/108121798/TG/T

[{
    "transcriptId": "NM_000051.4",
    "variantEffects": ["frameshift_truncation", "splice_region_variant"],
    "isCoding": true,
    "hgvsProtein": "p.(C536Ffs*7)",
    "hgvsNucleotides": "c.1607del"
}, {
    "transcriptId": "NM_001351834.2",
    "variantEffects": ["frameshift_truncation", "splice_region_variant"],
    "isCoding": true,
    "hgvsProtein": "p.(C536Ffs*7)",
    "hgvsNucleotides": "c.1607del"
}]
hg19-chr11-108121798-TG-T

As you can see, in this case Jannovar opted to not fully 3' shift into the intron.

So, our question is why is the 3' shifting feature choosing not to go in to the intronic sequence for the second variant?

holtgrewe commented 2 years ago

Hi thank you for the indepth look. I'll put this on my agenda in the next days.

CarlosBorroto commented 2 years ago

Much appreciated.

I didn't say, but we opted to build the database instead of downloading the ones provided. Nothing crazy, we are using all the refseq files you would expect. We just wanted to pin it to a specific build version to match the rest of our environment.

Sharing in case that could make a difference when trying to reproduce.

; HG19 from RefSeq
[hg19/refseq]
type=refseq
alias=MT,M,chrM
allowNonCodingNm=true
preferPARTranscriptsOnChrX=true
chromInfo=http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz
chrToAccessions=https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.105/Assembled_chromosomes/chr_accessions_GRCh37.p13
gff=https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/annotation_releases/105.20201022/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz
rna=https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/annotation_releases/105.20201022/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_rna.fna.gz
faMT=https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?save=file&db=nuccore&report=fasta&id=251831106
holtgrewe commented 2 years ago

I had to adjust the download paths for gff and rna

[hg19/refseq]
type=refseq
alias=MT,M,chrM
allowNonCodingNm=true
preferPARTranscriptsOnChrX=true
chromInfo=http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz
chrToAccessions=https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.105/Assembled_chromosomes/chr_accessions_GRCh37.p13
gff=https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/annotation_releases/105.20220307/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz
rna=https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/annotation_releases/105.20220307/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_rna.fna.gz
faMT=https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?save=file&db=nuccore&report=fasta&id=251831106

Also, I had to fix the downloaded chromInfo.txt.gz and remove the CRS sequence entry chrMT from chromInfo.txt.gz (ucsc added the rCRS at some point to hg19). Jannovar download still needs a fix for that.

Here is how to do it on the command line with and without 3' shifting.

WITH 3' SHIFTING

# java -jar jannovar-cli/target/jannovar-cli-0.40-SNAPSHOT.jar annotate-pos -d custom-download/hg19_refseq.ser -c 'chr13:32893461AG>A' -c 'chr11:108121798TG>T'  --show-all
Options
JannovarAnnotatePosOptions [genomicChanges=[chr13:32893461AG>A, chr11:108121798TG>T], toString()=JannovarAnnotationOptions [useThreeLetterAminoAcidCode=false, nt3PrimeShifting=false, showAll=true, databaseFilePath=custom-download/hg19_refseq.ser, toString()=JannovarBaseOptions [reportProgress=true, httpProxy=null, httpsProxy=null, ftpProxy=null, verbosity=1]]]
Deserializing transcripts...
INFO Deserializing JannovarData from custom-download/hg19_refseq.ser
INFO Deserialization took 4.26 sec.
#change effect  hgvs_annotation messages
chr13:32893461AG>A      SPLICE_DONOR_VARIANT+CODING_TRANSCRIPT_INTRON_VARIANT   BRCA2:NM_000059.4:c.316+1del:p.?        
chr11:108121798TG>T     FRAMESHIFT_TRUNCATION+SPLICE_REGION_VARIANT,FRAMESHIFT_TRUNCATION+SPLICE_REGION_VARIANT ATM:NM_000051.4:c.1607del:p.(C536Ffs*7),ATM:NM_001351834.2:c.1607del:p.(C536Ffs*7)      .;.

NO 3' SHIFTING

# java -jar jannovar-cli/target/jannovar-cli-0.40-SNAPSHOT.jar annotate-pos -d custom-download/hg19_refseq.ser -c 'chr13:32893461AG>A' -c 'chr11:108121798TG>T'  --show-all --no-3-prime-shifting
Options
JannovarAnnotatePosOptions [genomicChanges=[chr13:32893461AG>A, chr11:108121798TG>T], toString()=JannovarAnnotationOptions [useThreeLetterAminoAcidCode=false, nt3PrimeShifting=true, showAll=true, databaseFilePath=custom-download/hg19_refseq.ser, toString()=JannovarBaseOptions [reportProgress=true, httpProxy=null, httpsProxy=null, ftpProxy=null, verbosity=1]]]
Deserializing transcripts...
INFO Deserializing JannovarData from custom-download/hg19_refseq.ser
INFO Deserialization took 4.27 sec.
#change effect  hgvs_annotation messages
chr13:32893461AG>A      SPLICE_DONOR_VARIANT+CODING_TRANSCRIPT_INTRON_VARIANT   BRCA2:NM_000059.4:c.316+1del:p.?        INFO_REALIGN_3_PRIME
chr11:108121798TG>T     FRAMESHIFT_TRUNCATION+SPLICE_REGION_VARIANT,FRAMESHIFT_TRUNCATION+SPLICE_REGION_VARIANT ATM:NM_000051.4:c.1607del:p.(C536Ffs*7),ATM:NM_001351834.2:c.1607del:p.(C536Ffs*7)      .;.

So it looks like 3' shifting is not properly hooked up, adding #552/#553 to fix this. Now we get proper output when disabling shifting.

# java -jar jannovar-cli/target/jannovar-cli-0.40-SNAPSHOT.jar annotate-pos -d custom-download/hg19_refseq.ser -c 'chr13:32893461AG>A' -c 'chr11:108121798TG>T'  --show-all --no-3-prime-shifting
Options
JannovarAnnotatePosOptions [genomicChanges=[chr13:32893461AG>A, chr11:108121798TG>T], toString()=JannovarAnnotationOptions [useThreeLetterAminoAcidCode=false, nt3PrimeShifting=false, showAll=true, databaseFilePath=custom-download/hg19_refseq.ser, toString()=JannovarBaseOptions [reportProgress=true, httpProxy=null, httpsProxy=null, ftpProxy=null, verbosity=1]]]
Deserializing transcripts...
INFO Deserializing JannovarData from custom-download/hg19_refseq.ser
INFO Deserialization took 4.63 sec.
#change effect  hgvs_annotation messages
chr13:32893461AG>A  FRAMESHIFT_VARIANT+SPLICE_REGION_VARIANT    BRCA2:NM_000059.4:c.316del:p.(G106Efs*15)   .
chr11:108121798TG>T FRAMESHIFT_TRUNCATION+SPLICE_REGION_VARIANT,FRAMESHIFT_TRUNCATION+SPLICE_REGION_VARIANT ATM:NM_000051.4:c.1607del:p.(C536Ffs*7),ATM:NM_001351834.2:c.1607del:p.(C536Ffs*7)  .;.

The next thing is to bow to the hypnotoad^H variantvalidator.org. It is the most reliable tool and the Berlin genetics department's SOP for exome analysis actually says that variants identified should get their final description from VariantValidator.

GRCh37:13:32893461:AG:A   | NM_000059.4:c.316+1del     | NP_000050.3:p.?

GRCh37:11:108121798:TG:T  | NM_000051.4:c.1607+1del    | NP_000042.3:p.?
                          | NM_001351834.1:c.1607+1del | NP_001338763.1:p.? 

So this confirms your bug report. Let's dive a bit deeper.

The relevant code is here.

        // Shift the GenomeChange if lies within precisely one exon.
        if (so.liesInExon(change.getGenomeInterval())) {
            try {
                // normalize amino acid change and add information about this into {@link messages}
                this.change = GenomeVariantNormalizer.normalizeGenomeChange(transcript, change,
                    projector.genomeToTranscriptPos(change.getGenomePos()));
                if (!change.equals(this.change))
                    messages.add(AnnotationMessage.INFO_REALIGN_3_PRIME);
            } catch (ProjectionException e) {
                throw new Error("Bug: change begin position must be on transcript.");
            }
        } else {
            this.change = change;
        }

It might not be obvious from these lines, but the code 3'-shifts the variant based on the transcript sequence. Jannovar does not have access to the genome sequence itself so our assumption always was that users perform genome-wise vt-like 3' shifting and the code above from jannovar-core the only has to apply the 3' prime rule to the transcript and CDS positions.

In other words, the code in jannovar-core cannot 3' shift into the introns as this is not known to the code. The annotate-pos, and annotate-csv commands do not know about the reference sequence, same as the REST API server.

For my use cases, this never has been a problem as I run all my VCFs through bcftools norm first which does vt style variant normalization and genome-based 3' shifting. Of course, I see the point of having this.

What do you think? If you agree with my reasoning my next step would be to create a Github issue for adding an (optional) argument to provide a FAI-indexed FASTA file to annotate-pos, annotate-csv, and rest-server that first does genome-based 3' shifting of the variant when given and then further 3' shifts based on the transcript for the c. position.

CarlosBorroto commented 2 years ago

I might be getting something wrong, but I don't think bcftools norm/vt normalization do 3' shifting. They do left-aligning to the reference forward strand, which translates into "5' shifting" in the forward strand and "3' shifting" in the reverse strand. To the best of our knowledge these variants are "vt normalized" already.

I had mentioned to our team a reason why Jannovar couldn't do the 3' shifting outside of the exon sequence could be because of the lack of full reference sequence. I agree I don't see any other option but to make that available.

We do wonder how is Jannovar able to fully 3' shift GRCh37:13:32893461:AG:A.

holtgrewe commented 2 years ago

You don't get anything wrong, I do.

Let me draft up a couple of tests and then a fix to the code so we can decide based on actual examples.

Thanks!

dmb107 commented 2 years ago

Hi, @holtgrewe, I work on @CarlosBorroto's team and had some follow up information to share as well as a question.

Before contacting you, we were trying to investigate any differences between these two variants to determine why one variant is 3' shifted into the intron while the other variant is not. We noticed a difference in the first nucleotide of the next coding exon. I am unsure if this is related, but I wanted to share in case it was helpful.

Variant annotate-var/refseq/hg19/chr13/32893461/AG/A is the variant that 3' shifts into the G of the intron. The next coding sequence downstream of this position starts with a G. Screen Shot 2022-07-11 at 2 45 00 PM

Variant annotate-var/refseq/hg19/chr11/108121798/TG/T is the variant that does not 3' shift into the intron. The next coding sequence downstream of this position does not start with a G. Screen Shot 2022-07-11 at 2 46 16 PM

We found two other variants that follow this pattern, though this finding can be completely coincidental. However, it did remind us of an HGVS rule that I now wonder whether Jannovar implements. The rule is:

for all descriptions the most 3’ position possible of the reference sequence is arbitrarily assigned to have been changed (3’rule) exception: deletions around exon/exon junctions when identical nucleotides flank the junction (see Numbering); when ..GAT gta..//..cag TCA.. changes to ..GA_ gta..//..cag TCA.., based on a coding DNA reference sequence the variant is described as LRG_199t1:c.3921del (NC_000023.10:g.32459297del) and not as c.3922del (which would translate to g.32456507del)

LRG199t1:c.3921del the deletion of the T nucleotide at the exon/exon border in the sequence ..GAT gta..//..cag TCA.. changing to ..GA gta..//..cag TCA.. NOTE : according to an exception of the 3’rule the variant (NC_000023.10:g.32459297del) is not described as c.3922del since this would shift the position of the variant to the next exon (c. 3922 linking to g.32456507) (see exception in Numbering and see Q&A)

https://varnomen.hgvs.org/recommendations/DNA/variant/deletion/

Does Jannovar implement the 3' shifting exception to prevent a variant from shifting into the next exon? I found a variant that is pushed into the downstream exon, but if I am understanding the HGVS exception correctly, this should not occur. This variant is left aligned to chr1:76199309. Screen Shot 2022-07-11 at 2 22 49 PM

The result of annotate-var/refseq/hg19/chr1/76199309/TG/T:

[
    {
        "transcriptId": "NM_000016.6",
        "variantEffects": [
            "frameshift_variant",
            "splice_region_variant"
        ],
        "isCoding": true,
        "hgvsProtein": "p.(Q130Kfs*20)",
        "hgvsNucleotides": "c.387del"
    },
    {
        "transcriptId": "NM_001127328.3",
        "variantEffects": [
            "frameshift_variant",
            "splice_region_variant"
        ],
        "isCoding": true,
        "hgvsProtein": "p.(Q134Kfs*20)",
        "hgvsNucleotides": "c.399del"
    },
    {
        "transcriptId": "NM_001286042.2",
        "variantEffects": [
            "frameshift_variant",
            "splice_region_variant"
        ],
        "isCoding": true,
        "hgvsProtein": "p.(Q94Kfs*20)",
        "hgvsNucleotides": "c.279del"
    },
    {
        "transcriptId": "NM_001286043.2",
        "variantEffects": [
            "frameshift_variant",
            "splice_region_variant"
        ],
        "isCoding": true,
        "hgvsProtein": "p.(Q163Kfs*20)",
        "hgvsNucleotides": "c.486del"
    },
    {
        "transcriptId": "NM_001286044.2",
        "variantEffects": [
            "five_prime_utr_intron_variant"
        ],
        "isCoding": true,
        "hgvsProtein": "p.(=)",
        "hgvsNucleotides": "c.-100+703del"
    }
]

From the Jannovar cdot output, you can see this variant is annotated in the codon of the next exon: Screen Shot 2022-07-11 at 2 33 18 PM

If I am understanding correctly, then I don't expect the variant to be pushed into the downstream exon. If you agree, I'm happy to open an independent issue, but wanted to start the conversation here in case these events are actually related.

holtgrewe commented 2 years ago

@dmb107 thank you for reporting. I think your understanding is right. Sadly, Jannovar does not implement the rule yet. I think this is a good place to collect the 3' shifting related issues.

dmb107 commented 2 years ago

Thank you for the quick responses! We very much appreciate it.

smw1414 commented 2 years ago

Jannovar applied 3' shifting and classified the following variants as splice_donor_variant.

hg38
chr5:37010219TA>T    (c.4560+1del)
chr13:100102955GA>G  (c.183+1del)
chr3:9441736CA>C     (c.959+1del)
chr9:37432134CTG>C   (c.865+1_865+2del)

However, the Variantvalidator classified as splice_region_variant according to HGVSc (no +1 or +2). I believed that this is related to the problem of 3' shifting as well.