Illumina / Nirvana

The nimble & robust variant annotator
https://illumina.github.io/NirvanaDocumentation/
GNU General Public License v3.0

[Feature Request] Additional info during ClinVar parsing #83

Open olingerc opened 2 years ago

olingerc commented 2 years ago

Dear Nirvana team,

I'm sorry to misuse the issue tracker for a feature request. I was not sure how best to approach you.

Thanks for the detailed information on how you compile the ClinVar entries (here). Quite often we have many ClinVar entries at a single position. Even after reducing them to those with isAlleleSpecific set, it is sometimes difficult to understand which entries are relevant to our variant. This is especially true for ClinVar entries that relate to variants at multiple sites, meaning they only make sense when several variants are present together (a haplotype). This information is stored in the Measure and GenotypeSet fields. Would it be possible to at least include Measure? The example below from your documentation displays "single nucleotide variant", but we would like to identify cases where this value is "Haplotype" or "Genotype". That way we could remove VCVs that only make sense when all of their variants are present.

<GenotypeSet Type="CompoundHeterozygote" ID="424709">
   <MeasureSet Type="Variant" ID="81">
       <Measure Type="single nucleotide variant" ID="15120">
        <SequenceLocation Assembly="GRCh38" AssemblyAccessionVersion="GCF_000001405.38"
          AssemblyStatus="current" Chr="10" Accession="NC_000010.11" start="89222510"
          stop="89222510" display_start="89222510" display_stop="89222510" variantLength="1"
          positionVCF="89222510" referenceAlleleVCF="C" alternateAlleleVCF="T"/>
        <SequenceLocation Assembly="GRCh37" AssemblyAccessionVersion="GCF_000001405.25"
          AssemblyStatus="previous" Chr="10" Accession="NC_000010.10" start="90982267"
          stop="90982267" display_start="90982267" display_stop="90982267" variantLength="1"
          positionVCF="90982267" referenceAlleleVCF="C" alternateAlleleVCF="T"/>
       </Measure>
   </MeasureSet>
 </GenotypeSet>

MichaelStromberg commented 2 years ago

Thanks Christophe! I brought this up with the team during this morning's stand-up meeting. We'll investigate how this is represented in the XML file so that we can provide useful haplotype information.

rajatshuvro commented 2 years ago

Hi @olingerc, can you point me to a ClinVar record (RCV) that says 'Haplotype' or 'Genotype' in Measure?

It would be really helpful if you could describe a set of RCVs that are connected via this mechanism, and explain in more detail which fields indicate the relationship and how. In short, I am asking for a description of your use case with real examples so that we can better understand the feature you are requesting.

Thanks.

olingerc commented 2 years ago

Hi @rajatshuvro,

An example variant would be: 1-171076966-G-A

Nirvana gives me the following ClinVar list (v3.18.1): [image]

There are a total of 3 different (alleleSpecific) VCVs:

However, the ClinVar pages of the two pathogenic variants (here and here) make it obvious that they are only pathogenic when coupled with another variant (Haplotype).

It would be very helpful if we had the "Haplotype" information. It is stored in the MeasureSet element.

<MeasureSet Type="Haplotype" ID="217371" Acc="VCV000217371" Version="1">
</MeasureSet>

(Extracted from the full XML.) If I read your code correctly, you already almost read this information here.

Here are all possible values:

    <xs:simpleType name="Measuresettypelist">
        <xs:restriction base="xs:string">
            <xs:enumeration value="Gene"/>
            <xs:enumeration value="Variant"/>
            <xs:enumeration value="Haplotype"/>
            <xs:enumeration value="Phase unknown"/>
            <xs:enumeration value="Distinct chromosomes"/>
        </xs:restriction>
    </xs:simpleType>
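
A minimal sketch of how the MeasureSet type could be read and used as a filter, written in Python with the standard library rather than in Nirvana's own code. The function names, the `MULTI_ALLELE_TYPES` set, and the wrapping `<ClinVarSet>` element in the sample are assumptions for illustration; treating every value other than "Gene" and "Variant" as multi-allele is also an assumption, not something ClinVar or Nirvana states.

```python
import xml.etree.ElementTree as ET

# Assumption: these Measuresettypelist values describe records that involve
# more than one allele; "Gene" and "Variant" are treated as single-site.
MULTI_ALLELE_TYPES = {"Haplotype", "Phase unknown", "Distinct chromosomes"}


def measure_set_types(xml_text):
    """Return (accession, type) for every MeasureSet element in xml_text."""
    root = ET.fromstring(xml_text)
    return [(ms.get("Acc"), ms.get("Type")) for ms in root.iter("MeasureSet")]


def is_multi_allele(measure_set_type):
    """True if the MeasureSet type implies that several variants are required."""
    return measure_set_type in MULTI_ALLELE_TYPES


# Reduced fragment: only the MeasureSet attributes quoted above are real;
# the enclosing element is a hypothetical stand-in for the full record.
SAMPLE = """<ClinVarSet>
  <MeasureSet Type="Haplotype" ID="217371" Acc="VCV000217371" Version="1"/>
</ClinVarSet>"""

if __name__ == "__main__":
    for accession, ms_type in measure_set_types(SAMPLE):
        print(accession, ms_type, is_multi_allele(ms_type))
```

With a flag like this in the annotation output, a downstream filter could drop VCVs whose type requires variants that are not all present in the sample.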

A bonus would be knowing which other variant is in the haplotype. A quick fix would be to extract it from the title:

<ClinVarResult-Set>
   <ClinVarSet ID="101183654">
      <RecordStatus>current</RecordStatus>
         <Title>
            NM_006894.4(FMO3):c.[472G>A;560T>C] AND Trimethylaminuria
         </Title>
         <ReferenceClinVarAssertion ID="477812" DateLastUpdated="2022-06-24" DateCreated="2015-10-30">
...

Within the brackets, we see the second variant identified. Having the full list of variants would of course be nice as well, but I guess that would mean more changes to your code.
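
The title-based quick fix could look roughly like the sketch below. It assumes that haplotype titles always list their member alleles inside one pair of square brackets, separated by semicolons, as in the FMO3 example; the function name `haplotype_members` is hypothetical, and titles without brackets simply yield an empty list.

```python
import re

# Matches the first "[...]" group in a title, e.g. "[472G>A;560T>C]".
BRACKETED = re.compile(r"\[([^\]]+)\]")


def haplotype_members(title):
    """Return the alleles listed inside the square brackets of a ClinVar title."""
    match = BRACKETED.search(title)
    if match is None:
        return []  # no bracketed allele list, so not a haplotype-style title
    return [part.strip() for part in match.group(1).split(";")]


if __name__ == "__main__":
    title = "NM_006894.4(FMO3):c.[472G>A;560T>C] AND Trimethylaminuria"
    print(haplotype_members(title))
```

This only recovers the coding-level (c.) descriptions from the title, not genomic coordinates, so it would serve as a hint for manual review rather than as a way to check whether the partner variant is present in the VCF.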

Thanks for considering the request!

Here is the corresponding line from a vcf file:

chr1    171076966   .   G   A   128.49  PASS    AC=2;AF=0.333;AN=6;DP=116;FS=4.083;MQ=250;MQRankSum=6.805;QD=1.4;ReadPosRankSum=3.267;SOR=0.346 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DN 0/0:28,0:0:24:63:PASS:.:.:0,63,945:.:0,74,260:. 0/1:13,16:0.552:29:48:PASS:7,8:6,8:85,0,49:50,6.9375e-05,52.227:128,0,54:.  0/1:33,30:0.476:63:48:PASS:14,12:19,18:84,0,50:49.643,6.8857e-05,53:84,0,124:Inherited

rajatshuvro commented 2 years ago

Thanks @olingerc. We are actively considering this as an upcoming feature.