BIH-CEI / ERKER2Phenopackets

A pipeline of ERKER data into the phenopackets data structure.
MIT License
4 stars 1 forks source link

Cannot determine variant coordinate from genomic interpretation #178

Closed frehburg closed 1 year ago

frehburg commented 1 year ago

ValueError: Cannot determine variant coordinate from genomic interpretation Hey @graefea, es gibt da leider einen bug anscheinend in unseren phenopackets. Ich denke es liegt an den Rohdaten aus dem ERKER. Auf jeden fall mag phenopackets irgendetwas an der Art wie wir die Koordinaten der Variante aufgeschrieben haben. Schau dir das bitte mal an und sie ob du daraus schlauwerden kannst.

LG,

Filip

P.S. Die Fehlermeldung:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 5
      3 patient_creator = configure_caching_patient_creator(hpo)
      4 phenopacket_input_folder = r'..\data\out\phenopackets\example-phenopackets-from-synthetic-data'
----> 5 patient_list = load_phenopacket_folder(pp_directory=phenopacket_input_folder, patient_creator=patient_creator)

File ~\anaconda3\envs\ERKER2Phenopackets\lib\site-packages\genophenocorr\preprocessing\_phenopacket.py:185, in load_phenopacket_folder(pp_directory, patient_creator)
    182     raise ValueError(f"No JSON Phenopackets were found in {pp_directory}")
    184 # turn phenopackets into patients using patient creator
--> 185 patients = [patient_creator.create_patient(pp) for pp in pps]
    187 # create cohort from patients
    188 return Cohort.from_patients(patients)

File ~\anaconda3\envs\ERKER2Phenopackets\lib\site-packages\genophenocorr\preprocessing\_phenopacket.py:185, in <listcomp>(.0)
    182     raise ValueError(f"No JSON Phenopackets were found in {pp_directory}")
    184 # turn phenopackets into patients using patient creator
--> 185 patients = [patient_creator.create_patient(pp) for pp in pps]
    187 # create cohort from patients
    188 return Cohort.from_patients(patients)

File ~\anaconda3\envs\ERKER2Phenopackets\lib\site-packages\genophenocorr\preprocessing\_phenopacket.py:108, in PhenopacketPatientCreator.create_patient(self, item)
    100 """Creates a Patient from the data in a given Phenopacket
    101 
    102 Args:
   (...)
    105     Patient: A Patient object
    106 """
    107 phenotypes = self._add_phenotypes(item)
--> 108 variants = self._add_variants(item)
    109 protein_data = self._add_protein_data(variants)
    110 return Patient(item.id, phenotypes, variants, protein_data)

File ~\anaconda3\envs\ERKER2Phenopackets\lib\site-packages\genophenocorr\preprocessing\_phenopacket.py:124, in PhenopacketPatientCreator._add_variants(self, pp)
    122 if hasattr(interp, 'diagnosis') and interp.diagnosis is not None:
    123     for genomic_interp in interp.diagnosis.genomic_interpretations:
--> 124         vc = self._coord_finder.find_coordinates(genomic_interp)
    125         variant = self._func_ann.annotate(vc)
    126         variants_list.append(variant)

File ~\anaconda3\envs\ERKER2Phenopackets\lib\site-packages\genophenocorr\preprocessing\_phenopacket.py:73, in PhenopacketVariantCoordinateFinder.find_coordinates(self, item)
     70 genotype = variant_descriptor.allelic_state.label
     72 if any(field is None for field in (chrom, ref, alt, genotype)):
---> 73     raise ValueError(f'Cannot determine variant coordinate from genomic interpretation {item}')
     74 return VariantCoordinates(chrom, start, end, ref, alt, len(alt) - len(ref), genotype)

**ValueError: Cannot determine variant coordinate from genomic interpretation** subject_or_biosample_id: "0"
interpretation_status: CONTRIBUTORY
variant_interpretation {
  variation_descriptor {
    id: "id:A"
    expressions {
      syntax: "hgvs"
      value: "NP_005903.2:p.(Val103Ile)"
    }
    expressions {
      syntax: "hgvs"
      value: "NM_005912.3:c.181G>T"
    }
    allelic_state {
      id: "GENO:0000135"
      label: "heterozygous"
    }
  }
}
frehburg commented 1 year ago

After this is fixed we can continue on #174

aslgraefe commented 1 year ago
ValueError: Cannot determine variant coordinate from genomic interpretation subject_or_biosample_id: "36"
interpretation_status: CONTRIBUTORY
variant_interpretation {
  variation_descriptor {
    id: "id:A"
    expressions {
      syntax: "hgvs"
      value: "NP_005903.2:p.(Arg165Gln)"
    }
    expressions {
      syntax: "hgvs"
      value: "NM_005912.3:c.335C>T"
    }
    allelic_state {
      id: "GENO:0000136"
      label: "homozygous"
    }
  }
}
aslgraefe commented 1 year ago
ValueError: Cannot determine variant coordinate from genomic interpretation subject_or_biosample_id: "41"
interpretation_status: CONTRIBUTORY
variant_interpretation {
  variation_descriptor {
    id: "id:A"
    expressions {
      syntax: "hgvs"
      value: "NP_005903.2:p.(Val103Ile)"
    }
    expressions {
      syntax: "hgvs"
      value: "NM_005912.3:c.496G>C"
    }
    allelic_state {
      id: "GENO:0000135"
      label: "heterozygous"
    }
  }
}
aslgraefe commented 1 year ago
ValueError: Cannot determine variant coordinate from genomic interpretation subject_or_biosample_id: "7"
interpretation_status: CONTRIBUTORY
variant_interpretation {
  variation_descriptor {
    id: "id:A"
    expressions {
      syntax: "hgvs"
      value: "NP_005903.2:p.(Ser85Gly)"
    }
    expressions {
      syntax: "hgvs"
      value: "NM_005912.3:c.751A>C"
    }
    allelic_state {
      id: "GENO:0000135"
      label: "heterozygous"
    }
  }
}
aslgraefe commented 1 year ago
ValueError: Cannot determine variant coordinate from genomic interpretation subject_or_biosample_id: "17"
interpretation_status: CONTRIBUTORY
variant_interpretation {
  variation_descriptor {
    id: "id:A"
    expressions {
      syntax: "hgvs"
      value: "NP_005903.2:p.(Ile251Leu)"
    }
    expressions {
      syntax: "hgvs"
      value: "NM_005912.3:c.253A>G"
    }
    allelic_state {
      id: "GENO:0000136"
      label: "homozygous"
    }
  }
}
aslgraefe commented 1 year ago

It seems that the genophenocorr expects a VCF format as variationDescriptor. However, we only have HGVS values

frehburg commented 1 year ago

@ielis Could you please take a look at this? These errors are being generated in the notebook MC4R.ipynb (on branch #174) by this code:

from genophenocorr.preprocessing import configure_caching_patient_creator
from genophenocorr.preprocessing import load_phenopacket_folder
patient_creator = configure_caching_patient_creator(hpo)
phenopacket_input_folder = r'..\data\out\phenopackets\example-phenopackets-from-synthetic-data'
patient_list = load_phenopacket_folder(pp_directory=phenopacket_input_folder, patient_creator=patient_creator)
ielis commented 1 year ago

Hi @graefea @frehburg I think your code is OK, the issue is that we do not yet have logic to go from HGVS annotation (e.g. NM_005912.3:c.253A>G) to genomic coordinates (e.g. chr18:60372097T>C). Do you think you could help out with this?

In the genophenocorr's framework, this would mean implementing genophenocorr.preprocessing.VariantCoordinateFinder to go from HGVS to genophenocorr.model.VariantCoordinates. Having VariantCoordinates will allow the rest of the framework to work.

We can use Ensembl's REST API to go from HGVS to genomic coordinate space. Here is an example query for the variant above, with a Python postprocessing to prettify the output:

curl 'https://rest.ensembl.org/vep/human/hgvs/NM_005912.3:c.253A>G?refseq=1' \
-H 'Content-type:application/json' | python3 -m json.tool > hgvs_MC4R_missense.json

We can obtain the bits required to construct VariantCoordinates from the output.

I prepared a stub and some unit tests in #75 (branch get-variant-coordinates-from-hgvs). Can you please have a look and see if you can help with this?

Thanks a lot..