BIH-CEI / ERKER2Phenopackets

A pipeline of ERKER data into the phenopackets data structure.
MIT License
4 stars 1 forks source link

Daniel's comments #125

Closed aslgraefe closed 1 year ago

aslgraefe commented 1 year ago

1) Right now this handles one variant per case/phenopacket, right? However, will this always be the case for the ERKER format? How about diseases with autosomal recessive mode of inheritance, where a pair of heterozygous variants can be disease causing (compound heterozygosity)? In that case, we need to create 2 genomic interpretations, one per variant, and I am not sure the current architecture would allow that.

2) genomic_interpretation_variant = GenomicInterpretation( subject_or_biosample_id=phenopacket_id, interpretation_status=interpretation_status, variant_interpretation=variant_interpretation ) I believe we should only keep the genomic interpretation with the variant data (above). In this setting, where we have short variants (single nucleotide polymorphisms (SNP) or short insertions/deletions), we can infer the gene from the variant data. In general, the gene field of the GenomicInterpretation should be used if we do not have the variant data, for instance, to represent the candidate or the most likely causal gene.

3) if we are including the diagnosis in the phenopacket, then we must add the disease field - a required field of the Diagnosis element. The disease is an ontology class, so we need an id (e.g. MONDO:0019115) and a label (e.g. obesity due to melanocortin 4 receptor deficiency) as in http://purl.obolibrary.org/obo/MONDO_0019115. Adam, Filip, please select the disease code! genomic_interpretations=[ please delete the comment and the line below if you agree with dropping the gene interpretation. -> # genomic_interpretation_gene,

4) Generates a random str like 4d062c1e-ea58-4ad9-8307-b7d07fe6b0ab consider setting progress_status (https://phenopacket-schema.readthedocs.io/en/latest/interpretation.html)

progressstatus

Right now it is set to the default value = UNKNOWN_PROGRESS. However, we do have the diagnosis, which is inconsistent with the unknown progress. I think this is a clinical question. Adam, can we consider the variants as causal/contributory (per https://phenopacket-schema.readthedocs.io/en/latest/genomic-interpretation.html#interpretationstatus) to the disease? If yes, then we can make the diagnosis and set the progress status, right?

ielis commented 1 year ago

Yes, I believe the variant data is superior to that of a gene.

aslgraefe commented 1 year ago

1) Right now this handles one variant per case/phenopacket, right? However, will this always be the case for the ERKER format? How about diseases with autosomal recessive mode of inheritance, where a pair of heterozygous variants can be disease causing (compound heterozygosity)? In that case, we need to create 2 genomic interpretations, one per variant, and I am not sure the current architecture would allow that.

ANSWER: In the new version of the ERKER_v1.6 we allow these choices for the zygosity: Homozygous // (simple) Heterozygous // Compound heterozygous // Double heterozygous // Hemizygous // Not recorded

  • we also allow multiple genomic interpretations: 3 clinical relevant variants and 5 genetic side variants. Therefore, we could allow a compound heterozygous patient to be captured with both clinical relevant variants.
  • However, for now the ERKER form only allows zygosity of the main diagnosis i.e. the main disease causing mutation. For this analysis with MC4R-deficiency this will be sufficient and we will stay with v1.6.
  • as much as I understand correctly, you would suggest adding zygosity to all genetic clinical relevant variants and side variants, right? I noted this and we could implenet this into our version v1.7.
aslgraefe commented 1 year ago

2) genomic_interpretation_variant = GenomicInterpretation( subject_or_biosample_id=phenopacket_id, interpretation_status=interpretation_status, variant_interpretation=variant_interpretation ) I believe we should only keep the genomic interpretation with the variant data (above). In this setting, where we have short variants (single nucleotide polymorphisms (SNP) or short insertions/deletions), we can infer the gene from the variant data. In general, the gene field of the GenomicInterpretation should be used if we do not have the variant data, for instance, to represent the candidate or the most likely causal gene.

ANSWER: we dropped out the 'gene' field of GenomicInterpretation

aslgraefe commented 1 year ago

3) if we are including the diagnosis in the phenopacket, then we must add the disease field - a required field of the Diagnosis element. The disease is an ontology class, so we need an id (e.g. MONDO:0019115) and a label (e.g. obesity due to melanocortin 4 receptor deficiency) as in http://purl.obolibrary.org/obo/MONDO_0019115. Adam, Filip, please select the disease code! genomic_interpretations=[ please delete the comment and the line below if you agree with dropping the gene interpretation. -> # genomic_interpretation_gene,

ANSWER: we changed the disease code and label to the Diagnosis (Interpretation) block.

aslgraefe commented 1 year ago

4) Generates a random str like 4d062c1e-ea58-4ad9-8307-b7d07fe6b0ab consider setting progress_status (https://phenopacket-schema.readthedocs.io/en/latest/interpretation.html) #progressstatus Right now it is set to the default value = UNKNOWN_PROGRESS. However, we do have the diagnosis, which is inconsistent with the unknown progress. I think this is a clinical question. Adam, can we consider the variants as causal/contributory (per https://phenopacket-schema.readthedocs.io/en/latest/genomic-interpretation.html#interpretationstatus) to the disease? If yes, then we can make the diagnosis and set the progress status, right?

ANSWER: we adapted the interpretation_status to 'CONTRIBUTORY' and the progress_status to 'SOLVED'.

aslgraefe commented 1 year ago

@ielis (& @frehburg) please find the answers and implementations to each comments. Thank you for your comments!

ielis commented 1 year ago

Regarding my point 1) above - I only see adding one genomic interpretation to the Interpretation > Diagnosis element here. Do you plan to keep just one variant for this dataset? That's OK if there truly is just 1 variant per patient. However, I don't think this will necessarily work for other datasets, e.g. AR diseases with 2 compound heterozygous variants. Most likely, the entire function _map_interpretation will have to be updated to accommodate the change of the variant cardinality from 1 per interpretationto1..*`.

as much as I understand correctly, you would suggest adding zygosity to all genetic clinical relevant variants and side variants, right? I noted this and we could implenet this into our version v1.7.

Regarding adding zygosity - you're setting the allelic_state field for the VariationDescriptor, which is the right thing to do, so you should be all set there. No changes needed. I see this in the phenopackets:

"allelicState": {
   "id": "GENO:0000135",
   "label": "heterozygous"
}

which is correct! :)

aslgraefe commented 1 year ago

Dear Daniel, thank you for your answer. Well, we sometimes also have two (once three) mutations per patient - but considered these to be within the same genomicInterpretation as different expressions like so:

"genomicInterpretations": [
          {
            "subjectOrBiosampleId": "21",
            "interpretationStatus": "CONTRIBUTORY",
            "variantInterpretation": {
              "variationDescriptor": {
                "id": "id:A",
                "expressions": [
                  {
                    "syntax": "hgvs",
                    "value": "NP_005903.2:p.Val103Ile"
                  },
                  {
                    "syntax": "hgvs",
                    "value": "NM_005912.3:c.307G>A"
                  }
                ],
                "allelicState": {
                  "id": "GENO:0000135",
                  "label": "heterozygous
      }
              }
            }
          }
        ]

Is this still valid? And regarding the Geno-Pheno-correlating, does the tool allow the multiple expressions within one genomicInterpretation?

All the best, Adam

ielis commented 1 year ago

Hi Adam, no, I think this is not the right way to do this. First, the expressions should represent descriptions of the same variant in different notations/nomenclatures.

For instance, the variant with the following genomic coordinates:

can be also described using HGVS syntax with respect to the coding DNA of the gene (c.):

NM_000088.3:c.589G>T

or with respect to the genomic reference sequence (g., also HGVS):

NG_007400.1:g.8638G>T

NG_007400.1 is a GenBank entry with COL1A1 sequence

or SPDI notation

GRCh38:17:50198002:C:A

So, expressions represent different notations of one variant, not several variants in the subject/patient.

You must encode each variant into its own GenomicInterpretation element and set it into genomic_interpretations field of the corresponding Diagnosis.

And regarding the Geno-Pheno-correlating, does the tool allow the multiple expressions within one genomicInterpretation?

Genophenocorr considers variants from all genomic interpretations of the phenopacket in the analysis.

aslgraefe commented 1 year ago

Hi Daniel,

Thank you very much for your comment and explanation.

You must encode each variant into its own GenomicInterpretation element and set it into 
genomic_interpretations field of the corresponding Diagnosis.

I will implement this tomorrow morning right away.

Genophenocorr considers variants from all genomic interpretations of the phenopacket in the analysis.

Perfect, thank you. Can you talk tomorrow any time for an introduction?

ielis commented 1 year ago

Hi Adam, the repo is at https://github.com/monarch-initiative/genophenocorr, we recently made it public.

We are working on making a PyPi release in these days, and there are online docs as well as some example notebooks (e.g. for analysis of KBG variants here).

However, the docs and notebooks are still work in progress. So, feel free to look at the code, but please be aware that the material may be a bit outdated. We'll be fixing this in the coming days..

aslgraefe commented 1 year ago

Regarding adding zygosity - you're setting the allelic_state field for the VariationDescriptor, which is the right thing to do, so you should be all set there. No changes needed. I see this in the phenopackets:

Hi Daniel,

we have updated the form to version v1.7. As we sometimes have more variants, each variant has its own zygosity now.

aslgraefe commented 1 year ago

Hi Daniel,

You must encode each variant into its own GenomicInterpretation element and set it into genomic_interpretations field of the corresponding Diagnosis.

We have implemented this, so that each variant has its own GenomicInterpretation element with its mutations with zygosity.

Thank you!