d3b-center / ticket-tracker-OPC

A repo to generate and track tickets for ped OT

v11 Disease IDs incompatible with MTP #418

Closed zdorman closed 2 years ago

zdorman commented 2 years ago

What data file(s) does this issue pertain to?

v11 Somatic Alterations tables for MTP:

What release are you using?

v11

Put your question or report your issue here.

Three disease IDs (EFO IDs) within the v11 Somatic Alterations files are not compatible with the version of MTP in use and cannot be loaded unless changed. I see this as divisible into two issues: short-term and long-term. In the short term, we could consider remapping the data to compatible IDs. In the long term, we need to consider how both CHoP and FNL (via OT) are using disease ontologies to come up with a sustainable solution.

The following diseases will not load into MTP (which uses EFO v3.40.0). The short-term fix is to address these directly:

  1. MONDO_0006085 Wilms tumor
    • Could we use MONDO_0019004 kidney Wilms tumor instead? That is what was used in v10 and still works
    • This ID is present in the newer EFO v3.45.0 and will eventually work in future MTP versions
  2. MONDO_0016731 Desmoplastic infantile astrocytoma and ganglioglioma
    • This was mentioned in #236 as one of the diseases with a blank disease ID within the v10 data
    • This ID is not present in the newer EFO v3.45.0
  3. Orphanet_251601 Diffuse fibrillary astrocytoma
    • This one is new to me
    • This ID is not present in the newer EFO v3.45.0
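
For illustration, the short-term remap could look roughly like the sketch below. It assumes the evidence rows are plain dictionaries with a hypothetical `disease_id` key (the real column name in the Somatic Alterations tables may differ), and the only mapping it contains is the Wilms tumor fallback proposed above, which has not been agreed on yet.

```python
# Proposed in this issue, not a confirmed mapping: fall back to the ID used in v10.
REMAP = {"MONDO_0006085": "MONDO_0019004"}


def remap_disease_ids(rows, remap=REMAP, key="disease_id"):
    """Return copies of rows with any ID found in `remap` swapped for its replacement."""
    return [{**row, key: remap.get(row[key], row[key])} for row in rows]


# `disease_id` is a hypothetical column name used only for this sketch.
rows = [
    {"disease_id": "MONDO_0006085", "disease": "Wilms tumor"},
    {"disease_id": "Orphanet_251601", "disease": "Diffuse fibrillary astrocytoma"},
]
remapped = remap_disease_ids(rows)
```

The second row is left unchanged because no replacement has been proposed for it; the other two diseases would still need a decision.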

The long-term fix is to ensure that new releases don't repeat the issue.

MTP Disease IDs

MTP ingests the OT disease database as-is. In order to load data as evidence into MTP/OT, the evidence must contain a disease ID present in the MTP/OT disease database. For each OT release, OT generates their disease database by ingesting efo_otar_slim.owl, a custom EFO ontology file curated for OT and released by EFO. The customization removes some high-level diseases from the "standard" EFO in order to align it to OT's needs. OT then enriches their disease data with clinical signs and symptoms from HPO and MONDO.

The key is that the disease databases of OT and MTP are not "live updated" with EFO. EFO publishes monthly releases of ontology updates. Approximately every two months, OT ingests the previous month's EFO release to build and publish their newest release. MTP then updates to the newest OT at a further delay. Because of this, we need to base the disease IDs used in OpenPedCan data on the version of EFO expected to be used in the next release of MTP.

Ultimately, using the OT disease database publicly available via FTP will be the best way to QA check whether any diseases present in OpenPedCan releases will not work with MTP.
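
A minimal sketch of that QA check, assuming the OT disease export has been downloaded as JSON Lines with one record per disease and the ID in an `id` field (that layout is an assumption here, as are the function names):

```python
import json


def load_disease_ids(lines):
    """Collect the `id` field from JSON Lines records (assumed export layout)."""
    return {json.loads(line)["id"] for line in lines if line.strip()}


def find_incompatible(evidence_ids, known_ids):
    """Return the evidence disease IDs that are missing from the disease database."""
    return sorted(set(evidence_ids) - known_ids)


# Tiny in-memory stand-in for a downloaded disease file.
sample_db = [
    '{"id": "MONDO_0019004", "name": "kidney Wilms tumor"}',
]
known = load_disease_ids(sample_db)
missing = find_incompatible(["MONDO_0019004", "MONDO_0006085"], known)
print(missing)
```

In practice `lines` would be an open file handle over the real export, and `evidence_ids` the distinct disease IDs from the OpenPedCan tables.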

chinwallaa commented 2 years ago

@zdorman thanks. To clarify: if the MTP data release cadence is 2 months from the EFO release, will that coordinate with the MTP portal refresh? Does the further delay you mention above add days, weeks, or months?

zdorman commented 2 years ago

@chinwallaa The total update delay between EFO release and MTP release is closer to 2-5 months:

EFO -> [1-2 months] -> OT -> [1-3 months] -> MTP

I don't know the time it takes for the NCIT -> MONDO -> EFO process up until that point.
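
As a quick check of the arithmetic, the per-stage ranges quoted in this comment can be summed as below; the stage labels and function name are only for illustration.

```python
# Delay ranges in months, as quoted in this comment.
STAGES = {"EFO -> OT": (1, 2), "OT -> MTP": (1, 3)}


def total_delay(stages):
    """Sum per-stage (min, max) delays into an end-to-end (min, max) range."""
    lows, highs = zip(*stages.values())
    return sum(lows), sum(highs)


print(total_delay(STAGES))
```

The NCIT -> MONDO -> EFO step is left out because its duration is unknown.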

We're working on building a good cadence and improving the OT -> MTP update process. Meshing newer OT versions with MTP is not straightforward, so automatically updating MTP to each new OT has not been a foregone conclusion thus far.

On a related note, OT 22.09 is live as of this morning. They've updated their configuration settings to make ensembl and efo versions much more clear (107 and v3.45.0, respectively). While we don't have the formal go-ahead, I expect that the next release of MTP will use OT 22.09 (and these ID versions).

jharenza commented 2 years ago

Question @zdorman - do we need to rely on OT? Is there a way for FNL to pull the latest from EFO/MONDO on a quicker scheduled basis (ie closer to real-time)?

zdorman commented 2 years ago

@chinwallaa @jharenza I've been discussing that with the FNL team today. The consensus is that misaligning the code and data within the version of OT that MTP is based upon is too much of a stability risk. Newer EFO terms will eventually make their way to MTP, just at a delay that we'll need to coordinate around.

The short-term ask is to resubmit the v11 data with IDs that match EFO v3.40.0 (aka OT 22.04) so that we can ingest. The Wilms tumor evidence in particular affects >30K target-disease evidence combinations that we'd rather not be forced to drop. I'm not sure if the other two diseases have straightforward mappings.

The long-term ask is to expect that the next OpenPedCan release should be compatible with OT 22.09 (ensembl 107 and EFO v3.45.0).

What are you currently using as the resource to assign disease IDs? As mentioned in the main post, I'm not seeing MONDO_0016731 or Orphanet_251601 in recent or future EFO releases.

jharenza commented 2 years ago

Hi @zdorman. It might be worth us scheduling a technical call for this, as we want to make sure we are able to provide the latest mappings for these children and pharma, researchers, etc as quickly as possible so that we can accelerate treatments. We currently use an automated pull from the ontology databases as in this module. That is then manually reviewed and updated as needed.

I see the risk ahead being that we are actively working with MONDO to ingest new cancer groups and subtypes and want to make sure these filter down appropriately as we will start integrating them into plots as soon as we are able. Not having the granularity of subtypes will make informing research and clinical trials a bit difficult. As the first few passes, we have stayed at a broader cancer group level, but within the next few iterations, we need to start pushing toward subtype level cohorts/information.

I suppose another alternative could be to see if we can accelerate the OT workstream, perhaps via someone here or at FNL actively contributing to a PR or maybe we can discuss with them if they can do this in a monthly fashion, which would put us at less of a delay. This would enable us to maintain current EFO ids and we can wait until MTP is updated to ingest those.

Another option: if MTP is going to schedule releases every, say, 6 months, it could pull the latest OT ontology IDs before each release, so that it is always up to date with what OpenPedCan is providing. There has been a lag between our table generation and MTP release for QC etc., so doing the updates during that lag could make sense. Thoughts?

Regarding ENSEMBL 107 - I believe this is using GENCODE 39 - we have updated our gene ID mapping using GENCODE 39, as every RNA-Seq dataset with v12+ will also have this mapping.

chinwallaa commented 2 years ago

@zdorman do you have an estimate of when MTP will have the version below ingested? This is what we are using now with v12, and we can plan the release cycle with MTP accordingly. (@jharenza)

https://www.ebi.ac.uk/ols/ontologies/efo
Ontology IRI: http://www.ebi.ac.uk/efo/efo.owl
Version IRI: http://www.ebi.ac.uk/efo/releases/v3.45.0/efo.owl
Ontology ID: efo
Version: 3.45.0
Number of terms: 36350
Last loaded: Fri Aug 26 06:44:31 BST 2022

zdorman commented 2 years ago

@chinwallaa Thanks for pulling the release together. I'll check with the team to get an estimate for long-term solutions. We're also still discussing internally about how to handle the short-term ID issue within the v11 data and whether/how to release the v11 to the current build of MTP as planned. More on that to come.

@jharenza Understood on your points above. We want to balance the necessity of timely, useful releases with that of stability and sustainability. More to be discussed there.

Thanks for linking the efo-mondo-mapping. It looks like this queries the full EFO, MONDO, and Orphanet ontologies to get IDs - I think this is the source of two of the incompatible IDs listed in the issue above. EFO contains some terms ingested from other ontologies (and uses IDs from those ontologies, such as MONDO_0019004). However, EFO does not ingest all terms from those ontologies (I don't know how they make the decisions).

MTP/OT can only accept terms within the EFO ontology. Any MONDO or Orphanet terms can be ingested only if they are also within EFO. I'd recommend using the efo_slim.owl as published on http://www.ebi.ac.uk/efo/releases/v3.45.0/efo_slim.owl instead of the OLS query. If the OLS query proves the better option, then I'd recommend setting the "ontologies" parameter to only "EFO" .
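
As a rough sketch of the file-based approach, the class IDs in an EFO OWL release can be listed with the Python standard library alone. This assumes the file is RDF/XML with named `owl:Class` elements and that the short-form ID is the last segment of each class IRI; the function name and the inline sample are stand-ins, not part of the current mapping module.

```python
import io
import re
import xml.etree.ElementTree as ET

OWL_CLASS = "{http://www.w3.org/2002/07/owl#}Class"
RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"


def class_ids_from_owl(source):
    """Return short-form IDs (last IRI segment) of every named owl:Class in an RDF/XML file."""
    ids = set()
    for _, elem in ET.iterparse(source):  # streams the file, yielding elements as they close
        if elem.tag == OWL_CLASS:
            iri = elem.get(RDF_ABOUT)  # anonymous classes have no rdf:about and are skipped
            if iri:
                ids.add(re.split(r"[/#]", iri)[-1])
    return ids


# Tiny in-memory stand-in for a downloaded OWL file.
sample = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/MONDO_0019004"/>
</rdf:RDF>"""
ids = class_ids_from_owl(io.BytesIO(sample))
```

`class_ids_from_owl` also accepts a path to a local copy of the OWL file, so the resulting set can be used directly as the allow-list when assigning disease IDs.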

As for ENSEMBL 107: If I'm reading https://www.gencodegenes.org/human/releases.html correctly, then 107 will be using GENCODE 41 (and Genome assembly GRCh38.p13).

chinwallaa commented 2 years ago

@zdorman With the v11 issue you identified with the CHOP P30 data, wrt KMT2C gene - the v11 data is still going thru QC (issue has been tracked down to the source derived data provided, and not the OpenPedCan/MTP methods/analysis ) - we can discuss during the next team mtg whether we should continue with v11 for an MTP release, or wait until a v12 release (Nov/Dec).

zdorman commented 2 years ago

@chinwallaa Thanks for the update on KMT2C.

I realize my response to your EFO versioning question was a bit ambiguous - sorry about that. To clarify: we're still deciding when we will release a version of MTP that is updated to OT 22.09 (unofficially, it's on the scale of weeks to months). But that release will use the EFO v3.45.0 version that you identified. This will be the correct version to use for v12. (Technically MTP/OT will use the v3.45.0 efo_slim, but I think the differences between the efo and efo_slim are likely negligible for our disease naming purposes.)

chinwallaa commented 2 years ago

Thanks @zdorman - + @sangeetashukla @taylordm as FYI

wintercl commented 2 years ago

@chinwallaa Hi Asif. Regarding your post above: "With the v11 issue you identified with the CHOP P30 data, wrt KMT2C gene - the v11 data is still going thru QC". By QC, do you mean on your end (my assumption), or are you referring to the DGD's investigation of the issue? So DGD would make corrections to the data, and then you will reprocess the data as well, correct? Do you have an estimate of how long the v11 QC will take?

chinwallaa commented 2 years ago

By QC, do you mean on your end (my assumption)

Completed

DGD's investigation of the issue

In progress

DGD would make corrections to the data, and then you will reprocess the data as well, correct?

Yes

Do you have an estimate of how long the v11 QC will take?

1-2 weeks for QC, then a couple more to make corrections, and then we need to re-process, either as a v11 update or with the next v12 release

jharenza commented 2 years ago

@zdorman we would not make any updates to v11, but plan to have this in the v12 release