geneontology / neo

noctua entity ontology

ChEBI : limit load to ChEBI that have 'UniProt' synonyms #78

Open pgaudet opened 2 years ago

pgaudet commented 2 years ago

This would vastly reduce the number of ChEBI terms to choose from, and would make sure we use the pH 7.3 forms.

Thanks, Pascale

@kltm

kltm commented 2 years ago

It looks like there is already a limited set from chebi coming in to the "NEO" build (~23k) versus the regular GO release build (~177k). Maybe that's what's already coming in on imports? I'm not sure there is actually anything in the current build process to shave that down more, as we're only examining GPIs and GAFs to produce this.

cmungall commented 2 years ago

The correct place to handle this is upstream. go-lego.owl imports go-plus, which uses a chebi_import determined by the editors file.

This is likely both too small (doesn't have any terms that have not been used in the ontology) and too large (includes protonation variants).

Unfortunately simply limiting to the 7.3 forms will have issues since the hierarchy for any one protonation form is often incomplete, and you need all branches with the GCIs to get a complete hierarchy (if that sounds strange and complex, that is because it is).

My preference would be to first scope out more complete requirements for what we want and don't want in chebi and then prioritize a project based on this. For example, in addition to having a canonical protonation state, we want the labels to be intuitive and searchable, we want to ensure that curators are consistent in the level they choose (e.g. L vs D form), and we want to simplify the process of using CHEBI in the ontology, and simplify things for users who might want to use CHEBI and GO together.

We can explore a hack in go-lego that subtracts from the chebi terms in go-plus but I think this will lead to marginal gain at high complexity cost.

pgaudet commented 2 years ago

This is the file that RHEA uses: https://ftp.expasy.org/databases/rhea/tsv/chebiId_name.tsv (about 11k)

It would be useful to know how many chemicals we'd be missing if we used this.

Thanks, Pascale

deustp01 commented 2 years ago

It would be useful to know how many chemicals we'd be missing if we used this.

Once I figure out how to do it, I will check the RHEA list against all the ChEBI IDs in Reactome. (If someone reading this knows how, that would be great!)

kltm commented 2 years ago

@deustp01 Is there a good source for that information? If I just munge through reacto.owl:

grep -oh 'CHEBI_[0-9]*' reacto.owl | sort | uniq | sed 's/_/:/' > reacto_chebi.txt

With @pgaudet's file above, I can extract:

grep -oh 'CHEBI:[0-9]*' chebiId_name.tsv | sort | uniq > reacto_rhea.txt

File sizes compare at:

sjcarbon@moiraine:/tmp$:) wc -l reacto_*
  1978 reacto_chebi.txt
 10226 reacto_rhea.txt
 12204 total

deustp01 commented 2 years ago

@kltm The attached tab-delimited text file contains entries for the reference form of every chemical known to Reactome (including un-released ones), one row for each chemical. ("Reference" means the information we get from an external reference resource, almost always ChEBI, and which we use to construct "working" instances by adding subcellular location information - so there's only one water reference but many working forms differing by location.) The first entry in each row is the chemical's name; the second is its identifier in the reference resource.

If you just omitted all the rows whose identifier does NOT start with ChEBI, that would be OK - there aren't many, and basically if we can't specify something well enough to get a ChEBI identifier for it, it's not well enough specified for GO-CAM either.

Adding @ukemi for a sanity check.

Reactome_ChEBI_list.txt
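
The filtering described above (keep only rows whose identifier starts with ChEBI) could be sketched roughly as follows. This is a minimal sketch, not a tested recipe for the attached file: it assumes the second tab-delimited column looks like `ChEBI:<digits>`, and the function name `chebi_ids_from_tsv` is made up for illustration.

```shell
# Hypothetical helper: read a two-column, tab-delimited file (name, identifier)
# and print the unique identifiers that start with "ChEBI:", rewritten with the
# upper-case "CHEBI:" prefix so they can be compared with other ID lists.
# Assumption: identifiers look like ChEBI:<digits>; other rows are dropped.
chebi_ids_from_tsv() {
    awk -F '\t' '
        { sub(/\r$/, "", $2) }             # tolerate Windows line endings
        $2 ~ /^ChEBI:[0-9]+$/ {
            sub(/^ChEBI/, "CHEBI", $2)     # normalize the prefix
            print $2
        }' "$1" | LC_ALL=C sort -u
}

# Example use with the attachment name from this comment (only if it exists):
if [ -f Reactome_ChEBI_list.txt ]; then
    chebi_ids_from_tsv Reactome_ChEBI_list.txt > reactome_chebi_ids.txt
fi
```

Matching on the column rather than grepping the whole line avoids picking up identifiers that happen to appear inside a chemical's name.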

kltm commented 2 years ago

@deustp01 Processing that file in a similar way:

grep -oh 'ChEBI:[0-9]*' Reactome_ChEBI_list.txt | sort | uniq | sed 's/ChEBI/CHEBI/' > reacto_reacto.txt

sjcarbon@moiraine:/tmp$:) wc -l reacto_rhea.txt reacto_reacto.txt 
 10226 reacto_rhea.txt
  7071 reacto_reacto.txt

So, like 3k short. Diff output looks like: https://gist.github.com/kltm/f7294fcf771cf00eada192b9734ac8ed (~10k lines)
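
Line counts only compare the sizes of the two lists; they do not say how many Reactome IDs are actually absent from the Rhea list. One way to get that number is a set difference with `comm`, sketched below under the assumption that both files hold one `CHEBI:<digits>` ID per line. The function name `missing_from` is made up for illustration.

```shell
# Hypothetical helper: print IDs present in the first list but absent from
# the second. Inputs are re-sorted because comm requires both files to be
# sorted in the same collation order.
missing_from() {
    sorted_a=$(mktemp)
    sorted_b=$(mktemp)
    LC_ALL=C sort -u "$1" > "$sorted_a"
    LC_ALL=C sort -u "$2" > "$sorted_b"
    # -2 drops lines unique to the second file, -3 drops lines common to both
    LC_ALL=C comm -23 "$sorted_a" "$sorted_b"
    rm -f "$sorted_a" "$sorted_b"
}

# Example with the file names used in this thread (only if both exist):
# count the Reactome IDs that do not appear in the Rhea list.
if [ -f reacto_reacto.txt ] && [ -f reacto_rhea.txt ]; then
    missing_from reacto_reacto.txt reacto_rhea.txt | wc -l
fi
```

Swapping the two arguments gives the opposite direction: Rhea IDs that Reactome does not use.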

ukemi commented 2 years ago

I think one question that remains is how to handle entities from imported sources like this and build a robust and complete entity ontology for use in models. In this case Reactome is the straw man, but there have been proposals to do this with other resources as well. I think (correct me if I am wrong) that the plan for Reactome proteoforms and complexes is to move towards using PRO. So there is an ontology for that. We should be able to distinguish location for the Reactome entities using the PRO ids, existing relations and GO cellular components. I would think this could be extended to ChEBI entities, existing relations and GO cellular components.

The question that I still have with respect to this exact ticket is whether Reactome expects all the mismatches to eventually be mapped to Rhea and be incorporated into ChEBI and get blessed in the usable set.

deustp01 commented 2 years ago

whether Reactome expects all the mismatches to eventually be mapped to Rhea and be incorporated into ChEBI and get blessed in the usable set.

Yes, as above, that is the hope: "if we can't specify something well enough to get a ChEBI identifier for it, it's not well enough specified for GO-CAM either." I'm expecting / hoping / guessing from the work with Rhea and ChEBI over the past few years that we are not going to run into the issue of chemicals important to annotate human (patho)physiology that are a priori out of scope for these other resources. Also, there are generic terms, items like "polypeptide" or "nucleotide" that we can continue to use to ensure that all Reactome physical entities can be mapped to something in ChEBI to enable conversion to GO-CAM to proceed.

cmungall commented 2 years ago

I am confident we can get a simple biologist-friendly subset that satisfies all our requirements IF chebi can fix one thing.

Right now it is impossible to make a subset of chebi that excludes non-pH 7.3, non-protonated forms without losing large numbers of important classifications. I finally got around to making a comprehensive report for CHEBI:

https://github.com/ebi-chebi/ChEBI/issues/4207

From a GO perspective, this is one of the most important things CHEBI could work on. I suspect this will be high priority for Rhea too. I know it is a priority for multiple other ontologies that use CHEBI.

Note that we would be interested in seeing a systematic approach to this - manually synchronizing the different branches for the different protonated forms is not scalable. I am willing to spend lots of time with the CHEBI team to explain how OWL can help solve this in a systematic way.