Closed realmarcin closed 2 years ago
Alright, so I first ran the NER and treated every term with an 'Exact' string match as part of the nodes and edges. The terms with either a 'Partial' or a 'No' string match from NER were then inner-joined (join keys: NER=['TokenizedTerm', 'CURIE'], SSSOM=['subject_label', 'object_id']) with the SSSOM files Chris provided in the 'schemas' folder to give these files.
So, if I understand correctly, the terms that move forward to form nodes and edges would be the ones where the column 'object_match_field' == 'hasExactSynonym'?
Also, if 'hasExactSynonym' does not exist for a term, should we then consider 'hasRelatedSynonym', and otherwise 'rdfs:label'?
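A minimal sketch of the inner join described above, written with plain dictionaries so it runs without extra dependencies. Only the column names ('TokenizedTerm', 'CURIE', 'subject_label', 'object_id', 'object_match_field') come from this thread; the rows, the placeholder CURIEs, and the `inner_join` helper are hypothetical. With pandas, `DataFrame.merge(..., how="inner", left_on=..., right_on=...)` would do the same job.

```python
# Hypothetical rows; the CURIEs are placeholders, not real CHEBI identifiers.
ner_rows = [
    {"TokenizedTerm": "sorbitol", "CURIE": "CHEBI:PLACEHOLDER_1", "match": "Partial"},
    {"TokenizedTerm": "unmapped term", "CURIE": "CHEBI:PLACEHOLDER_9", "match": "No"},
]
sssom_rows = [
    {"subject_label": "sorbitol", "object_id": "CHEBI:PLACEHOLDER_1",
     "object_match_field": "hasRelatedSynonym"},
]


def inner_join(left, right, left_keys, right_keys):
    """Merge rows whose left key values equal the paired right key values."""
    # Index the right-hand rows by their key tuple for constant-time lookup.
    index = {}
    for r in right:
        index.setdefault(tuple(r[k] for k in right_keys), []).append(r)
    joined = []
    for l in left:
        # Rows without a partner on the other side are dropped (inner join).
        for r in index.get(tuple(l[k] for k in left_keys), []):
            joined.append({**l, **r})
    return joined


joined = inner_join(
    ner_rows, sssom_rows,
    left_keys=["TokenizedTerm", "CURIE"],
    right_keys=["subject_label", "object_id"],
)
```

Because the join is inner, NER terms with no matching SSSOM row disappear from the result, so it may be worth counting them separately to see how many terms end up with no mapping.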
So the hierarchy would be:
Please advise!
cc @cmungall @realmarcin @wdduncan
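The fallback order asked about above ('hasExactSynonym', then 'hasRelatedSynonym', then 'rdfs:label') could be sketched as follows. The `pick_best` helper and the sample rows are hypothetical, and the ordering is the one proposed in the question, not a confirmed rule.

```python
# Priority order as proposed in the question above (an assumption until confirmed).
PRIORITY = ["hasExactSynonym", "hasRelatedSynonym", "rdfs:label"]


def pick_best(candidates):
    """Keep only the candidates whose match field ranks highest in PRIORITY.

    Returns a list so that ties (several candidates at the same level)
    stay visible instead of being silently resolved.
    """
    rank = {field: i for i, field in enumerate(PRIORITY)}
    ranked = [c for c in candidates if c["object_match_field"] in rank]
    if not ranked:
        return []
    best = min(rank[c["object_match_field"]] for c in ranked)
    return [c for c in ranked if rank[c["object_match_field"]] == best]


# Hypothetical candidates for a single term.
candidates = [
    {"object_label": "label A", "object_match_field": "rdfs:label"},
    {"object_label": "label B", "object_match_field": "hasRelatedSynonym"},
    {"object_label": "label C", "object_match_field": "hasRelatedSynonym"},
]
best = pick_best(candidates)
```

Here `best` holds both 'hasRelatedSynonym' candidates, which is the tie situation that still needs a manual or rule-based decision.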
Hi Harshad, so per our discussion it looks like the current hierarchy would be similar, just explicitly adding the exact term match:
Per discussions with Chris, there may be another case where we can generate a mapping by composing a chemical with a metabolic mode, e.g. chalcopyrite oxidation. In this way we can separate the chemical entities (CHEBI) from the metabolism modes (probably GO or ECOCORE) and gain more coverage and some generalizability going forward (with fewer cases having no mapping).
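As a rough illustration of that composition idea, a term could be split into a chemical part and a metabolic-mode part before each is mapped separately. The `METABOLIC_MODES` set and the `split_compound_term` helper are hypothetical; a real list of modes would come from GO or ECOCORE.

```python
# Hypothetical list of metabolic-mode words; not taken from any ontology.
METABOLIC_MODES = {"oxidation", "reduction", "fermentation"}


def split_compound_term(term):
    """Split 'chemical mode' into (chemical, mode).

    Returns (term, None) when the last word is not a known mode,
    so plain chemical names pass through unchanged.
    """
    cleaned = term.strip()
    head, _, tail = cleaned.rpartition(" ")
    if head and tail.lower() in METABOLIC_MODES:
        return head, tail.lower()
    return cleaned, None


chemical, mode = split_compound_term("chalcopyrite oxidation")
```

The chemical part would then go through the CHEBI matching and the mode part through the GO/ECOCORE matching, with the pair recorded as one composed mapping.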
How do we resolve this:
Both have a 'hasRelatedSynonym' association. Which one would be the one we're interested in?
D-glucitol is a subtype of glucitol according to CHEBI. Here is the hierarchy on OLS:
So, was the TaxId matched to both the parent term (i.e., glucitol) and the child term (D-glucitol)?
The actual term in the text was 'sorbitol', which has these 2 related synonyms (glucitol, D-glucitol) for TaxId = 285268.
Sorry, I wasn't clear.
TaxId 285268 has sorbitol as a related synonym to both glucitol and D-glucitol. However, I do not know which is most appropriate for TaxId 285268.
Since the 2 chemicals have different CHEBI IDs (but a common related synonym, sorbitol), I was wondering if we could (should?) include both, or would one be preferred over the other?
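Cases like this one can be listed automatically so they get reviewed instead of being resolved silently. A minimal sketch, assuming the mappings are available as (tax_id, term, object_id, object_label) tuples; the `find_ambiguous` helper and the placeholder CURIEs are hypothetical.

```python
from collections import defaultdict


def find_ambiguous(mappings):
    """Return {(tax_id, term): [(object_id, object_label), ...]} for every
    term that maps to more than one distinct object_id."""
    grouped = defaultdict(set)
    for tax_id, term, object_id, object_label in mappings:
        grouped[(tax_id, term)].add((object_id, object_label))
    return {
        key: sorted(targets)
        for key, targets in grouped.items()
        if len({object_id for object_id, _ in targets}) > 1
    }


# Placeholder CURIEs stand in for the real CHEBI identifiers.
mappings = [
    ("285268", "sorbitol", "CHEBI:PLACEHOLDER_1", "glucitol"),
    ("285268", "sorbitol", "CHEBI:PLACEHOLDER_2", "D-glucitol"),
    ("285268", "other term", "CHEBI:PLACEHOLDER_3", "other label"),
]
ambiguous = find_ambiguous(mappings)
```

Only the sorbitol entry is reported here, because it is the only term with two distinct targets.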
There is also a D-sorbitol, so if the original term is 'sorbitol' then the correct synonym should be 'glucitol'. It may be the case that the name is actually underspecified and they mean D-sorbitol (we could tell if we had chemical formulas in the right format). There is also an L-sorbitol (and L-glucitol), but that is a much rarer and less relevant enantiomer, because in cells the upstream molecule is D-glucose, which determines the enantiomer for sorbitol (D vs L).
For completeness, CHEBI actually has 3 molecules for glucitol but seems incomplete in this view (the synonyms may be complete): https://www.ebi.ac.uk/ols/ontologies/chebi/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCHEBI_30911
The other comment is that sorbitol is the 'common name'; I've heard that many times, but never glucitol. Not sure if this is British vs American or biology vs chemistry usage. It may be hard to catch, but at least thinking about the 'common name' would be useful for future applications. For many applications it may not matter because of synonyms, but for user-facing things like a labeled graph or search, sorbitol would be preferred.
This is the unique list of chemicals under the carbon substrate column: https://github.com/Knowledge-Graph-Hub/kg-microbe/blob/master/schemas/distinct_carbon_substrates.txt
The goal is to create an SSSOM mapping file to encode the results of the above chemical matching, as here for pathways: https://github.com/Knowledge-Graph-Hub/kg-microbe/blob/master/pathways.sssom.tsv
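A minimal sketch of writing such a file as tab-separated values. The column set is an assumption pieced together from the fields mentioned in this thread ('subject_label', 'object_id', 'object_match_field') plus 'predicate_id' and 'object_label'; it should be checked against the existing pathways file and the SSSOM specification. The `write_sssom` helper, the row content, and the placeholder CURIE are hypothetical.

```python
import csv
import io

# Assumed column set; verify against pathways.sssom.tsv and the SSSOM spec.
COLUMNS = ["subject_label", "predicate_id", "object_id",
           "object_label", "object_match_field"]


def write_sssom(rows, handle):
    """Write mapping rows as a tab-separated table with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical row; the CURIE is a placeholder, not a real identifier.
rows = [{
    "subject_label": "sorbitol",
    "predicate_id": "skos:relatedMatch",
    "object_id": "CHEBI:PLACEHOLDER_1",
    "object_label": "glucitol",
    "object_match_field": "hasRelatedSynonym",
}]

buffer = io.StringIO()
write_sssom(rows, buffer)
tsv_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` in place of the `StringIO` buffer.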