NER for chemicals aka carbon substrates in trait table

realmarcin commented 3 years ago

This is the unique list of chemicals in the carbon substrate column: https://github.com/Knowledge-Graph-Hub/kg-microbe/blob/master/schemas/distinct_carbon_substrates.txt

Run OGER NER on carbon substrates and characterize output based on exact string match vs X etc.
Use (four different types) synonyms from CHEBI to expand matches.
Identify poorly matching subset and see if there are additional matches (and append to mapping table).
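A minimal sketch of the "characterize output" step, assuming the OGER output has already been reduced to pairs of (source term, text the NER tagged). It does not call OGER itself. The function name, the sample pairs, and the substring rule for a partial match are illustrative assumptions, not the project's actual criteria.

```python
from typing import Optional


def classify_match(source_term: str, matched_text: Optional[str]) -> str:
    """Label an NER hit as an 'Exact', 'Partial', or 'No' string match."""
    source = source_term.strip().lower()
    matched = (matched_text or "").strip().lower()
    if not matched:
        return "No"
    if matched == source:
        return "Exact"
    # Assumption: a hit covering only part of the source term is 'Partial'.
    if matched in source:
        return "Partial"
    return "No"


# Hypothetical (source term, tagged text) pairs, for illustration only.
hits = [
    ("sorbitol", "sorbitol"),
    ("chalcopyrite oxidation", "chalcopyrite"),
    ("unmapped substrate", None),
]
labels = {term: classify_match(term, matched) for term, matched in hits}
print(labels)
```

The 'Partial' and 'No' groups are the candidates for synonym expansion and manual review.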

The goal is to create an SSSOM mapping file that encodes the results of the chemical matching above, as was done here for pathways: https://github.com/Knowledge-Graph-Hub/kg-microbe/blob/master/pathways.sssom.tsv

hrshdhgd commented 3 years ago

Alright, so I first ran the NER and treated all terms with an 'Exact' string match as part of the nodes and edges. The terms with a 'Partial' or 'No' string match from NER were then inner joined (join keys: NER=['TokenizedTerm', 'CURIE'], SSSOM=['subject_label', 'object_id']) with the SSSOM files Chris provided in the 'schemas' folder to give these files.
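For reference, the join described above can be sketched in plain Python. The join keys are the ones named in this comment; the `StringMatch` column name, the sample rows, and the `CHEBI:0000000` placeholder are assumptions for illustration. The CHEBI IDs for glucitol and D-glucitol are read from the OLS links later in this thread. With pandas, the equivalent would be `DataFrame.merge` with `how="inner"`, `left_on`, and `right_on`.

```python
from collections import defaultdict

NER_KEYS = ("TokenizedTerm", "CURIE")
SSSOM_KEYS = ("subject_label", "object_id")


def inner_join(ner_rows, sssom_rows):
    """Inner join NER rows with SSSOM rows on the key pairs above."""
    index = defaultdict(list)
    for row in sssom_rows:
        index[tuple(row[k] for k in SSSOM_KEYS)].append(row)
    joined = []
    for row in ner_rows:
        key = tuple(row[k] for k in NER_KEYS)
        # Rows without a partner on the other side are dropped (inner join).
        for match in index.get(key, []):
            joined.append({**row, **match})
    return joined


# Hypothetical rows; only the key column names come from the thread.
ner_rows = [
    {"TokenizedTerm": "sorbitol", "CURIE": "CHEBI:30911", "StringMatch": "No"},
    {"TokenizedTerm": "sorbitol", "CURIE": "CHEBI:17924", "StringMatch": "No"},
    {"TokenizedTerm": "made-up term", "CURIE": "CHEBI:0000000", "StringMatch": "Partial"},
]
sssom_rows = [
    {"subject_label": "sorbitol", "object_id": "CHEBI:30911",
     "object_match_field": "hasRelatedSynonym"},
    {"subject_label": "sorbitol", "object_id": "CHEBI:17924",
     "object_match_field": "hasRelatedSynonym"},
]
joined = inner_join(ner_rows, sssom_rows)
print(len(joined))
```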

If I understand correctly, the terms that move forward to form nodes and edges would be the ones where the column 'object_match_field' == 'hasExactSynonym'?

Also, if 'hasExactSynonym' does not exist for a term, should we then consider 'hasRelatedSynonym', and otherwise 'rdfs:label'?

So the hierarchy would be:

hasExactSynonym
hasRelatedSynonym
rdfs:label ??

Please advise!

cc @cmungall @realmarcin @wdduncan

realmarcin commented 3 years ago

Hi Harshad, so per our discussion it looks like the current hierarchy would be similar, just explicitly adding the exact term match:

exact term match
hasExactSynonym
hasRelatedSynonym
rdfs:label ??
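A rough sketch of how this hierarchy could be applied per term, assuming each candidate mapping carries its match type in 'object_match_field'. Treating "exact term match" as a value of that column is an assumption, and the position of rdfs:label is still an open question in this thread. The function returns every candidate tied at the best level rather than picking one.

```python
# Priority order from the list above; earlier entries win.
PRIORITY = ["exact term match", "hasExactSynonym", "hasRelatedSynonym", "rdfs:label"]
RANK = {name: position for position, name in enumerate(PRIORITY)}


def best_matches(candidates):
    """Return all candidates that share the highest-priority match type."""
    ranked = [c for c in candidates if c["object_match_field"] in RANK]
    if not ranked:
        return []
    best = min(RANK[c["object_match_field"]] for c in ranked)
    return [c for c in ranked if RANK[c["object_match_field"]] == best]


# Hypothetical candidates for a single source term.
candidates = [
    {"object_id": "CHEBI:30911", "object_match_field": "hasRelatedSynonym"},
    {"object_id": "CHEBI:17924", "object_match_field": "hasRelatedSynonym"},
    {"object_id": "CHEBI:0000000", "object_match_field": "rdfs:label"},
]
chosen = best_matches(candidates)
print([c["object_id"] for c in chosen])
```

Returning ties instead of an arbitrary winner leaves cases with several equally ranked CHEBI IDs visible for review.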

Per discussions with Chris, there may be another case where we can generate a mapping by composing a chemical with a metabolic mode, for example chalcopyrite oxidation. In this way we can separate the chemical entities (CHEBI) from the metabolism modes (probably GO or ECOCORE), gaining more coverage and some generalizability going forward (with fewer cases having no mapping).
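One way the composition idea might look in code, as a sketch only: split a term into a chemical part and a trailing metabolic mode, so each part can be mapped separately. The `MODES` list and the "mode is the last word" rule are hypothetical; the real set of modes would come from whichever ontology (GO or ECOCORE) is chosen.

```python
# Hypothetical list of metabolic-mode words; not taken from GO or ECOCORE.
MODES = ("oxidation", "reduction", "fermentation")


def split_term(term):
    """Split 'chemical + mode' into (chemical, mode); mode is None if absent."""
    words = term.strip().split()
    if len(words) >= 2 and words[-1].lower() in MODES:
        return " ".join(words[:-1]), words[-1].lower()
    return term.strip(), None


print(split_term("chalcopyrite oxidation"))
print(split_term("sorbitol"))
```

The chemical part would then go through the CHEBI matching, and the mode through a separate lookup.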

hrshdhgd commented 3 years ago

How do we resolve this:

[Screenshot: Screen Shot 2021-03-04 at 5 05 28 PM]

Both have a 'hasRelatedSynonym' association. Which one should we use?

wdduncan commented 3 years ago

D-glucitol is a subtype of glucitol according to CHEBI. Here is the hierarchy on OLS:

https://www.ebi.ac.uk/ols/ontologies/chebi/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCHEBI_17924

So, was the TaxId matched to both the parent term (i.e., glucitol) and child term (D-glucitol)?

hrshdhgd commented 3 years ago

The actual term in the text was 'sorbitol', which has these 2 related synonyms (glucitol, D-glucitol) for TaxId = 285268.

wdduncan commented 3 years ago

Sorry, I wasn't clear.

Yes, I noticed that each match was for TaxId 285268.
When I look on OLS, I see sorbitol as a related synonym of both glucitol and D-glucitol.

However, I do not know which is most appropriate for TaxId 285268.

hrshdhgd commented 3 years ago

Since the 2 chemicals have different CHEBI IDs (but a common related synonym, sorbitol), I was wondering whether we could (should?) include both, or whether one would be preferred over the other.
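Cases like this could be flagged automatically before deciding how to resolve them. The sketch below groups joined rows by TaxId, source term, and match type, and reports groups that point to more than one CHEBI ID. The `TaxId` column name and the sample rows are assumptions; the CHEBI IDs are the ones linked in this thread.

```python
from collections import defaultdict


def find_ambiguous(rows):
    """Map (TaxId, term, match type) to its CHEBI IDs when there are several."""
    curies = defaultdict(set)
    for row in rows:
        key = (row["TaxId"], row["subject_label"], row["object_match_field"])
        curies[key].add(row["object_id"])
    return {key: sorted(ids) for key, ids in curies.items() if len(ids) > 1}


# Hypothetical joined rows mirroring the sorbitol case.
rows = [
    {"TaxId": "285268", "subject_label": "sorbitol",
     "object_id": "CHEBI:30911", "object_match_field": "hasRelatedSynonym"},
    {"TaxId": "285268", "subject_label": "sorbitol",
     "object_id": "CHEBI:17924", "object_match_field": "hasRelatedSynonym"},
]
ambiguous = find_ambiguous(rows)
print(ambiguous)
```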

realmarcin commented 3 years ago

There is also a D-sorbitol, so if the original term is 'sorbitol' then the correct synonym should be 'glucitol'. It may be that the name is actually underspecified and they mean D-sorbitol (we could tell if we had chemical formulas in the right format). There is also an L-sorbitol (and L-glucitol), but that is a much rarer and less relevant enantiomer, because in cells the upstream molecule is D-glucose, which determines the enantiomer for sorbitol (D vs L).

For completeness, CHEBI actually has 3 molecules for glucitol, but it seems incomplete in this view (the synonyms may be complete): https://www.ebi.ac.uk/ols/ontologies/chebi/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCHEBI_30911

[Screenshot: Screen Shot 2021-03-09 at 3 14 01 PM]

The other comment is that sorbitol is the 'common name'; I've heard it many times, but never glucitol. I'm not sure whether this is British vs American usage or biology vs chemistry. It may be hard to catch, but keeping the 'common name' in mind would be useful for future applications. For many applications it may not matter because of synonyms, but for user-facing things like a labeled graph or search, sorbitol would be preferred.

NER for chemicals aka carbon substrates in trait table #17