ESIPFed / sweet

Official repository for Semantic Web for Earth and Environmental Terminology (SWEET) Ontologies

Align(?) Observation Data Model 2 Variable Names #142

Open cbode opened 4 years ago

cbode commented 4 years ago

What would be the process for finding existing ontology terms for the ODM2.0 controlled vocabularies, and how to add the terms not existing? Or do you create a new ontological structure for this controlled vocabulary set? ODM variable names are defined as things that can be measured or observed, either by sensors or by sampling and lab analysis.

http://vocabulary.odm2.org/variablename/

Attribution

If you would like a nanoattribution, please indicate your ORCID id 0000-0002-9654-6352

lewismc commented 4 years ago

What would be the process for finding existing ontology terms for the ODM2.0 controlled vocabularies, and how to add the terms not existing?

Two things here. First, finding existing ontology terms would be a manual process carried out by a subject matter expert (maybe yourself). Second, I suggest adding terms one by one, unless the set of suggested additions is too large for that to be manageable, in which case they could be grouped and we would set up a thorough review.

Or do you create a new ontological structure for this controlled vocabulary set?

At this stage, I don't know. I do not know the structure and extent of ODM2.0, and therefore the impact it would have on SWEET. If you could enlighten us on this, it would be appreciated.

graybeal commented 4 years ago

To find matches, I would be tempted to put ODM2.0 into BioPortal (you can keep it private if you want), and wait a few hours/overnight for the BioPortal mapping engine to find the syntactic matches to the SWEET ontology. I think that should fairly quickly give you a reading on how much matches up. (To be honest, I'm not intimately familiar yet with that mapping, but I think it will be quite decent in this case.)

There's an even faster way to get a quick feel for it. Turn the labels (column 2) of your ODM2.0 file into comma-separated strings (attached file; had to delete the commas within terms), paste that into the BioPortal Annotator (https://bioportal.bioontology.org/annotator) and set it to use SWEET as the annotation ontology. See how many of those strings find a match in SWEET. (Set the 'use longest match' attribute.) You can only do 200 at a time (about 25%) using the UI, but you can use the API if you don't want to do 4 entries.

Alas, there's a hitch just at the moment with both these ideas. Right now BioPortal can't parse the latest SWEET, as it appears there's an issue related to the OWLAPI parsing library accessing SWEET ontologies. I hope to have that sorted by the morning, but in the meantime, you can do a similar comparison to CHEBI. Here's an example using the API with the first 200 terms; it produces 181 matches for your 'expert review'. (If you haven't used the BioPortal API, you'll need a key; instructions on getting one and a link on how to use it are at https://bioportal.bioontology.org/help#Programming_with_the_BioPortal_API.)

http://data.bioontology.org/annotator?text=1,1,1-Trichloroethane, 1,1,2,2-Tetrachloroethane, 1,1,2-Trichloroethane, 1,1-Dichloroethane, 1,1-Dichloroethene, 1,2,3-Trimethylbenzene, 1,2,4,5-tetrachlorobenzene, 1,2,4-Trichlorobenzene, 1,2,4-Trimethylbenzene, 1,2-Dibromo-3-chloropropane, 1,2-Dichlorobenzene, 1,2-Dichloroethane, 1,2-Dichloropropane, 1,2-Dimethylnaphthalene, 1,2-Dinitrobenzene, 1,2-Diphenylhydrazine, 1,3,5-Trimethylbenzene, 1,3-Dichlorobenzene, 1,3-Dimethyladamantane, 1,3-Dimethylnaphthalene, 1,3-Dinitrobenzene, 1,4,5,8-Tetramethylnaphthalene, 1,4,5-Trimethylnaphthalene, 1,4,6-Trimethylnaphthalene, 1,4-Dichlorobenzene, 1,4-Dimethylnaphthalene, 1,4-Dinitrobenzene, 1,5-Dimethylnaphthalene, 1,6,7-Trimethylnaphthalene, 1,6-Dimethylnaphthalene, 1,8-Dimethylnaphthalene, 1-Chloronaphthalene, 1-Ethylnaphthalene, 1-Methylanthracene, 1-Methyldibenzothiophene, 1-Methylfluorene, 1-Methylnaphthalene, 1-Methylphenanthrene, 1-Naphthalenol methylcarbamate, 19-Hexanoyloxyfucoxanthin, 2,2-dichlorovinyl dimethyl phosphate, 2,3,4,6-Tetrachlorophenol, 2,3,5-Trimethylnaphthalene, 2,3,6-Trimethylnaphthalene, 2,3-Dimethylnaphthalene, 2,4,5-Trichlorophenol, 2,4,6-Trichlorophenol, 2,4-Dichlorophenol, 2,4-Dimethylphenol, 2,4-Dinitrophenol, 2,4-Dinitrotoluene, 2,6-Dichlorophenol, 2,6-Dinitrotoluene, 2,7-Dimethylnaphthalene, 2-Butanone (MEK), 2-Butoxyethanol, 2-Chloronaphthalene, 2-Chlorophenol, 2-Hexanone, 2-Methylanthracene, 2-Methyldibenzothiophene, 2-Methylnaphthalene, 2-Methylphenanthrene, 2-Methylphenol, 2-Nitroaniline, 2-Nitrophenol, 3,3-Dichlorobenzidine, 3,6-Dimethylphenanthrene, 3-Nitroaniline, 4,4-DDD, 4,4-DDE, 4,4-DDT, 4,4-Methylenebis(2-chloroaniline), 4,4-Methylenebis(N,N-dimethylaniline), 4,6-Dinitro-2-methylphenol, 4-Bromophenylphenyl ether, 4-Chloro-3-methylphenol, 4-Chloroaniline, 4-Chlorophenylphenyl ether, 4-Methylchrysene, 4-Methyldibenzothiophene, 4-Methylphenol, 4-Nitroaniline, 4-Nitrophenol, 9 cis-Neoxanthin, 
9,10-Dimethylanthracene, Absorbance, Abundance, Acenaphthene, Acenaphthylene, Acetate, Acetic Acid, Acetone, Acetophenone, Acid neutralizing capacity, Acid phosphatase, Acidity CO2 acidity, Acidity exchange, Acidity hot, Acidity mineral acidity, Acidity total acidity, Adamantane, Agency code, Albedo, Aldrin, Alkalinity bicarbonate, Alkalinity carbonate, Alkalinity carbonate plus bicarbonate, Alkalinity hydroxide, Alkalinity total, Alloxanthin, Alpha-N-Acetylglucosaminidase, Altitude, Aluminum, Aluminum dissolved, Aluminum particulate, Aluminum total, Ammonium flux, Aniline, Anthracene, Antimony dissolved, Antimony distribution coefficient, Antimony particulate, Antimony total, Area, Argon, Argon dissolved, Aroclor-1016, Aroclor-1242, Aroclor-1254, Aroclor-1260, Arsenic dissolved, Arsenic distribution coefficient, Arsenic particulate, Arsenic total, Asteridae coverage, Barium dissolved, Barium distribution coefficient, Barium particulate, Barium total, Barometric pressure, Baseflow, Batis maritima Coverage, Battery temperature, Battery voltage, Benthos, Benz(a)anthracene, Benzene, Benzo(a)pyrene, Benzo(b)fluoranthene, Benzo(b)fluorene, Benzo(e)pyrene, Benzo(g,h,i)perylene, Benzo(k)fluoranthene, Benzoic acid, Benzyl alcohol, Beryllium dissolved, Beryllium total, Beta-glucosidase, Bicarbonate, Bifenthrin, Biogenic silica, Biomass, Biomass phytoplankton, Biomass total, Biphenyl, Bis(2-chloroethoxy)methane, bis(2-Chloroethyl)ether, Bis-(2-ethylhexyl) phthalate, bis-2-chloroisopropyl ether, Blue-green algae (cyanobacteria) phycocyanin, BOD1, BOD2 carbonaceous, BOD20, BOD20 carbonaceous, BOD20 nitrogenous, BOD3 carbonaceous, BOD4 carbonaceous, BOD5, BOD5 carbonaceous, BOD5 nitrogenous, BOD6 carbonaceous, BOD7 carbonaceous, BODu, BODu carbonaceous, BODu nitrogenous, Body length, Borehole log material classification, Boron dissolved, Boron total, Borrichia frutescens Coverage, Bromide dissolved, Bromide total, Bromine, Bromine dissolved, Bromodichloromethane, Bromoform, 
Bromomethane (Methyl bromide), Bulk density, Bulk electrical conductivity, &ontologies=CHEBI&longest_only=true&exclude_numbers=false&whole_word_only=true&exclude_synonyms=false

ODM2 keywords.txt
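For anyone who prefers the API route over pasting 200 terms at a time into the UI, here is a rough Python sketch of how the request URLs could be built in batches. The query parameters are the ones in the example URL above; passing the key as an `apikey` query parameter, the `YOUR_API_KEY` placeholder, the `SWEET` acronym, and the helper names `chunk` and `annotator_url` are assumptions made for illustration. The sketch only builds URLs and does not send any request.

```python
from urllib.parse import urlencode

ANNOTATOR_URL = "http://data.bioontology.org/annotator"


def chunk(items, size=200):
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def annotator_url(labels, ontology="CHEBI", api_key="YOUR_API_KEY"):
    """Build one Annotator request URL for a batch of labels (hypothetical helper)."""
    # Following the thread, commas inside a term are deleted so that
    # commas only separate terms.
    text = ", ".join(label.replace(",", "") for label in labels)
    params = {
        "text": text,
        "ontologies": ontology,
        "longest_only": "true",
        "exclude_numbers": "false",
        "whole_word_only": "true",
        "exclude_synonyms": "false",
        "apikey": api_key,  # assumption: key passed as a query parameter
    }
    # urlencode percent-encodes spaces, commas and parentheses.
    return ANNOTATOR_URL + "?" + urlencode(params)


# A tiny stand-in for column 2 of the ODM2 variable name file.
labels = ["Acetone", "Battery voltage", "Benzo(g,h,i)perylene"]
urls = [annotator_url(batch, ontology="SWEET") for batch in chunk(labels)]
for url in urls:
    print(url)
```

With the full vocabulary, `chunk` would produce a handful of URLs, one per batch of 200 labels.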

brandonnodnarb commented 4 years ago

@cbode, there is a good chunk of ODM2 terms that are defined in ChEBI (as well as other resources), as @graybeal mentions. Would it not be beneficial if ODM2 linked their terms to, or draw their definitions from, that resource?

The method @graybeal describes makes sense. I usually also want to know 'close' (or similar) matches. I'm not sure if BioPortal does this type of match, or can do this type of match.

I used SequenceMatcher to generate a similarity score. The similarity scores are available in a gist (tab delimited). I arbitrarily set the threshold to a 75% match and have not removed any results, so the file still contains many false positives, increasing in likelihood from about line 110.

This is only meant to help get an idea of the scale of potential overlap and, perhaps, help with some direction. It is not "the answer" :)
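For readers who want to try something similar, a minimal sketch of this kind of scoring with `difflib.SequenceMatcher` from the Python standard library might look like the following. It is not the script behind the gist: the lowercasing, the sample labels, and the helper names `similarity` and `close_matches` are assumptions; only the 0.75 threshold and the tab-delimited output follow the description above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two labels, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def close_matches(odm_terms, sweet_labels, threshold=0.75):
    """Score every ODM2/SWEET label pair and keep those at or above the threshold."""
    rows = []
    for odm in odm_terms:
        for sweet in sweet_labels:
            score = similarity(odm, sweet)
            if score >= threshold:
                rows.append((odm, sweet, round(score, 3)))
    # Best matches first, so doubtful pairs sink to the bottom of the file.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Illustrative labels only; real input would be the two full vocabularies.
odm_terms = ["Albedo", "Barometric pressure", "Biomass"]
sweet_labels = ["albedo", "pressure", "biomass", "biome"]

matches = close_matches(odm_terms, sweet_labels)
for row in matches:
    print("\t".join(str(field) for field in row))
```

Comparing every pair is quadratic in the vocabulary sizes, which is fine for a one-off scale check like this one.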

emiliom commented 4 years ago

@cbode: @graybeal pointed me to this new issue at the ESIP meeting. Nice chatting with you on Friday, however briefly!

Could you say more about what's your driver or intended application for this mapping? That would give me more context to think about. It might be helpful to open an issue later on over at the ODM2 controlled vocabularies repo, https://github.com/ODM2/ODM2ControlledVocabularies/, depending on where this goes. The ODM2 gang (myself included) would probably be interested, though there may not be much bandwidth. I'm also tagging @miguelcleon b/c he engaged us fairly recently via email about some ODM2 variable name vocabulary additions driven by CZO cataloguing needs, which in turn generated some useful distinctions about the use of this vocabulary for actual results (the data) as opposed to as keywords for datasets.

Thanks to @brandonnodnarb for making available his gist based on SequenceMatcher similarity scores. I've started examining the table to get a better feel for what kind of matches it's generating. I'll share this soon, once I have something worthwhile. I've looked at SWEET occasionally, but it's been a while. @graybeal, could you elaborate on how the bioportal matching you describe would differ from @brandonnodnarb's SequenceMatcher results?

cmungall commented 4 years ago

Yet another alignment:

https://github.com/cmungall/sweet-obo-alignment/blob/master/align-sweet-odm.tsv

This probably would not look so different from the bioportal one, though I think bioportal is still quite conservative, e.g. no stemming.

This should have fewer false positives than a sequence-matching approach, but may have lower recall.

graybeal commented 4 years ago

We got SWEET 3.3.0 into BioPortal, and I did the same exercise as in my first comment above, only with SWEET instead of CHEBI (look near the end of the string for the term to replace). The matches were quite unsatisfying IMHO: your vocabulary's Battery Temperature would be matched by SWEET "battery" and "temperature". Battery Voltage likewise, and so on. Of the 50 matches (vs 181 for CHEBI, weighted extra by the chemical names at the beginning), quite a number were repeated partial matches.

Based on this result, I would say your value proposition only comes if you can match SWEET components to a single ODM term ("Battery Voltage" hasSweetPart "Battery", "Voltage"). Or, you could fairly quickly find the accurate individual mappings out of one of these lists, if those were of value.
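To make the component idea concrete, here is a small Python sketch that reports an ODM term only when every one of its words is itself a SWEET label. The function name `sweet_parts`, the whitespace splitting, and the sample labels are assumptions for illustration; `hasSweetPart` is just the placeholder relation from the comment above, not an existing property.

```python
def sweet_parts(odm_term, sweet_labels):
    """Return the SWEET labels composing an ODM term, or None if any word is unmatched."""
    known = {label.lower() for label in sweet_labels}
    words = odm_term.lower().split()
    parts = [word for word in words if word in known]
    # Only a full decomposition counts; a partial one is the noisy case
    # described above.
    return parts if words and len(parts) == len(words) else None


# Illustrative labels only, not taken from the actual SWEET files.
sweet_labels = ["Battery", "Voltage", "Temperature"]

for term in ["Battery voltage", "Battery temperature", "Bulk density"]:
    parts = sweet_parts(term, sweet_labels)
    if parts:
        print(f'"{term}" hasSweetPart ' + ", ".join(f'"{p}"' for p in parts))
    else:
        print(f'"{term}": no full decomposition')
```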

Responding to an earlier question, I don't know offhand the exact matching scheme for Annotations and for LOOM mappings in BioPortal, but I know it's documented 'in the literature'. If it seems relevant we can dig up the detailed documentation.

cmungall commented 4 years ago

I'll try running owl patternizer on this later and see if useful class expression equivalents come out.


emiliom commented 4 years ago

Thanks @graybeal and @cmungall.

FYI, in https://github.com/cmungall/sweet-obo-alignment/blob/master/align-sweet-odm.tsv, match label entries often have one or more characters missing at the end, which seems odd.

cmungall commented 4 years ago

That happens when stemming was applied. You can see this in the obscure expression in the last column. Stemming gives higher recall (e.g. matching plural to singular) but more errors (e.g. organization to organism).
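As a rough illustration of that trade-off, here is a toy suffix-stripping stemmer in Python. It is not the stemmer used to produce the alignment file; the suffix list and the name `toy_stem` are assumptions chosen only to reproduce the two effects mentioned: plural and singular forms collapsing to one key, and unrelated words colliding.

```python
# Longest suffixes first, so "organisms" loses "isms" rather than just "s".
SUFFIXES = ("ization", "isms", "ism", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Higher recall: plural and singular labels now share one key.
print(toy_stem("aerosols"), toy_stem("aerosol"))

# More error: two unrelated words are reduced to the same stem.
print(toy_stem("organization"), toy_stem("organism"))

# Stemmed labels can look truncated, e.g. a trailing character goes missing.
print(toy_stem("biomass"))
```

Matching on stems rather than full labels is what makes the matched labels appear to be missing characters at the end.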
