Closed. HolmquistJ closed this issue 11 months ago.
This makes sense and is what I would hope we would move towards as well. For this aspect, though:
"If it is a close match using text recognition, then the function would ask you to approve it..."
are we referring to machine learning, or simply a large list of strings that correspond to synonyms of the controlled vocab? Maybe we should discuss whether you or Kathe had a particular method in mind.
I wouldn't start with a super complicated machine learning approach; maybe just partial string matching in R, like grepl().
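To make that concrete, here is a minimal sketch of partial/approximate string matching in base R. The vocabulary terms and the `suggest_matches()` helper are hypothetical examples, not anything in the repo; `agrepl()` does approximate (edit-distance) matching, so no machine learning is needed:

```r
# Hypothetical controlled vocabulary, for illustration only
controlled_vocab <- c("dry_bulk_density", "fraction_organic_matter", "loss_on_ignition")

# Return any vocabulary terms within an edit-distance threshold of the
# submitted name. max_distance is a fraction of the pattern length.
suggest_matches <- function(new_name, vocab, max_distance = 0.25) {
  vocab[agrepl(new_name, vocab, max.distance = max_distance, ignore.case = TRUE)]
}

suggest_matches("loss on ignition", controlled_vocab)
#> [1] "loss_on_ignition"
```

The candidates returned here are what would populate the "approve it?" prompt or a GUI drop-down.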
This came out of the idea of a key-based translation from one data model to another. In theory this would allow us to build a GUI that lets users match their vocabulary with ours. Here is the current structure of the 'key' in my head. I'm working to get this up and running for ISCN, but that's currently on an older structure.
source_table | source_id | source_header_or_entry | target_vocabulary | target_type
---|---|---|---|---
[filename OR worksheet OR table name] | header/entry | [header name OR free text entry] | [controlled vocabulary name] | value/uncertainty/unit/method
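Sketching the key as a plain data.frame might make the structure concrete. The column names follow the table above; the two rows are made-up examples, not real ISCN mappings:

```r
# Hypothetical example of the translation key; entries are invented
translation_key <- data.frame(
  source_table           = c("depthseries", "depthseries"),
  source_id              = c("header", "entry"),
  source_header_or_entry = c("loi", "Loss on Ignition"),
  target_vocabulary      = c("fraction_organic_matter", "fraction_organic_matter"),
  target_type            = c("value", "method"),
  stringsAsFactors = FALSE
)
```

A GUI could then be a front end that appends rows to this table as users map their vocabulary to ours.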
I like the idea of a full-on machine learning project, but I think that's something to tuck away for a future proposal partnering with some ML/natural language processing folks. I suspect there is a way to do a partial string match to help populate a drop-down box for a GUI, which is more in line with what you are thinking of, @HolmquistJ.
I developed a very rough, incomplete prototype of what this might look like as a curation function. It does basically everything listed in @HolmquistJ's step 1 as described above. See "scripts/1_data_formatting/experimental_curation_functions.R".
@HolmquistJ Is this still something you'd like to implement?
This workflow, as I understand it, was pitched to me by @ktoddbrown at a dinner at AGU. I think we should try to implement it here, since we may eventually need to transition a lot of our work from full-time coding specialists to part-time general technicians.
All attribute names and variable names should be stored in a big table that our hook scripts interact with: adding synonyms for our controlled attribute and variable names, asking the data entry person to confirm the matches the function thinks it has found, then making the conversions to the standard vocabulary.
For example, imagine a function, matchCcOntologies(listOfNewDataTables, listOfRecognizedVocabulary) { ... code stuff ... }
listOfNewDataTables would be a list of the data tables we are bringing in to the synthesis.
listOfRecognizedVocabulary would be a list of all of our controlled attribute and variable names, as well as synonyms we have encountered from previous inputs.
Step 1: The function would check all of the column and variable names in the list of new data tables against the listOfRecognizedVocabulary
- If the new data table had a former entry, for example "loi", that was already recognized, the function would ask you to approve it.
- If it is a close match using text recognition, the function would ask you to approve it, then add the new spelling as a synonym and the data source to the master table.
- If it needs to be defined, the function would ask you to define it, then add that as a synonym and the data source to the master table.
- The last option would be simply noting that we don't control for the variable and making sure it is defined in the submitter's metadata.
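The step 1 logic above might be sketched roughly like this. This is not the prototype in the repo; the function body, the interactive `readline()` prompts, and the `max.distance` threshold are all assumptions for illustration:

```r
# Rough interactive sketch of step 1; names follow the discussion above,
# none of this is a final API.
matchCcOntologies <- function(listOfNewDataTables, listOfRecognizedVocabulary) {
  for (tbl in listOfNewDataTables) {
    for (nm in names(tbl)) {
      if (nm %in% listOfRecognizedVocabulary) next  # already recognized
      close <- listOfRecognizedVocabulary[
        agrepl(nm, listOfRecognizedVocabulary, max.distance = 0.25)
      ]
      if (length(close) > 0) {
        # Close match: ask the user to approve the synonym
        ans <- readline(paste0("Is '", nm, "' a synonym of '", close[1], "'? (y/n) "))
        if (tolower(ans) == "y") {
          listOfRecognizedVocabulary <- c(listOfRecognizedVocabulary, nm)
        }
      } else {
        # No match: needs a definition, or must be covered by submitter metadata
        message("'", nm, "' needs a definition or must be defined in the submitter's metadata.")
      }
    }
  }
  listOfRecognizedVocabulary
}
```

A real version would also record the data source alongside each approved synonym in the master table.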
Step 2: After iterating through all attribute names and categorical variables, the script would convert the user's attribute and variable names to the CCRCN controlled vocab.
Step 3: An independent check to make sure all attributes and variables are either defined by our controlled vocab or defined in the user's metadata.
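Steps 2 and 3 might reduce to a rename plus a coverage check, something like this sketch. The synonym map, controlled list, and helper names are made-up examples:

```r
# Hypothetical synonym map: submitted name -> CCRCN controlled name
synonyms   <- c(loi = "fraction_organic_matter")
controlled <- c("fraction_organic_matter", "dry_bulk_density")

# Step 2: convert the user's attribute names to the controlled vocab
standardize_names <- function(df, synonyms) {
  hits <- names(df) %in% names(synonyms)
  names(df)[hits] <- synonyms[names(df)[hits]]
  df
}

# Step 3: anything left over must be defined in the submitter's metadata
check_coverage <- function(df, controlled, metadata_attributes) {
  setdiff(names(df), c(controlled, metadata_attributes))  # empty if all covered
}

df <- standardize_names(data.frame(loi = c(0.12, 0.30)), synonyms)
check_coverage(df, controlled, metadata_attributes = character(0))
#> character(0)
```

A non-empty result from the check would flag attributes that are neither controlled nor documented.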