Closed. HolmquistJ closed this issue 11 months ago.
This makes sense and is what I would hope we would move towards as well. For this aspect, though:
"If it is a close match using text recognition, then the function would ask you to approve it..."
are we referring to machine learning, or simply a large list of strings that correspond to synonyms of the controlled vocab? Maybe we should discuss whether you or Kathe had a particular method in mind.
I wouldn't start with a super complicated machine learning approach; maybe just partial string matching in R, like grepl().
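To make that concrete, here is a minimal sketch of partial/approximate string matching in base R. The vocabulary terms and the `suggest_matches()` helper are hypothetical examples, not anything in the repo; `agrepl()` does approximate (edit-distance) matching, so no machine learning is needed:

```r
# Hypothetical controlled vocabulary, for illustration only
controlled_vocab <- c("dry_bulk_density", "fraction_organic_matter", "loss_on_ignition")

# Return any vocabulary terms within an edit-distance threshold of the
# submitted name. max_distance is a fraction of the pattern length.
suggest_matches <- function(new_name, vocab, max_distance = 0.25) {
  vocab[agrepl(new_name, vocab, max.distance = max_distance, ignore.case = TRUE)]
}

suggest_matches("loss on ignition", controlled_vocab)
#> [1] "loss_on_ignition"
```

The candidates returned here are what would populate the "approve it?" prompt or a GUI drop-down.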
This came out of the idea of a key-based translation from one data model to another. In theory this would allow us to build a GUI that lets users match their vocabulary with ours. Here is the current structure of the 'key' in my head. I'm working to get this up and running for ISCN, but that's currently on an older structure.
source_table | source_id | source_header_or_entry | target_vocabulary | target_type
---|---|---|---|---
[filename OR worksheet OR table name] | header/entry | [header name OR free text entry] | [controlled vocabulary name] | value/uncertainty/unit/method
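Sketching the key as a plain data.frame might make the structure concrete. The column names follow the table above; the two rows are made-up examples, not real ISCN mappings:

```r
# Hypothetical example of the translation key; entries are invented
translation_key <- data.frame(
  source_table           = c("depthseries", "depthseries"),
  source_id              = c("header", "entry"),
  source_header_or_entry = c("loi", "Loss on Ignition"),
  target_vocabulary      = c("fraction_organic_matter", "fraction_organic_matter"),
  target_type            = c("value", "method"),
  stringsAsFactors = FALSE
)
```

A GUI could then be a front end that appends rows to this table as users map their vocabulary to ours.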
I like the idea of a full-on machine learning project, but I think that's something to tuck away for a future proposal partnering with some ML/natural language processing folks. I suspect there is a way to do a partial string match to help populate a drop-down box for a GUI, which is more in line with what you are thinking of, @HolmquistJ.
I developed a very rough, incomplete prototype of what this might look like as a curation function. It does basically everything listed in @HolmquistJ's step 1 as described above. See "scripts/1_data_formatting/experimental_curation_functions.R".
@HolmquistJ Is this still something you'd like to implement?
This workflow, as I understand it, was pitched to me by @ktoddbrown at a dinner at AGU. I think we should try to implement it here, since we may eventually need to transition a lot of our work from full-time coding specialists to part-time general technicians.
All attribute names and variable names should be stored in a big table that our hook scripts interact with: adding synonyms for our controlled attribute and variable names, asking the data entry person to confirm the matches the function thinks it has found, then making the conversions to the standard vocabulary.
For example, imagine a function, matchCcOntologies(listOfNewDataTables, listOfRecognizedVocabulary) { ... code stuff ... }
listOfNewDataTables would be a list of the data tables we are bringing in to the synthesis.
listOfRecognizedVocabulary would be a list of all of our controlled attribute and variable names, as well as synonyms we have encountered from previous inputs.
Step 1: The function would check all of the column and variable names in the list of new data tables against the listOfRecognizedVocabulary
- If the new data table had a former entry, for example "loi", that was already recognized, the function would ask you to approve it.
- If it is a close match using text recognition, the function would ask you to approve it, then add the new spelling as a synonym and the data source to the master table.
- If it needs to be defined, the function would ask you to define it, then add that as a synonym and the data source to the master table.
- The last option would be simply noting that we don't control for the variable and making sure it is defined in the submitter's metadata.
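The step 1 logic above might be sketched roughly like this. This is not the prototype in the repo; the function body, the interactive `readline()` prompts, and the `max.distance` threshold are all assumptions for illustration:

```r
# Rough interactive sketch of step 1; names follow the discussion above,
# none of this is a final API.
matchCcOntologies <- function(listOfNewDataTables, listOfRecognizedVocabulary) {
  for (tbl in listOfNewDataTables) {
    for (nm in names(tbl)) {
      if (nm %in% listOfRecognizedVocabulary) next  # already recognized
      close <- listOfRecognizedVocabulary[
        agrepl(nm, listOfRecognizedVocabulary, max.distance = 0.25)
      ]
      if (length(close) > 0) {
        # Close match: ask the user to approve the synonym
        ans <- readline(paste0("Is '", nm, "' a synonym of '", close[1], "'? (y/n) "))
        if (tolower(ans) == "y") {
          listOfRecognizedVocabulary <- c(listOfRecognizedVocabulary, nm)
        }
      } else {
        # No match: needs a definition, or must be covered by submitter metadata
        message("'", nm, "' needs a definition or must be defined in the submitter's metadata.")
      }
    }
  }
  listOfRecognizedVocabulary
}
```

A real version would also record the data source alongside each approved synonym in the master table.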
Step 2: After iterating through all attribute names and categorical variables, the script would convert the user's attribute and variable names to the CCRCN controlled vocab.
Step 3: An independent check to make sure all attributes and variables are either defined by our controlled vocab or defined in the user's metadata.
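Steps 2 and 3 might reduce to a rename plus a coverage check, something like this sketch. The synonym map, controlled list, and helper names are made-up examples:

```r
# Hypothetical synonym map: submitted name -> CCRCN controlled name
synonyms   <- c(loi = "fraction_organic_matter")
controlled <- c("fraction_organic_matter", "dry_bulk_density")

# Step 2: convert the user's attribute names to the controlled vocab
standardize_names <- function(df, synonyms) {
  hits <- names(df) %in% names(synonyms)
  names(df)[hits] <- synonyms[names(df)[hits]]
  df
}

# Step 3: anything left over must be defined in the submitter's metadata
check_coverage <- function(df, controlled, metadata_attributes) {
  setdiff(names(df), c(controlled, metadata_attributes))  # empty if all covered
}

df <- standardize_names(data.frame(loi = c(0.12, 0.30)), synonyms)
check_coverage(df, controlled, metadata_attributes = character(0))
#> character(0)
```

A non-empty result from the check would flag attributes that are neither controlled nor documented.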