cidgoh / DataHarmonizer

A standardized browser-based spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 and Monkeypox sampling data. This project, created by the Centre for Infectious Disease Genomics and One Health (CIDGOH), at Simon Fraser University, is now an open-source collaboration with contributions from the National Microbiome Data Collaborative (NMDC), the LinkML development team, and others.
MIT License

Adding term normalization / ontology lookup feature to DataHarmonizer #406

Open ddooley opened 1 year ago

ddooley commented 1 year ago

We want functionality, tied to the LinkML specification for a field, that restricts the field to terms selected from one or more ontology branches, or to terms cherry-picked from them. Beyond this, we also need the ability, while editing a cell, to dynamically look up a list of closely related terms so that a user can normalize free text (containing one or more terms) into a list of ontology IDs.
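To make the requested behaviour concrete, here is a minimal sketch of the "free text in, list of ontology IDs out" step. It is not DataHarmonizer, LexMapr, or OAK code: the `TERM_INDEX` dictionary, the `EX:` identifiers, and the function names `suggest_terms` and `normalize_cell` are all hypothetical, and fuzzy matching with Python's standard `difflib` merely stands in for whatever matcher is eventually chosen.

```python
import difflib
import re

# Hypothetical label -> ontology ID index. A real index would be built from the
# ontology branches or cherry-picked terms that a LinkML field allows.
TERM_INDEX = {
    "nasopharyngeal swab": "EX:0000001",
    "oropharyngeal swab": "EX:0000002",
    "saliva": "EX:0000003",
    "blood": "EX:0000004",
}


def suggest_terms(fragment, index=TERM_INDEX, limit=3, cutoff=0.6):
    """Return (label, id) pairs whose labels are close to the fragment, best first."""
    key = fragment.strip().lower()
    labels = difflib.get_close_matches(key, list(index), n=limit, cutoff=cutoff)
    return [(label, index[label]) for label in labels]


def normalize_cell(text, index=TERM_INDEX):
    """Split free text on ';', ',' or '/' and map each piece to its closest term ID.

    Pieces with no sufficiently close label are reported as None so that a
    user can resolve them by hand.
    """
    ids = []
    for piece in re.split(r"[;,/]", text):
        if not piece.strip():
            continue
        matches = suggest_terms(piece, index, limit=1)
        ids.append(matches[0][1] if matches else None)
    return ids
```

In an editor, `suggest_terms` would feed the pick list shown while a cell is being edited, and `normalize_cell` would run once the user accepts the suggestions.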

LexMapr was our older software for doing this, and with a revamp it could perhaps be continued. There are also the OAK "annotate" and "lexmatch" commands we could try. Notes on these commands are in the OBO Foundry Slack group: https://obo-communitygroup.slack.com/archives/C03D93DEALA/p1692891065256629?thread_ts=1692889350.359419&cid=C03D93DEALA

Chris Mungall: Technically annotate is a bit more general, as it finds term matches within whole text. But if you pass --matches-whole-text it essentially does lexmatch as a degenerate case.
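The relationship described here, span matching in general versus whole-text matching as a special case, can be illustrated with a small standalone sketch. This is not how OAK implements `annotate` or `lexmatch`: the `LEXICON` contents, the `EX:` identifiers, and the `annotate` function below are hypothetical, and only the `matches_whole_text` name is borrowed from the flag mentioned above.

```python
import re

# Hypothetical label -> ID lexicon, for illustration only.
LEXICON = {"fever": "EX:0000010", "dry cough": "EX:0000011"}


def annotate(text, lexicon=LEXICON, matches_whole_text=False):
    """Find lexicon labels in text and return (start, end, label, id) tuples.

    With matches_whole_text=True, only a hit that covers the entire
    (whitespace-stripped) text is kept, so the call reduces to a plain
    label-to-label lookup.
    """
    hits = []
    for label, term_id in lexicon.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), label, term_id))
    if matches_whole_text:
        hits = [h for h in hits if text[h[0]:h[1]] == text.strip()]
    return sorted(hits)
```

Calling `annotate("Patient had fever and a dry cough")` finds both terms inside the sentence, while the same call with `matches_whole_text=True` finds nothing, because neither label spans the whole string.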

Chris Mungall: But note the output structures are different. Using lexmatch gives you SSSOM (by default), which is obviously very well designed and thought through. I have not been able to gather interest in a profile of SSSOM for lexical matches, but that was before we had the ISB workshop talking about matching literals: https://github.com/mapping-commons/sssom/issues/155
[#155 is there interest in an analog of SSSOM for NER/CR/text annotation?](https://github.com/mapping-commons/sssom/issues/155)
There are a number of different tools that perform NER on text, from bioportal/zooma through to scispacy and [@cthoyt](https://github.com/cthoyt)'s Gilda (https://www.biorxiv.org/content/10.1101/2021.09.10.459803v1.full).
These all vary in their output but are some variant of text span location and ID plus metadata for the matched concept.
While the entity normalization step of NER could be seen as term matching, I think this is out of scope for SSSOM. However, I think it would make sense to have a SSSOM analog, where the SSSOM metadata element URIs are reused.
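As a rough illustration of the common shape described above (text span location, concept ID, and metadata), a record type might look like the following. The class name `SpanAnnotation`, its field names, and the `to_row` helper are illustrative assumptions chosen to echo SSSOM-style naming; they are not copied from SSSOM, from the OAK text annotator datamodel, or from any of the tools listed.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class SpanAnnotation:
    """One matched concept: where it was found, what it is, and extra metadata."""

    start: int              # offset of the first matched character
    end: int                # offset just past the last matched character
    match_string: str       # the exact text that matched
    object_id: str          # identifier (CURIE) of the matched concept
    object_label: str = ""  # human-readable label of the matched concept
    metadata: dict = field(default_factory=dict)  # tool-specific extras, e.g. a score


def to_row(annotation):
    """Flatten an annotation into a single dict, e.g. for a TSV writer."""
    row = asdict(annotation)
    row.update(row.pop("metadata"))
    return row
```

Keeping tool-specific values in a separate `metadata` mapping lets the core span and ID fields stay identical across tools, which is the property a shared SSSOM-like format would need.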
An OAK driven app: https://incatools.github.io/ontology-access-kit/datamodels/text-annotator/index.html  
[https://github.com/…](https://github.com/INCATools/ontology-access-kit/blob/main/src/oaklib/datamodels/text_annotator.yaml) 
cpauvert commented 12 months ago

Hi @ddooley, thanks a lot for the great work in crafting and maintaining DataHarmonizer. I fully support such a feature! We showcased DataHarmonizer in a workshop about metadata and ontologies aimed at complete beginners, and most of them asked whether an ontology lookup from within DataHarmonizer was possible. Best,