clulab / bioresources

Data resources from the biomedical domain
Apache License 2.0

Update and improve cell type groundings #54

Closed bgyori closed 3 years ago

bgyori commented 3 years ago

This PR implements a script to keep groundings from the Cell Ontology (CellOntology.tsv.gz) reproducibly up to date, and updates these groundings to their latest state. It then adds a set of manually curated synonyms for CD4+/CD8+ T cells, which are mentioned very often in text and have caused ubiquitous NER issues (they were cropped to just "CD4" or "CD8" and recognized as genes). One complication is that CO has no exact match for terms such as "CD4+ T cell", so a close-enough match is added as a grounding to CellOntology.tsv.gz. Exact matches to MeSH are then added to the NER-Grounding-Override.tsv.gz overrides file to re-ground these terms.
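
For illustration only, here is a minimal sketch of the override idea described above: a base grounding table is consulted only when the overrides table has no entry for the text. It assumes a two-column, tab-separated layout (text, then grounding ID), which may not match the actual layout of CellOntology.tsv.gz or NER-Grounding-Override.tsv.gz. The helper names (`make_tsv_gz`, `read_tsv_gz`, `ground`) and the placeholder IDs are hypothetical and are not taken from this PR or from the reader that consumes these files.

```python
import gzip
import io


def make_tsv_gz(rows):
    """Build gzipped TSV bytes in memory from (text, grounding_id) pairs."""
    payload = "".join(f"{text}\t{grounding}\n" for text, grounding in rows)
    return gzip.compress(payload.encode("utf-8"))


def read_tsv_gz(data):
    """Parse gzipped TSV bytes into a {text: grounding_id} dict.

    Assumes two tab-separated columns per line (a hypothetical layout).
    """
    mapping = {}
    with gzip.open(io.BytesIO(data), mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            text, grounding = line.split("\t")[:2]
            mapping[text] = grounding
    return mapping


def ground(text, base, overrides):
    """Return the override grounding if one exists, else the base grounding."""
    if text in overrides:
        return overrides[text]
    return base.get(text)


# Placeholder IDs: these are NOT real Cell Ontology or MeSH identifiers.
base = read_tsv_gz(make_tsv_gz([
    ("CD4+ T cell", "CL:EXAMPLE_CLOSE_MATCH"),
    ("T cell", "CL:EXAMPLE_T"),
]))
overrides = read_tsv_gz(make_tsv_gz([
    ("CD4+ T cell", "MESH:EXAMPLE_ID"),
]))

cd4_grounding = ground("CD4+ T cell", base, overrides)
t_cell_grounding = ground("T cell", base, overrides)
```

In this sketch the override entry takes precedence for "CD4+ T cell", while "T cell" falls back to the base table, mirroring the re-grounding step described in the PR.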