feat: add UMLS terminology

Vincent-Maladiere commented 2 years ago

Description

At the moment, EDS-NLP can only extract entities and normalise them to ATC (via ROMEDI) and ICD10. Because UMLS is an international resource that brings together many terminologies (including SNOMED CT) in many languages, integrating it would greatly benefit the library and its users, allowing them to:

automatically categorise texts in a corpus according to different concept IDs
perform entity searching
create processing rules (if ent.concept_id is a child of CUIXXXXX, then ...)
do pre-annotation of corpora

What changes

Add a method to download the UMLS data and create a CUI-to-synonym dictionary, which is also saved locally using pystow.
Add a TerminologyMatcher in the same fashion as CIM10 and its corresponding entrypoint.
Add tests in tests/pipelines/ner/test_umls.py
Edit documentation (docs/pipelines/index.md and docs/pipelines/ner/umls.md)
Add new dependencies: umls_downloader, tqdm
Edit changelog
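
For reviewers unfamiliar with the UMLS file layout, here is a minimal sketch of how a CUI-to-synonym dictionary can be built. It is not the code in this PR: it assumes the terms come from rows in the pipe-delimited MRCONSO.RRF format (CUI in column 0, language in column 1, source vocabulary in column 11, term string in column 14), and `build_cui_synonyms` and the sample rows are hypothetical names and made-up data used only for illustration. Downloading the archive and caching the result with pystow are left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set

# Column positions in a pipe-delimited MRCONSO.RRF row.
CUI_COL, LANG_COL, SOURCE_COL, TERM_COL = 0, 1, 11, 14


def build_cui_synonyms(
    rows: Iterable[str],
    languages: Optional[Set[str]] = None,
    sources: Optional[Set[str]] = None,
) -> Dict[str, List[str]]:
    """Group term strings by CUI (hypothetical helper, not the PR's code)."""
    synonyms: Dict[str, List[str]] = defaultdict(list)
    seen = set()
    for row in rows:
        fields = row.rstrip("\n").split("|")
        if len(fields) <= TERM_COL:
            continue  # skip malformed or truncated rows
        cui, lang = fields[CUI_COL], fields[LANG_COL]
        source, term = fields[SOURCE_COL], fields[TERM_COL]
        if languages is not None and lang not in languages:
            continue
        if sources is not None and source not in sources:
            continue
        key = (cui, term.lower())
        if key in seen:
            continue  # drop case-insensitive duplicates within a CUI
        seen.add(key)
        synonyms[cui].append(term)
    return dict(synonyms)


# Synthetic rows in the MRCONSO.RRF layout; identifiers are placeholders.
sample_rows = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC1|PT|X01|Example disease|0|N||",
    "C0000001|ENG|S|L0000002|PF|S0000002|N|A0000002||||SRC2|SY|X01|example disease|0|N||",
    "C0000001|FRE|P|L0000003|PF|S0000003|Y|A0000003||||SRC1|PT|X01|Maladie exemple|0|N||",
    "C0000002|ENG|P|L0000004|PF|S0000004|Y|A0000004||||SRC1|PT|X02|Other condition|0|N||",
]

synonyms = build_cui_synonyms(sample_rows, languages={"ENG"})
```

Filtering on language and source vocabulary while reading keeps the resulting dictionary small, which matters because the full file contains millions of rows.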

Checklist

[ ] If this PR is a bug fix, the bug is documented in the test suite.
[x] Changes were documented in the changelog (pending section).
[x] If necessary, changes were made to the documentation (e.g. a new pipeline).

Vincent-Maladiere commented 1 year ago

Hey @percevalw, are there some steps left to do before merging?

percevalw commented 1 year ago

Your work was merged in #165 🎉 Thank you again!
