allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0
1.7k stars 227 forks

Update UMLS to latest release #460

Closed rxk2rxk closed 1 year ago

rxk2rxk commented 1 year ago

Hi,

I'm parsing text from clinicaltrials.gov (Trial ID NCT04837209) using scispaCy plus language model 'en_core_sci_md' and seeing 'Dostarlimab' being linked to UMLS concept C1621793 which is a bird (a Starling).

It looks like this is the result of fuzzy matching, since both words have a substring ('starli') in common, as evidenced by the low match probability (0.5594).
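A rough way to see the overlap described above is to compare the character trigrams of the two words. This is only an illustrative sketch: it uses a plain Jaccard ratio over trigram sets, which is an assumption made for the example and not scispaCy's actual scoring, so the number it prints is not expected to match the 0.5594 reported by the linker.

```python
def char_trigrams(word: str) -> set:
    """Return the set of lowercase character 3-grams in a word."""
    lowered = word.lower()
    return {lowered[i:i + 3] for i in range(len(lowered) - 2)}


def jaccard(a: set, b: set) -> float:
    """Share of trigrams common to both sets (illustrative metric only)."""
    return len(a & b) / len(a | b)


drug = char_trigrams("Dostarlimab")
bird = char_trigrams("Starling")

shared = sorted(drug & bird)
print(shared)                         # ['arl', 'rli', 'sta', 'tar']
print(round(jaccard(drug, bird), 3))  # 0.364
```

Four of the six trigrams in "Starling" also occur in "Dostarlimab", which is the kind of surface similarity a character-based matcher can pick up when the correct concept is missing from the knowledge base.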

However, the biologic drug Dostarlimab is in the latest UMLS release (2022AB) as the concept C5242455. Is scispaCy linking to an older version of UMLS?

Thanks, Ron

dakinggg commented 1 year ago

Hi, yes. I believe scispacy is still using the 2020 release of UMLS. Unfortunately I no longer have access to UMLS, so I can't actually update it. That being said, if you do have access, the scripts to generate the artifacts used for entity linking are here: https://github.com/allenai/scispacy/blob/main/scripts/export_umls_json.py and https://github.com/allenai/scispacy/blob/main/scripts/create_linker.py

rxk2rxk commented 1 year ago

Hi Daniel,

Thanks for getting back to me quickly. You should be able to access the latest UMLS release (2022AB) as it doesn’t require user registration anymore:

https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html

It would be good to subset the Metathesaurus (i.e., a few key taxonomies in English) as the current implementation makes unnecessary demands on memory.

Best, Ron

dakinggg commented 1 year ago

Ah I didn't know it no longer required registration! I'm not sure if/when I would get to updating to the latest UMLS, but I appreciate the info and will try to get to it at some point! As for subsets, we do have a few subsets available. See here in the readme: https://github.com/allenai/scispacy#entitylinker

rxk2rxk commented 1 year ago

Thanks, please let me know if/when you get to updating the UMLS database used by SciSpaCy. Let me know if you need help with this.

Re: subsetting, it would be helpful to be able to further subset the UMLS subset on the fly (during load) to exclude certain sources (e.g., NCBI).
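One way to approximate this today is to filter the Metathesaurus rows before building the knowledge base. The sketch below is not part of scispaCy: `filter_mrconso` and `make_row` are hypothetical helpers, and the sample rows are synthetic (only the two CUIs come from this thread; the source abbreviations and strings are assigned purely for illustration). It relies on the standard pipe-delimited MRCONSO.RRF layout, where LAT is the second column, SAB the twelfth, and STR the fifteenth.

```python
# Column positions in MRCONSO.RRF (pipe-delimited, zero-based).
CUI, LAT, SAB, STR = 0, 1, 11, 14


def filter_mrconso(lines, exclude_sources=frozenset({"NCBI"}), language="ENG"):
    """Yield (cui, source, string) for rows in the wanted language
    whose source abbreviation is not in exclude_sources."""
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= STR:
            continue  # skip malformed or truncated rows
        if fields[LAT] != language or fields[SAB] in exclude_sources:
            continue
        yield fields[CUI], fields[SAB], fields[STR]


def make_row(cui, lat, sab, string):
    """Build a synthetic 18-column row; unused columns are left empty."""
    fields = [""] * 18
    fields[CUI], fields[LAT], fields[SAB], fields[STR] = cui, lat, sab, string
    return "|".join(fields) + "|\n"


# Synthetic rows for illustration only, not real UMLS content.
sample = [
    make_row("C5242455", "ENG", "RXNORM", "dostarlimab"),
    make_row("C1621793", "ENG", "NCBI", "starling"),
    make_row("C5242455", "SPA", "RXNORM", "dostarlimab"),
]

kept = list(filter_mrconso(sample))
print(kept)  # [('C5242455', 'RXNORM', 'dostarlimab')]
```

Because the function is a generator over lines, it can be pointed at an open MRCONSO.RRF file handle without reading the whole file into memory.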


rahulmohan commented 1 year ago

@rxk2rxk I am also looking to update the UMLS NER models to use the latest data. Just checking if you made progress with this yet?

rxk2rxk commented 1 year ago

Hi @rahulmohan I haven't tried yet as I'm maxed out at work. I'll update the issue if I make progress on this.

nanthony007 commented 1 year ago

I'd be willing to do this and submit a PR for it. I'm not sure if it is as simple as running scripts/create_linker.py on the MRCONSO.rrf file or if I'd need to download the entire UMLS and run scripts/export_umls_json.py. I'm also not sure if I could include the data for those files in the PR due to size, or if I'd need to retrain and publish the models themselves, which I am sure I don't have permissions for...

Going forward, I think making this process as simple as possible should be a requirement, so that users can easily keep the primary (UMLS) knowledge base up to date no matter what the maintainers' workload is.

The first paragraph here raises a general question I had: is the UMLS data used only for the NER, or is it a larger part of the model? I.e., if I created my own EntityLinker using 2022AB UMLS, would that solve this "outdated" issue?

rxk2rxk commented 1 year ago

I was able to (partially) build the UMLS knowledge base and linker on my Mac by running the following commands:

    cd ~/Documents/GitHub/scispacy
    mkdir output
    cd scripts
    python3 export_umls_json.py --meta_path ~/Documents/Taxonomy/UMLS/2022AB/META --output_path ../output/umls_kb.jsonl
    python3 create_linker.py --kb_path ../output/umls_kb.jsonl --output_path ../output/

The export script successfully created the KB (in JSONL format), but the linker script, which takes a very long time (> 2 hours), was killed by the OS halfway through due to excessive memory pressure.
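Since the linker step is the expensive one, it can be worth sanity-checking the exported JSONL first. The following is a minimal sketch that streams the file one line at a time, so memory use stays flat regardless of file size. It assumes each line is a JSON object with an `aliases` list, which is an assumption about the export format rather than something confirmed in this thread; `summarize_kb` is a hypothetical helper, and the two records are synthetic.

```python
import json
import os
import tempfile


def summarize_kb(path: str) -> tuple:
    """Stream a JSONL knowledge base and return (concept_count, alias_count)."""
    concepts = aliases = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            concepts += 1
            aliases += len(record.get("aliases", []))
    return concepts, aliases


# Synthetic two-record file, only to show the call; not real UMLS content.
records = [
    {"concept_id": "C0000001", "canonical_name": "example one", "aliases": ["a", "b"]},
    {"concept_id": "C0000002", "canonical_name": "example two", "aliases": ["c"]},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")

result = summarize_kb(tmp.name)
os.remove(tmp.name)
print(result)  # (2, 3)
```

The alias count gives a rough sense of how many strings the linker will have to index, which is what drives the memory and time cost of the second script.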

dakinggg commented 1 year ago

@nanthony007 Yes, if you create your own EntityLinker using your own set of paths (e.g. https://github.com/allenai/scispacy/blob/4f9ba0931d216ddfb9a8f01334d76cfb662738ae/scispacy/candidate_generation.py#L43-L48), it should work.
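A sketch of what that could look like, following the pattern in the scispacy README for custom linkers. The artifact file names are assumed to be what create_linker.py writes (only concept_aliases.json is confirmed later in this thread), and `local_artifact_paths` and `register_local_linker` are hypothetical helper names introduced for the example. The scispacy imports are done lazily inside the function so the path helper can be tried without scispacy installed.

```python
from pathlib import Path

# Assumed output names of create_linker.py; adjust if your files differ.
ARTIFACTS = {
    "ann_index": "nmslib_index.bin",
    "tfidf_vectorizer": "tfidf_vectorizer.joblib",
    "tfidf_vectors": "tfidf_vectors_sparse.npz",
    "concept_aliases_list": "concept_aliases.json",
}


def local_artifact_paths(output_dir: str) -> dict:
    """Map LinkerPaths field names to plain filesystem paths (no URL scheme)."""
    base = Path(output_dir).expanduser()
    return {field: str(base / name) for field, name in ARTIFACTS.items()}


def register_local_linker(name: str, output_dir: str, kb_path: str) -> None:
    """Register locally built artifacts under a new linker name."""
    from scispacy.candidate_generation import (
        DEFAULT_KNOWLEDGE_BASES,
        DEFAULT_PATHS,
        LinkerPaths,
    )
    from scispacy.linking_utils import KnowledgeBase

    class LocalKnowledgeBase(KnowledgeBase):
        def __init__(self, file_path: str = kb_path):
            super().__init__(file_path)

    DEFAULT_PATHS[name] = LinkerPaths(**local_artifact_paths(output_dir))
    DEFAULT_KNOWLEDGE_BASES[name] = LocalKnowledgeBase


paths = local_artifact_paths("~/scispacy/output")
print(sorted(paths))
```

After calling `register_local_linker("umls2022ab", ...)` with your own directory and KB file, the new name should be usable wherever a linker name is accepted, for example `nlp.add_pipe("scispacy_linker", config={"linker_name": "umls2022ab"})`. The name `umls2022ab` is just a placeholder.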

@rxk2rxk Let me try running those scripts on my machine...I do recall them being pretty resource intensive 😅

dakinggg commented 1 year ago

It looks to me that UMLS does still require registration to download the full subset. In scispacy we use sections 0, 1, 2, and 9.

dakinggg commented 1 year ago

Ok I have the latest umls files, will try to get the new linkers done soon.

rxk2rxk commented 1 year ago

Hi @dakinggg - Re: "Yes, if you create your own EntityLinker using your own set of paths ... it should work", I tried this using locally built files, e.g.

concept_aliases_list="file:///Users/rkatriel/python/scispacy/output/concept_aliases.json", # noqa

and made a minor change to file_cache.py on line 34 (added "file")

if parsed.scheme in ("http", "https", "file")

to make it work, but now - not surprisingly - I'm getting an error from Python's "requests" package (in sessions.py)

requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///Users/rkatriel/python/scispacy/output/umls_kb.jsonl'

It looks like additional changes would be needed to make SciSpacy work with a locally stored UMLS knowledge base.

dakinggg commented 1 year ago

Local files should work natively with file_cache.py, because of the elif branch there:

    elif os.path.exists(url_or_filename):
        # File, and it exists.
        return url_or_filename

rxk2rxk commented 1 year ago

Thanks Daniel! I got past this issue by dropping the "file:///" prefix from the paths (which was not immediately obvious).
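For anyone hitting the same thing, here is a small standalone sketch of why the prefix matters. `classify_location` is a simplified stand-in for the branch quoted above, not the actual file_cache.py code, and `strip_file_scheme` is a hypothetical helper: a string starting with "file://" is neither an http(s) URL nor an existing path on disk, so it falls through both checks.

```python
import os
import tempfile
from urllib.parse import urlparse
from urllib.request import url2pathname


def classify_location(location: str) -> str:
    """Simplified stand-in: http(s) URLs are remote, existing paths are local."""
    if urlparse(location).scheme in ("http", "https"):
        return "remote"
    if os.path.exists(location):
        return "local"
    return "unresolved"


def strip_file_scheme(location: str) -> str:
    """Turn a file:// URL into a plain path; leave anything else untouched."""
    parsed = urlparse(location)
    if parsed.scheme == "file":
        return url2pathname(parsed.path)
    return location


with tempfile.NamedTemporaryFile(suffix=".json") as tmp:
    plain_path = tmp.name
    file_url = "file://" + plain_path
    url_status = classify_location(file_url)
    path_status = classify_location(plain_path)
    round_trip = strip_file_scheme(file_url)

print(url_status)   # unresolved
print(path_status)  # local
```

Passing the plain path, or running paths through something like `strip_file_scheme` first, avoids the need to patch the scheme check at all.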

dakinggg commented 1 year ago

This is now done in the latest release :)