Closed rxk2rxk closed 1 year ago
Hi, yes. I believe scispaCy is still using the 2020 release of UMLS. Unfortunately I no longer have access to UMLS, so I can't actually update it. That being said, if you do have access, the scripts to generate the artifacts used for entity linking are here: https://github.com/allenai/scispacy/blob/main/scripts/export_umls_json.py and https://github.com/allenai/scispacy/blob/main/scripts/create_linker.py
Hi Daniel,
Thanks for getting back to me quickly. You should be able to access the latest UMLS release (2022AB) as it doesn’t require user registration anymore:
https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html
It would be good to subset the Metathesaurus (i.e., a few key taxonomies in English) as the current implementation makes unnecessary demands on memory.
Best, Ron
Ah I didn't know it no longer required registration! I'm not sure if/when I would get to updating to the latest UMLS, but I appreciate the info and will try to get to it at some point! As for subsets, we do have a few subsets available. See here in the readme: https://github.com/allenai/scispacy#entitylinker
Thanks, please let me know if/when you get to updating the UMLS database used by scispaCy. Let me know if you need help with this.
Re: subsetting, it would be helpful to be able to further subset the UMLS subset on the fly (during load) to exclude certain sources (e.g., NCBI).
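For anyone who wants to try this before building the linker, here is a minimal sketch of source-level filtering applied to MRCONSO.RRF rows. It is not part of scispaCy; the function name `filter_mrconso_rows` and the sample rows are made up for illustration. It only relies on the documented MRCONSO column order, where LAT (language) is the second field and SAB (source abbreviation) is the twelfth.

```python
from typing import Iterable, Iterator, Set

# Positions in a pipe-delimited MRCONSO.RRF row (0-based).
LAT_COLUMN = 1   # language of the term, e.g. "ENG"
SAB_COLUMN = 11  # source abbreviation, e.g. "MSH", "NCBI"


def filter_mrconso_rows(
    lines: Iterable[str],
    excluded_sources: Set[str],
    language: str = "ENG",
) -> Iterator[str]:
    """Yield rows in the given language whose source is not excluded.

    Hypothetical helper: rows are streamed one at a time, so the whole
    file never has to sit in memory.
    """
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= SAB_COLUMN:
            continue  # skip blank or malformed rows
        if fields[LAT_COLUMN] != language:
            continue
        if fields[SAB_COLUMN] in excluded_sources:
            continue
        yield line


# Made-up rows in MRCONSO layout, for illustration only.
sample_rows = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||MSH|MH|D000001|Example term|0|N||",
    "C0000002|ENG|P|L0000002|PF|S0000002|Y|A0000002||||NCBI|SCN|0002|Example organism|0|N||",
    "C0000003|SPA|P|L0000003|PF|S0000003|Y|A0000003||||MSH|MH|D000003|Termino de ejemplo|0|N||",
]
kept_rows = list(filter_mrconso_rows(sample_rows, excluded_sources={"NCBI"}))
```

Writing the kept rows to a smaller MRCONSO.RRF and pointing the export script at that copy would be one way to shrink the knowledge base, though I have not checked how the other RRF files would need to be trimmed to match.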
@rxk2rxk I am also looking to update the UMLS NER models to use the latest data. Just checking if you made progress with this yet?
Hi @rahulmohan I haven't tried yet as I'm maxed out at work. I'll update the issue if I make progress on this.
I'd be willing to do this and submit a PR for it. Not sure if it is as simple as running scripts/create_linker.py on the MRCONSO.rrf file or if I'd need to download the entire UMLS and run scripts/export_umls_json.py. Also not sure if I could include the data for those files in the PR due to size, or if I'd need to retrain and publish the models themselves, which I am sure I don't have permissions for...
I think going forward, making this process as simple as possible should be a requirement, so that users can easily keep the primary (UMLS) knowledge base up to date no matter how busy the maintainers are.
The first paragraph here raises a general question I had: is the UMLS data used only for the NER, or is it a larger part of the model? I.e., if I created my own EntityLinker using 2022AB UMLS, would that solve this "outdated" issue?
I was able to (partially) build the UMLS knowledge base and linker on my Mac by running the following commands:
cd ~/Documents/GitHub/scispacy
mkdir output
cd scripts
python3 export_umls_json.py --meta_path ~/Documents/Taxonomy/UMLS/2022AB/META --output_path ../output/umls_kb.jsonl
python3 create_linker.py --kb_path ../output/umls_kb.jsonl --output_path ../output/
The export script successfully created the KB (in JSONL format), but the linker script, which takes a very long time (> 2 hours), was killed by the OS halfway through due to excessive memory pressure.
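Since the export step succeeded, one cheap sanity check is to confirm that a given concept made it into the JSONL file before spending hours on the linker. Below is a minimal sketch that streams the file line by line instead of loading it. The helper name `find_concept` is hypothetical, and the field name `concept_id` is an assumption about the export format, so adjust it if the records use a different key.

```python
import json
from typing import Iterable, Optional


def find_concept(
    lines: Iterable[str],
    concept_id: str,
    id_field: str = "concept_id",  # assumed key name in the exported records
) -> Optional[dict]:
    """Return the first JSONL record whose id matches, or None if absent.

    Reads one line at a time, so memory use stays flat regardless of
    how large the knowledge base file is.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get(id_field) == concept_id:
            return record
    return None


# Usage against the file produced by the export step (path from the commands above):
#
#     with open("../output/umls_kb.jsonl", encoding="utf-8") as kb_file:
#         print(find_concept(kb_file, "C5242455"))
```

If this returns None for C5242455 on a 2022AB export, the concept was dropped during export rather than by the linker.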
@nanthony007 Yes, if you create your own EntityLinker using your own set of paths (e.g. https://github.com/allenai/scispacy/blob/4f9ba0931d216ddfb9a8f01334d76cfb662738ae/scispacy/candidate_generation.py#L43-L48), it should work.
@rxk2rxk Let me try running those scripts on my machine...I do recall them being pretty resource intensive 😅
It looks to me that UMLS does still require registration to download the full subset. In scispacy we use sections 0, 1, 2, and 9.
Ok I have the latest umls files, will try to get the new linkers done soon.
Hi @dakinggg - Re: "Yes, if you create your own EntityLinker using your own set of paths ... it should work", I tried this using locally built files, e.g.
concept_aliases_list="file:///Users/rkatriel/python/scispacy/output/concept_aliases.json", # noqa
and made a minor change to file_cache.py on line 34 (added "file")
if parsed.scheme in ("http", "https", "file"):
to make it work, but now - not surprisingly - I'm getting an error from Python's "requests" package (in sessions.py)
requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///Users/rkatriel/python/scispacy/output/umls_kb.jsonl'
It looks like additional changes would be needed to make SciSpacy work with a locally stored UMLS knowledge base.
Local files should work natively with file_cache.py, because of the elif branch there:
elif os.path.exists(url_or_filename):
    # File, and it exists.
    return url_or_filename
Thanks Daniel! I got past this issue by dropping off the "file:///" prefix from the URLs (not obvious immediately).
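For anyone else hitting this, a small sketch of the workaround: convert any file:// URL to a plain filesystem path before handing it to scispaCy, and leave everything else untouched. The helper name `to_local_path` is hypothetical and not part of scispaCy, and the conversion is only meant for POSIX-style paths like the ones in this thread.

```python
from urllib.parse import unquote, urlparse


def to_local_path(url_or_filename: str) -> str:
    """Strip a file:// scheme so the result is a plain local path.

    http(s) URLs and ordinary paths are returned unchanged.
    """
    parsed = urlparse(url_or_filename)
    if parsed.scheme == "file":
        # For file:///Users/..., parsed.path is already "/Users/...";
        # unquote turns escapes such as %20 back into literal characters.
        return unquote(parsed.path)
    return url_or_filename


kb_path = to_local_path(
    "file:///Users/rkatriel/python/scispacy/output/umls_kb.jsonl"
)
```

The resulting path can then be used wherever the file:// URL was passed, so that the os.path.exists check quoted above picks it up.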
This is now done in the latest release :)
Hi,
I'm parsing text from clinicaltrials.gov (Trial ID NCT04837209) using scispaCy plus language model 'en_core_sci_md' and seeing 'Dostarlimab' being linked to UMLS concept C1621793 which is a bird (a Starling).
It looks like this is the result of fuzzy matching - both words have a substring ('starli') in common - as evidenced by the low match probability (0.5594).
However, the biologic drug Dostarlimab is in the latest UMLS release (2022AB) as the concept C5242455. Is scispaCy linking to an older version of UMLS?
Thanks, Ron
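To illustrate the fuzzy-matching explanation, here is a rough sketch of how much character-level overlap the two names share. As far as I understand, scispaCy's candidate generator compares character n-gram representations of mentions and aliases, but this is a simplified stand-in (plain trigram sets with Jaccard similarity), not the library's actual scoring, so it will not reproduce the 0.5594 figure. The function names are made up for illustration.

```python
from typing import Set


def char_trigrams(text: str) -> Set[str]:
    """Return the set of lowercase character trigrams in a string."""
    normalized = text.lower()
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


drug = char_trigrams("Dostarlimab")
bird = char_trigrams("Starling")

# Trigrams the two names have in common; these drive the spurious match.
shared = sorted(drug & bird)
similarity = jaccard(drug, bird)
print(shared, round(similarity, 3))
```

When the correct concept is missing from the knowledge base, a partial overlap like this can be the best candidate available, which would be consistent with the linker falling back to the bird.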