Georgetown-IR-Lab / QuickUMLS

System for Medical Concept Extraction and Linking
MIT License
376 stars 95 forks source link

Make this part of a Spacy pipeline #40

Closed ldorigo closed 4 years ago

ldorigo commented 5 years ago

Hi there.

Since you already use Spacy internally, I am wondering if there would be some good way to make the named entity resolver into a Spacy pipeline component?

It would make it much easier to integrate it with other NLP tasks.

I am currently looking into it but I'm very new to Spacy, so if you have suggestions, please let me know.

soldni commented 5 years ago

Hey -- I haven't looked at spacy pipelines, and I feel this would be a big effort to integrate. To be honest, I don't think the dependency on spacy is very crucial for this package, as it is only used for POS tagging; if anything, I think it would be good to move away for it to some more lightweight parser and enhance throughput.

ldorigo commented 5 years ago

Actually, I've given a look into it (I am using both QuickUMLS and Spacy for a project and I needed both to interoperate), and it's much easier than I thought. The work is almost done, I need to polish it a bit and will submit a pull request later today. Then you can see if you think it's relevant to add it :-)

Regarding performance, from the profiling I've done, more than 99% of the time is spent on entity matching, so I don't think spacy introduces any noticeable delay.

soldni commented 5 years ago

Sounds good -- thank you for experimenting with this! Looking forward to it.

burgersmoke commented 5 years ago

I have some code working for this so that QuickUMLS is a custom component in a spaCy pipeline. I'll try to set up a pull request this week.

ljak commented 4 years ago

Any news from this issue ? Thanks

soldni commented 4 years ago

Hi @ljak!

I do not have any bandwidth to support this work, but I'd be more than happy to review and merge contributions from either @burgersmoke or @ldorigo.

-Luca

ljak commented 4 years ago

Same for me! It's been a long time already, that's why I'm asking. If it is something that needs to be done, I could help as well.

burgersmoke commented 4 years ago

OK, this is close. I've been working on cleaning this up tonight so that it can work within QuickUMLS 1.4. I'm hoping I can have a pull request by the end of the weekend.

burgersmoke commented 4 years ago

OK. I have a very early pull request ready. There are some arguments hard-coded which could be easily fixed, but this should give the overall idea so that this can be used as a component in a pipeline rather than as its own pipeline.

I've been using this in syndromic surveillance for the last 1.5+ years with a preprocessor used to normalize chief complaint text (e.g. "n/v/d c/o cp") into more standard lexical forms before this gets passed to QuickUMLS. We've even been experimenting with other components which run after QuickUMLS. We've got great visions, but I'm sorry it's taken so long.

I wish I could have done this earlier, but since I work in syndromic surveillance, I hope you will forgive me that COVID-19 got in the way before prepping this pull request.

I welcome all feedback as I know this can be improved.

burgersmoke commented 4 years ago

@ldorigo what do you think about the pull request that I put here? Is this close to your thinking? There are many ways to potentially do this, so I am open to any thoughts.

soldni commented 4 years ago

@burgersmoke,

This is awesome, thank you for putting the work! I'll set time aside to look at the PR during the weekend :)

-Luca

soldni commented 4 years ago

I had a chance to review changes, and left a couple of minor comments. Otherwise, it looks awesome!

soldni commented 4 years ago

Closing this issue as we are tracking this in a PR.