dwadden / dygiepp

Span-based system for named entity, relation, and event extraction.
MIT License
573 stars 120 forks source link

Adds spacy interface for dygiepp inference usage. #81

Closed e3oroush closed 2 years ago

e3oroush commented 2 years ago

Hi I wanted to have a simple, easy-to-use interface outside of the allennlp CLI for this brilliant library available.
So I developed this PR that creates a new pipe line to the spaCy library and gives you all the extracted entities and relations inside spacy output document.
For usage example, there is a notebook to see how it can be used and the final outcome looks like.

dwadden commented 2 years ago

Very cool, thanks for doing this! I'll take a look over the weekend.

e3oroush commented 2 years ago

I found a problem regarding version conflict of spaCy and allennlp. If one simply perform the pip install -r requirements.txt wouldn't work, because spaCy 3 and allennlp 1.1 have version conflict. But I was able to install spaCy==3.0.0 after running pip install -r requirments.txt, I will try to fix this issue over the weekend.

e3oroush commented 2 years ago

My last commit should resolve the problem.

dwadden commented 2 years ago

Thanks! The next two weeks are a bit hectic, but I'll review this after Dec 17 and let you know if I've got any questions.

dwadden commented 2 years ago

I just gave your Jupyter notebook a try, and it worked. Very cool! I'll merge. I'll also add something about this in the README. Is it OK if I instruct people to @ you for issues involving the spacy interface?

dwadden commented 2 years ago

Also - if you have time, could you submit a PR with a test confirming that the Spacy outputs match the outputs from the conventional DyGIE inference pipeline (modulo nested entities, which you've removed)?

e3oroush commented 2 years ago

For addressing me, yes please do so, I would gladly help.
For the test file, I'm not sure if I got your point correctly, what did you mean by modulo nested entities? Where did I remove what? :smile:

dwadden commented 2 years ago

OK, I'll update the README.

Haha sorry, I wasn't super-clear. I'm referring to this comment in your PR, where it appears that you remove overlapping entities. Because of this, I assume the spacy outputs won't always be identical to the outputs of the original DyGIE inference?

e3oroush commented 2 years ago

Aha, yes exactly. Because you are using span-based entity detection, it doesn't work natively with named entity module of spaCy. I will write a unit test to show this issue.
One quick question about your opinion, do you think if it's worth to define the detected entities as spaCy spans and store them in the Document object (e.g. doc._.span_ents) as the list of span based entities? Although it wouldn't be the official spaCy entity anymore, but we don't have to remove some of the detected entities. :smiley:
Something like this:

dwadden commented 2 years ago

Hmm I don't have a strong opinion on storing them as spans vs. entities. How about both ways, if it's not too much more work?

e3oroush commented 2 years ago

No I don't think so, it just didn't come to me when I was developing it. I will try to add them as well as the unit test, over the weekend.