egerber / spaCy-entity-linker

spaCy module for linking text to Wikidata items
MIT License
215 stars, 32 forks

IndexError on Strings containing Certain Characters #10

Closed: matt-buckley closed this issue 1 year ago

matt-buckley commented 2 years ago

When I run a basic spaCy model such as en_core_web_lg with only an entityLinker pipe added, calling nlp() raises an IndexError on certain strings, particularly those containing whitespace characters such as newlines. The error and the line that triggers it are:

```
      def _get_candidates_in_sent(self, sent, doc):
---->     root = list(filter(lambda token: token.dep_ == "ROOT", sent))[0]
          excluded_children = []
          candidates = []

IndexError: list index out of range
```

I'm running Python 3.9, spaCy 3.2.4, and spaCy-entity-linker 1.0.1.
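The failing line indexes `[0]` into a filtered list, so it raises as soon as a sentence span contains no token whose dependency label is `ROOT`; the report suggests this happens for inputs with newline characters. The sketch below reproduces that failure mode with plain stand-in objects instead of real spaCy tokens and shows one defensive alternative. `find_root` is a hypothetical helper, the dependency label on the whitespace stand-in is arbitrary, and this is not necessarily how the library itself was fixed.

```python
from types import SimpleNamespace


def find_root(sent):
    """Return the first token whose dep_ is "ROOT", or None if there is none."""
    return next((token for token in sent if token.dep_ == "ROOT"), None)


# Stand-ins for spaCy tokens; only the dep_ attribute matters here.
normal_sent = [
    SimpleNamespace(text="Hello", dep_="ROOT"),
    SimpleNamespace(text="world", dep_="npadvmod"),
]
# A span with no ROOT token (the label "dep" is an arbitrary placeholder).
rootless_sent = [SimpleNamespace(text="\n", dep_="dep")]

# The pattern from the traceback: indexing an empty list raises IndexError.
try:
    list(filter(lambda token: token.dep_ == "ROOT", rootless_sent))[0]
    raised = False
except IndexError:
    raised = True

root = find_root(normal_sent)       # the "Hello" stand-in
missing = find_root(rootless_sent)  # None instead of an exception
```

A caller using this pattern would need to treat `None` as "no candidates in this sentence" and skip the span rather than crash.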

ninikolov commented 2 years ago

Same here

isu-shrestha commented 1 year ago

I'm having the same problem. As a temporary workaround, I collapse the whitespace before calling nlp(): `text = ' '.join(text.split())`
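This workaround relies on `str.split()` with no arguments, which splits on any run of whitespace (spaces, tabs, newlines) and drops empty strings, so joining the pieces with a single space collapses every run. A small sketch follows; `normalize_whitespace` is a hypothetical helper name, and the sample text is made up for illustration.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


sample = "Apple was founded\n\nin Cupertino.\t It  sells phones.\n"
cleaned = normalize_whitespace(sample)
print(cleaned)  # Apple was founded in Cupertino. It sells phones.
```

Because line breaks are discarded, character offsets in the cleaned text no longer match the original string, which matters if entity spans need to be mapped back to the source document.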

MartinoMensio commented 1 year ago

Hi @matt-buckley, @ninikolov, and @isu-shrestha, thank you for opening the issue. I recently became a maintainer of this package and had not noticed the open issue. I just tested and merged #9, which should fix it. Can you confirm on your end?

Best, Martino

dennlinger commented 1 year ago

Hi @MartinoMensio, I also encountered this issue not too long ago (tried it two weeks ago and it failed). For my particular file, it now works! Thanks @jonwiggins for the fix :partying_face:

I'll contribute a PR with an additional test case containing a minimal document sample that caused a crash. That way future iterations have better test coverage, and you can reproduce the issue yourself.

MartinoMensio commented 1 year ago

Hi @dennlinger, thank you very much for joining this issue and for confirming that it now works! I'm looking forward to receiving your PR with the test case! This project needs it :)

Best, Martino