AnasAito / SkillNER

A (smart) rule based NLP module to extract job skills from text
https://skillner.vercel.app/
MIT License
135 stars, 49 forks

IndexError: list index out of range #68

Open Jibril-Frej opened 1 year ago

Jibril-Frej commented 1 year ago

Some strings make the annotate function crash:

import spacy
from spacy.matcher import PhraseMatcher

# load the default skills database
from skillNer.general_params import SKILL_DB
# import skill extractor
from skillNer.skill_extractor_class import SkillExtractor

# init params of skill extractor
nlp = spacy.load("en_core_web_lg")
# init skill extractor
skill_extractor = SkillExtractor(nlp, SKILL_DB, PhraseMatcher)

skill_extractor.annotate("Learn how to become a professional wedding makeup artist")

Running the code above raises the following error:


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[69], line 1
----> 1 skill_extractor.annotate("Learn how to become a professional wedding makeup artist")

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/skill_extractor_class.py:129, in SkillExtractor.annotate(self, text, tresh)
    123 skills_abv, text_obj = self.skill_getters.get_abv_match_skills(
    124     text_obj, self.matchers['abv_matcher'])
    126 skills_uni_full, text_obj = self.skill_getters.get_full_uni_match_skills(
    127     text_obj, self.matchers['full_uni_matcher'])
--> 129 skills_low_form, text_obj = self.skill_getters.get_low_match_skills(
    130     text_obj, self.matchers['low_form_matcher'])
    132 skills_on_token = self.skill_getters.get_token_match_skills(
    133     text_obj, self.matchers['token_matcher'])
    134 full_sk = skills_full + skills_abv

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/matcher_class.py:332, in SkillsGetter.get_low_match_skills(self, text_obj, matcher)
    329 for match_id, start, end in matcher(doc):
    330     id_ = matcher.vocab.strings[match_id]
--> 332     if text_obj[start].is_matchable:
    333         skills.append({'skill_id': id_+'_lowSurf',
    334                        'doc_node_value': str(doc[start:end]),
    335                        'doc_node_id': list(range(start, end)),
    336                        'type': 'lw_surf'})
    338 return skills, text_obj

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/text_class.py:304, in Text.__getitem__(self, index)
    277 def __getitem__(
    278     self,
    279     index: int
    280 ) -> Word:
    281     """To get the word at the specified position by index
    282 
    283     Parameters
   (...)
    302     english
    303     """
--> 304     return self.list_words[index]

IndexError: list index out of range
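
Reading the traceback, the `start` index produced by `matcher(doc)` is used to index `text_obj`, so the crash means `start` is at least as large as the number of words `text_obj` holds. Why the two lengths would differ for this sentence is not established in this thread. The following is only an illustrative sketch of a bounds check, using plain tuples instead of the real spaCy matcher; `keep_in_range` and the sample data are hypothetical and not part of skillNer.

```python
from typing import Iterable, List, Tuple

# (match_id, start, end): the tuple shape unpacked in the traceback's loop.
Match = Tuple[int, int, int]


def keep_in_range(matches: Iterable[Match], n_words: int) -> List[Match]:
    """Drop matches whose start index does not fit a word list of length n_words."""
    return [m for m in matches if 0 <= m[1] < n_words]


# Hypothetical data: the second match starts at index 9,
# but the word list only has 9 entries (valid indices 0..8).
matches = [(101, 2, 3), (102, 9, 10)]
safe_matches = keep_in_range(matches, n_words=9)
print(safe_matches)
```

A check like this would only hide the symptom: any skill found in the dropped match would be lost, so it is not a substitute for finding out why the indices disagree.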
ManalIrfan commented 1 year ago

I'm running into the same problem. Is there a way to sanitize the string to avoid it?

ManalIrfan commented 1 year ago

Seems to be a problem with some Unicode characters. Encoding to ASCII (dropping anything that cannot be encoded) and then decoding back to UTF-8 works.

import unicodedata
...

text = "My Random Character text"
text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")
annotations = skill_extractor.annotate(text)
Jibril-Frej commented 1 year ago

I am still running into the same issue even with the encoding/decoding workaround:

import spacy
from spacy.matcher import PhraseMatcher
import unicodedata

# load the default skills database
from skillNer.general_params import SKILL_DB
# import skill extractor
from skillNer.skill_extractor_class import SkillExtractor

# init params of skill extractor
nlp = spacy.load("en_core_web_lg")
# init skill extractor
skill_extractor = SkillExtractor(nlp, SKILL_DB, PhraseMatcher)

text = "Learn how to become a professional wedding makeup artist"
text = unicodedata.normalize('NFKD', text ).encode('ascii', 'ignore').decode("utf-8")
annotations = skill_extractor.annotate(text )

I still get the same error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[2], line 4
      2 text = "Learn how to become a professional wedding makeup artist"
      3 text = unicodedata.normalize('NFKD', text ).encode('ascii', 'ignore').decode("utf-8")
----> 4 annotations = skill_extractor.annotate(text )

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/skill_extractor_class.py:129, in SkillExtractor.annotate(self, text, tresh)
    123 skills_abv, text_obj = self.skill_getters.get_abv_match_skills(
    124     text_obj, self.matchers['abv_matcher'])
    126 skills_uni_full, text_obj = self.skill_getters.get_full_uni_match_skills(
    127     text_obj, self.matchers['full_uni_matcher'])
--> 129 skills_low_form, text_obj = self.skill_getters.get_low_match_skills(
    130     text_obj, self.matchers['low_form_matcher'])
    132 skills_on_token = self.skill_getters.get_token_match_skills(
    133     text_obj, self.matchers['token_matcher'])
    134 full_sk = skills_full + skills_abv

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/matcher_class.py:332, in SkillsGetter.get_low_match_skills(self, text_obj, matcher)
    329 for match_id, start, end in matcher(doc):
    330     id_ = matcher.vocab.strings[match_id]
--> 332     if text_obj[start].is_matchable:
    333         skills.append({'skill_id': id_+'_lowSurf',
    334                        'doc_node_value': str(doc[start:end]),
    335                        'doc_node_id': list(range(start, end)),
    336                        'type': 'lw_surf'})
    338 return skills, text_obj

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/text_class.py:304, in Text.__getitem__(self, index)
    277 def __getitem__(
    278     self,
    279     index: int
    280 ) -> Word:
    281     """To get the word at the specified position by index
    282 
    283     Parameters
   (...)
    302     english
    303     """
--> 304     return self.list_words[index]

IndexError: list index out of range
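
A possible explanation for why the workaround changes nothing here: the example sentence, as it appears in this thread, contains only ASCII characters, so the normalize/encode/decode round trip returns it unchanged. The check below uses only the standard library and assumes the string is exactly the one pasted above.

```python
import unicodedata

text = "Learn how to become a professional wedding makeup artist"

# Same round trip as the suggested workaround.
sanitized = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")

print(text.isascii())     # True: there is nothing for the ASCII step to drop
print(sanitized == text)  # True: annotate() receives the very same string
```

If that holds, non-ASCII characters are not what triggers the error for this input, even if sanitizing helps with other strings.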
chrisho51 commented 1 year ago

Facing this issue as well. Did you ever find a fix, @Jibril-Frej?

Jibril-Frej commented 1 year ago

No real fix. I just wrap the call in a try/except and skip the texts that fail:

try:
    annotations = skill_extractor.annotate(target_text)
except (IndexError, ValueError):
    # annotate crashed on this text: skip it
    annotations = None
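
When many texts are processed, it may help to record which ones were skipped instead of dropping them silently. The sketch below is one way to do that; `annotate_or_none`, `annotate_all`, and `fake_annotate` are hypothetical names, and `fake_annotate` merely stands in for `skill_extractor.annotate` so the example runs without skillNer installed.

```python
from typing import Any, Callable, List, Optional, Tuple


def annotate_or_none(annotate: Callable[[str], Any], text: str) -> Optional[Any]:
    """Return annotate(text), or None if it raises IndexError or ValueError."""
    try:
        return annotate(text)
    except (IndexError, ValueError):
        return None


def annotate_all(
    annotate: Callable[[str], Any], texts: List[str]
) -> Tuple[List[Tuple[str, Any]], List[str]]:
    """Annotate every text; return (successful results, texts that were skipped)."""
    results: List[Tuple[str, Any]] = []
    skipped: List[str] = []
    for text in texts:
        annotations = annotate_or_none(annotate, text)
        if annotations is None:
            skipped.append(text)
        else:
            results.append((text, annotations))
    return results, skipped


def fake_annotate(text: str) -> dict:
    """Stand-in for skill_extractor.annotate that fails on one known input."""
    if "makeup" in text:
        raise IndexError("list index out of range")
    return {"text": text}


results, skipped = annotate_all(
    fake_annotate,
    ["Python developer", "professional wedding makeup artist"],
)
print(len(results), skipped)
```

With the real extractor, the call would be `annotate_all(skill_extractor.annotate, texts)`. The skipped list then doubles as a set of reproducible inputs for this issue.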
AJeschor commented 5 months ago

I am also encountering this error. I would really like to use SkillNER but this issue is really preventing me from being able to do so.