OmkarPathak / pyresparser

A simple resume parser used for extracting information from resumes
GNU General Public License v3.0

train model shows entity overlap #38

Open Manikandan0001 opened 3 years ago

Manikandan0001 commented 3 years ago

@OmkarPathak Can you please help me train a custom model without overlapping entities? Is there a function or method to avoid the overlap?

ValueError: [E103] Trying to set conflicting doc.ents: '(4774, 4778, 'Location')' and '(4744, 4789, 'College Name')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

BenSturgeon commented 3 years ago

I encountered a similar issue and edited the custom_train file to fix it. Most of the changes are in the method called "determine". As far as I can tell, the problem arises when converting from Dataturks to spaCy format, but the fix should eliminate overlaps in general. Let me know if it helps.

custom_train_fixed.zip
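
For readers without the attachment, here is a minimal sketch of one way to drop overlapping spans before they reach spaCy. It is not the code from custom_train_fixed.zip. The function name `drop_overlapping_entities` is hypothetical, and it assumes entities are `(start, end, label)` tuples with an exclusive end offset, as in the E103 message above. When two spans overlap it keeps the longer one, which is a policy choice rather than the only reasonable one.

```python
def drop_overlapping_entities(entities):
    """Return a non-overlapping subset of (start, end, label) spans.

    Assumes `end` is exclusive. When spans overlap, the longer span wins;
    ties are broken by the earlier start offset.
    """
    # Visit the longest spans first so they get priority.
    ordered = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in ordered:
        # Two spans are disjoint if one ends before the other starts.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    # Restore document order for the training data.
    return sorted(kept)


if __name__ == "__main__":
    conflicting = [(4774, 4778, "Location"), (4744, 4789, "College Name")]
    print(drop_overlapping_entities(conflicting))
```

With the two spans from the error above, only the longer 'College Name' span survives. If the shorter label matters more for your data, change the sort key accordingly. spaCy's `spacy.util.filter_spans` applies a similar longest-first rule, but it works on `Span` objects rather than raw offsets.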

Manikandan0001 commented 3 years ago

Thanks for your response, @BenSturgeon. I'll let you know if it works.

Manikandan0001 commented 3 years ago

@BenSturgeon Training completed without any errors using your code, thanks. But the parsing results after training are not very effective. Is that right?

qarampage commented 3 years ago

Hi, I am still getting an error after using the custom_train_fixed file.

C:\projects\py_virtual_env\venvr\venv\lib\site-packages\spacy\language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "Ritesh To be an asset to the company and de..." with entities "[[1427, 1470, 'Email Address'], [996, 1039, 'Skill...". Use spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities) to check the alignment. Misaligned entities ('-') will be ignored during training.
  gold = GoldParse(doc, **gold)
Losses {'ner': 65305.11264929587}
Starting iteration 1
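
A common cause of W030 is an annotated span that starts or ends on whitespace, so its boundaries do not line up with token boundaries. Whether that is what happened with this dataset is an assumption, since the thread does not confirm it. The sketch below trims such whitespace from `(start, end, label)` spans with exclusive end offsets. The helper name `trim_entity_whitespace` is hypothetical and is not part of spaCy or pyresparser.

```python
def trim_entity_whitespace(text, entities):
    """Shrink each (start, end, label) span so it does not begin or end
    on whitespace. Spans that contain only whitespace are dropped.

    Assumes `end` is exclusive and offsets index into `text`.
    """
    cleaned = []
    for start, end, label in entities:
        # Move the start forward past leading whitespace.
        while start < end and text[start].isspace():
            start += 1
        # Move the end backward past trailing whitespace.
        while end > start and text[end - 1].isspace():
            end -= 1
        if start < end:
            cleaned.append((start, end, label))
    return cleaned


if __name__ == "__main__":
    sample = "Email: a@b.com \nSkills"
    print(trim_entity_whitespace(sample, [(6, 16, "Email Address")]))
```

This does not repair spans that cut through the middle of a token. For those, the check that the warning itself suggests, `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`, marks the affected tokens with '-'.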

I also receive an error when executing test_name.py after running the above training module only once, and I am not sure where it is picking up en_training from:

C:\projects\py_virtual_env\venvr\venv\lib\site-packages\spacy\util.py:275: UserWarning: [W031] Model 'en_training' (0.0.0) requires spaCy v2.1 and is incompatible with the current spaCy version (2.3.2). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
Traceback (most recent call last):
  File "C:/projects/mygitlab/mlpython/Jupyter_Notebooks/Projects_LARGE/Resume-Parser-Source/test_name.py", line 44, in test_local_name()