explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Custom NER model does not recognize other entities #3528

Closed Jacky-Miu closed 5 years ago

Jacky-Miu commented 5 years ago

I have followed the approach below to train a custom NER Model: https://github.com/Jcharis/Natural-Language-Processing-Tutorials/blob/master/Training%20the%20Named%20Entity%20Recognizer%20in%20SpaCy.ipynb

I used the dataset in Kaggle to conduct my training: https://www.kaggle.com/dataturks/resume-entities-for-ner

The dataset contains 220 resumes and introduces about 11 new entity types; the annotations contain no other entity types. After training (I iterated over the training data about 20-30 times), I was able to use the new NER model to recognize the new entities in my test data. However, the model can no longer recognize the original entities such as DATE, PERSON, or ORG. My new NER model recognizes only the 11 new entity types and nothing else.

My understanding is that this is the "catastrophic forgetting" problem mentioned on spaCy's website. So I added a few examples containing the old entities to my training data (around 5-6 examples) and passed the new dataset to spaCy for training again. This brought no improvement: my new NER model was still unable to recognize the old entities.

What is the appropriate approach for training new entity types in spaCy? When the dataset is annotated only with the new entities, what can I do to retain the ability of nlp = spacy.load("en_core_web_sm") to recognize the old/standard entities? Do I need to modify every annotated example to include the old entities before I can use the dataset to train the new ones? That seems unrealistic, but I have no idea how else to proceed. Is it enough to add a few examples containing the original entities? If not, how many are required in my case? For your information, my train/test split is about 80/20, so the training set has around 170 examples. Thank you in advance for any assistance.


svlandeg commented 5 years ago

You'll have to add more old/standard examples to balance out the training. This part of the documentation explains how to generate those with a static "old" model and mix them in. 5-6 examples vs 170 does not feel sufficient - but you'll have to experiment a bit with what works best for your particular dataset.
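For reference, a minimal sketch of that mixing step, sometimes called pseudo-rehearsal (spaCy 2.x API; `raw_texts` and `new_entity_examples` are illustrative placeholders, not names from the docs):

```python
import spacy

# Static "old" model used only to generate silver-standard annotations;
# it is never updated, so its predictions keep the original entity types.
original_nlp = spacy.load("en_core_web_sm")

# Illustrative raw texts; in practice, sentences from your own domain.
raw_texts = [
    "Apple was founded by Steve Jobs in April 1976.",
    "She moved to London in 2015 to join Google.",
]

revision_data = []
for doc in original_nlp.pipe(raw_texts):
    # Keep whatever the old model predicts (PERSON, ORG, DATE, ...).
    entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    revision_data.append((doc.text, {"entities": entities}))

new_entity_examples = []  # placeholder: your annotated resume examples
train_data = revision_data + new_entity_examples  # mix old and new before training
```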

vmolchan commented 5 years ago

It seems from your description that the problem is not catastrophic forgetting. It could be that you created a blank spaCy model and trained it on the new examples with the 11 new entities, rather than retraining the existing model. If that's the case, then in the tutorial you linked (https://github.com/Jcharis/Natural-Language-Processing-Tutorials/blob/master/Training%20the%20Named%20Entity%20Recognizer%20in%20SpaCy.ipynb) try passing a pretrained model instead of the default model=None.
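That would amount to changing which branch of the tutorial's `main` is taken, along these lines (a sketch, not the tutorial's exact code):

```python
import spacy

model = "en_core_web_sm"  # instead of the default model=None
if model is not None:
    nlp = spacy.load(model)  # pretrained pipeline: NER already knows PERSON, ORG, DATE, ...
else:
    nlp = spacy.blank("en")  # blank model: no pretrained entities at all
```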

Jacky-Miu commented 5 years ago

Thank you svlandeg and vmolchan for your prompt replies and suggestions. Let me try them out, and I'll let you know if I can get my problem solved. Thank you again!

Jacky-Miu commented 5 years ago

> It seems from your description that the problem is not catastrophic forgetting. [...] try passing a pretrained model instead of the default model=None.

I have reviewed my code against the tutorial's suggestions. The tutorial suggests the following:

```python
if model is not None:
    nlp = spacy.load(model)  # load existing spaCy model
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank('en')  # create blank Language class
```

In my case, I don't know why I got an error when I followed the tutorial and used nlp = spacy.blank('en'), so I changed my code to the following:

```python
import en_core_web_sm

if model is not None:
    nlp = model.load()
    print("loaded model '%s'" % model)
else:
    nlp = en_core_web_sm.load()
```

Since I had defined model = None, the else branch (nlp = en_core_web_sm.load()) was executed. Even so, I have not been able to get my new NER model to recognize the old entities. Is there something wrong with my modification? Or is it due to an insufficient number of examples containing the old/original entities? Thanks again.

svlandeg commented 5 years ago

The blank en model does not contain a pretrained NER model; you need to use one of the pretrained models like en_core_web_sm. Check in your code first (before any retraining) that your current model is correctly recognising the old entities, then start mixing in new entities and retrain, all the while testing whether your model is now performing well on both old and new entities. You need to evaluate this as you retrain to figure out how many old and new examples you'll need.
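A quick sanity check along those lines might look like this (the expected output is indicative only):

```python
import spacy

# Before any retraining: confirm the pretrained model finds the standard entities.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Sundar Pichai joined Google in 2004.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Sundar Pichai', 'PERSON'), ('Google', 'ORG'), ('2004', 'DATE')]
```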

Jacky-Miu commented 5 years ago

> The blank en model does not contain a pretrained NER model; you need to use one of the pretrained models like en_core_web_sm. [...]

Thank you, svlandeg, for your suggestion! Let me run my examples with old and new entities as suggested first. Thanks.

chaitanya1019 commented 5 years ago


Hi Jack, I am also working on the same Kaggle dataset, following https://github.com/DataTurks-Engg/Entity-Recognition-In-Resumes-SpaCy to train my custom NER model. The author there uses a test data.json file to test his custom NER model and gets accurate results. In my case, however, I want to take a resume in PDF/DOC/DOCX format and extract the entities and their labels with my trained custom NER model. I used pdfminer.six to convert the PDF to text and applied my custom NER model to it. Of the 9 entity types I used to train my model (Companies Worked at, Skills, Graduation Year, College Name, Degree, Designation, Email Address, Location, Name), it recognizes only 5-6 for all the resumes I tested. Can you please help me with this? Please have a look at my code.

```python
from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
from dataturks_train import convert_dataturks_to_spacy

# training data
TRAIN_DATA = convert_dataturks_to_spacy("C:/Users/chaitanyas/Desktop/resume_parser_ner_spacy/Entity-Recognition-In-Resumes-SpaCy/traindata.json")


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("C:/Users/chaitanyas/Desktop/resume_parser_ner_spacy/PDF-Resume-Parsing-with-spacy/saved_model_0.2", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, output_dir='C:/Users/chaitanyas/Desktop/resume_parser_ner_spacy/PDF-Resume-Parsing-with-spacy/saved_model_0.2', n_iter=100):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # reset and initialize the weights randomly – but only if we're
        # training a new model
        if model is None:
            nlp.begin_training()
        for itn in range(n_iter):
            print(n_iter - itn, "iterations left")
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)

    # test the trained model
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])


if __name__ == "__main__":
    plac.call(main)
```
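Since `plac.call(main)` turns the annotated parameters into command-line options, passing the pretrained model suggested earlier should just be a matter of something like `python your_train_script.py -m en_core_web_sm -n 30` (script name assumed), so that the `spacy.load(model)` branch is taken instead of the blank-model one.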

test.py

```python
from __future__ import unicode_literals, print_function

import plac
from pathlib import Path
import spacy
from pdf2text import extract_text

TEST_DATA = extract_text("C:/Users/chaitanyas/Desktop/resume_parser_ner_spacy/PDF-Resume-Parsing-with-spacy/resumes/train48.pdf")


def cleanup(token, lower=True):
    if lower:
        token = token.lower()
    return token.strip()


def main(output_dir='C:/Users/chaitanyas/Desktop/resume_parser_ner_spacy/PDF-Resume-Parsing-with-spacy/saved_model_0.2'):
    """Load the saved model and test the entity recognizer."""
    # test the saved model
    print("Loading from", output_dir)
    nlp2 = spacy.load(output_dir)

    print("Loading text from test data", output_dir)
    document = nlp2(TEST_DATA)

    print("entities are: ", document.ents)
    # for ent in document.ents:
    #     print(ent.text, ",", ent.label_)

    labels = set([w.label_ for w in document.ents])

    for label in labels:
        entities = [cleanup(e.string, lower=False) for e in document.ents if label == e.label_]
        entities = list(set(entities))
        print(label, entities)


if __name__ == "__main__":
    plac.call(main)
```
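The `pdf2text` helper imported above is not shown in the thread; a minimal sketch of what it presumably does, assuming a pdfminer.six version that exposes the high-level `extract_text` helper:

```python
# pdf2text.py -- hypothetical reconstruction of the missing helper
from pdfminer.high_level import extract_text as pdfminer_extract_text


def extract_text(pdf_path):
    """Return the plain text of a PDF file as a single string."""
    return pdfminer_extract_text(pdf_path)
```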

convert_json_to_spacy.py

```python
import json
import logging


def convert_dataturks_to_spacy(dataturks_JSON_FilePath):
    try:
        training_data = []
        lines = []
        with open(dataturks_JSON_FilePath, 'r', errors='ignore') as f:
            lines = f.readlines()

        for line in lines:
            data = json.loads(line)
            text = data['content']
            entities = []
            for annotation in data['annotation']:
                # only a single point in text annotation.
                point = annotation['points'][0]
                labels = annotation['label']
                # handle both list of labels or a single label.
                if not isinstance(labels, list):
                    labels = [labels]

                for label in labels:
                    # dataturks indices are both inclusive [start, end] but spacy is not [start, end)
                    entities.append((point['start'], point['end'] + 1, label))

            training_data.append((text, {"entities": entities}))

        return training_data
    except Exception as e:
        logging.exception("Unable to process " + dataturks_JSON_FilePath + "\n" + "error = " + str(e))
        return None
```
ak3895 commented 5 years ago

Hi @chaitanya1019, yes, I was trying to train a model to parse resumes. I was testing on sample txt files and have not yet tried an actual resume in PDF format. Which version of spaCy are you using?

Jacky-Miu commented 5 years ago

> Of the 9 entity types I used to train my model, it recognizes only 5-6 for all the resumes I tested. Can you please help me with this?
@chaitanya1019 I don't think you need to be concerned about your code or algorithm: since it recognizes some of the fields, your machine learning model is working. I think the problem lies in the dataset itself. This dataset came from indeed.com, and the data in it has some peculiar characteristics. For example, the Name field is almost always the first two words of every document, so the trained model will only recognize the first two words as the Name. That means the model can give accurate results when the test document also comes from indeed.com, but if your test document does not have the name in the first two words, it will be misclassified.

I also noted that the locations are Indian locations, so the model has learned to recognize specifically Indian locations as the LOC field; if your test document contains locations outside India, the trained model will likely miss them too. Last but not least, many of the 'Companies Worked at' values are 'Infosys Ltd', so there is not enough variety in this dataset to teach the model. Under these circumstances, the trained model can give quite accurate results on very similar data, but it will not generalize to unseen data with more variety.
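One way to make that kind of dataset bias visible is to score the model per entity type on a held-out split; a sketch using spaCy 2.x's evaluation API (`held_out_data` is a placeholder; `ents_per_type` appears in the scores dict in recent 2.x versions):

```python
from spacy.gold import GoldParse
from spacy.scorer import Scorer


def evaluate_per_label(nlp, examples):
    """Score NER predictions against gold annotations, per entity type."""
    scorer = Scorer()
    for text, annotations in examples:
        gold = GoldParse(nlp.make_doc(text), entities=annotations["entities"])
        pred = nlp(text)  # predict with the trained pipeline
        scorer.score(pred, gold)
    return scorer.scores  # includes 'ents_p', 'ents_r', 'ents_f', 'ents_per_type'


# e.g. evaluate_per_label(nlp2, held_out_data) on resumes NOT seen in training
```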

chaitanya1019 commented 5 years ago

> Under these circumstances, the trained model can give quite accurate results on very similar data, but it will not generalize to unseen data with more variety.

So should I add my new entity types to an existing pretrained NER model (e.g. en_core_web_sm), following the steps at https://spacy.io/usage/training#example-new-entity-type?

chaitanya1019 commented 5 years ago

> Which version of spaCy are you using?

I used the latest version, 2.1.3, but it was throwing an error on the sample training data I used. From my research I found that it has to do with whitespace in the training data; some answers suggested using 2.0.18, and with that version the model trained with no issues.
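For reference, the whitespace error in 2.1 usually means an annotated span starts or ends on a space, which the newer alignment code rejects; a common workaround (a sketch, not from this thread) is to trim the offsets before training:

```python
def trim_entity_spans(data):
    """Shrink entity offsets so spans don't start or end on whitespace,
    which spaCy 2.1+ rejects as misaligned entities."""
    cleaned = []
    for text, annotations in data:
        entities = []
        for start, end, label in annotations["entities"]:
            # move the boundaries inwards past any whitespace
            while start < end and text[start].isspace():
                start += 1
            while end > start and text[end - 1].isspace():
                end -= 1
            if start < end:
                entities.append((start, end, label))
        cleaned.append((text, {"entities": entities}))
    return cleaned
```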

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.