explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Evaluate Command Line method returns NER P R F equals to null #8469

Closed SNiiceron closed 3 years ago

SNiiceron commented 3 years ago

Hi everyone. I'm trying to evaluate my NER model, which is supposed to recognize industrial processes such as 3D printing or laser cutting. I created a trainData file with a tool called Doccano. I want to evaluate the model with the command-line method "python -m spacy evaluate" and a file named evaluationData.spacy. It's the file I'm using to see whether my model recognizes those industrial processes correctly. What I get in return is this: { "token_acc":1.0, "ents_p":null, "ents_r":null, "ents_f":null, "speed":11231.4352399962 }

I don't understand why precision, recall and F-score are all null. Is there something wrong with how I created my model, or with my evaluationData.spacy file?

Thanks for any answers. I can share parts of my code if that helps to fix my problem.

adrianeboyd commented 3 years ago

Hi, I don't know for sure without more details, but this typically indicates that the docs in the evaluation data don't have doc.ents set. If they're all unset, the scorer will return null / None for ents_p/r/f.
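As a minimal sketch of how one might check this, the snippet below counts how many docs in a .spacy file carry any entities. The file name evaluationData.spacy comes from the original post; the helper names and the blank "en" pipeline are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def count_docs_with_ents(docs: Iterable) -> Tuple[int, int]:
    """Return (number of docs with at least one entity, total number of docs)."""
    total = 0
    with_ents = 0
    for doc in docs:
        total += 1
        if len(doc.ents) > 0:
            with_ents += 1
    return with_ents, total


def load_docs(path: str, lang: str = "en"):
    """Load all docs from a .spacy file (assumes a blank pipeline for `lang`)."""
    # spaCy is imported here so the counting helper stays usable on its own.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)
    return list(DocBin().from_disk(path).get_docs(nlp.vocab))


# Usage (requires spaCy and the file from the post):
# with_ents, total = count_docs_with_ents(load_docs("evaluationData.spacy"))
# print(f"{with_ents} of {total} docs have entities")
```

If the first number is 0, the evaluation data has no entities for the scorer to compare against, which would be consistent with null scores.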

Do you see entities if you just run the model on some of the plain training texts? If you do see predictions, it's probably the eval data. If so, can you show how you created the evaluation docs?

If you don't see any predictions, can you provide more information about how you created the training docs and trained the model?

SNiiceron commented 3 years ago

I have this JSONL file (shown after the script) that I convert into a .spacy file with this convert.py script:

import srsly
import typer
import warnings
from pathlib import Path

import spacy
from spacy.tokens import DocBin

def convert(lang: str, input_path: Path, output_path: Path):
    nlp = spacy.blank(lang)
    db = DocBin()
    for line in srsly.read_jsonl(input_path):
        doc = nlp.make_doc(line["text"])
        doc.cats = line["annotations"]
        db.add(doc)
    db.to_disk(output_path)

if __name__ == "__main__":
    typer.run(convert)
{"id": 664, "text": "The use of a lighter set of drilling tools", "annotations": [{"label": 18, "start_offset": 28, "end_offset": 37, "user": 1, "created_at": "2021-06-22T13:48:05.582662Z", "updated_at": "2021-06-22T13:48:05.582662Z"}], "meta": {}, "annotation_approver": null, "comment_count": 0}
{"id": 665, "text": "For drilling the deeper wells, the derrick, on account of the length of the string of drilling tools, is usually at least 7 o ft", "annotations": [{"label": 18, "start_offset": 4, "end_offset": 13, "user": 1, "created_at": "2021-06-22T13:48:08.983661Z", "updated_at": "2021-06-22T13:48:08.983661Z"}, {"label": 18, "start_offset": 86, "end_offset": 95, "user": 1, "created_at": "2021-06-22T13:48:12.246661Z", "updated_at": "2021-06-22T13:48:12.246661Z"}], "meta": {}, "annotation_approver": null, "comment_count": 0}
{"id": 666, "text": "The drilling tools are suspended by an untarred manila rope, 2 in.", "annotations": [{"label": 18, "start_offset": 4, "end_offset": 13, "user": 1, "created_at": "2021-06-22T13:48:16.323714Z", "updated_at": "2021-06-22T13:48:16.323714Z"}], "meta": {}, "annotation_approver": null, "comment_count": 0}
{"id": 667, "text": "The drilling crew consists of two drillers Well", "annotations": [{"label": 18, "start_offset": 4, "end_offset": 13, "user": 1, "created_at": "2021-06-22T13:48:20.911661Z", "updated_at": "2021-06-22T13:48:20.911661Z"}], "meta": {}, "annotation_approver": null, "comment_count": 0}
{"id": 668, "text": "The drilling of a well is commonly carried out under contract, the producer erecting the derrick and providing the engine and boiler while the drilling contractor finds the tools, and is Drill ing the responsible for accidents or failure to complete the well.", "annotations": [{"label": 18, "start_offset": 4, "end_offset": 13, "user": 1, "created_at": "2021-06-22T13:49:16.115705Z", "updated_at": "2021-06-22T13:49:16.115705Z"}, {"label": 18, "start_offset": 143, "end_offset": 152, "user": 1, "created_at": "2021-06-22T13:49:21.789703Z", "updated_at": "2021-06-22T13:49:21.789703Z"}], "meta": {}, "annotation_approver": null, "comment_count": 0}
{"id": 669, "text": "After a month's vigorous drilling Hicks led 5000 of his men against an equal force of dervishes in Sennar, whom he defeated, and cleared the country between the towns of Sennar and Khartum of rebels.", "annotations": [{"label": 18, "start_offset": 25, "end_offset": 34, "user": 1, "created_at": "2021-06-22T13:49:31.006714Z", "updated_at": "2021-06-22T13:49:31.006714Z"}], "meta": {}, "annotation_approver": null, "comment_count": 0}
{"id": 670, "text": "Although petroleum wells in Russia have not the depth of many of those in the United States, the disturbed character of the strata, with consequent liability to caving, and the occurrence of hard concretions, render drilling a lengthy and expensive Drilling in operation.", "annotations": [{"label": 18, "start_offset": 216, "end_offset": 225, "user": 1, "created_at": "2021-06-22T13:49:42.639661Z", "updated_at": "2021-06-22T13:49:42.639661Z"}, {"label": 18, "start_offset": 249, "end_offset": 258, "user": 1, "created_at": "2021-06-22T13:49:45.630662Z", "updated_at": "2021-06-22T13:49:45.630662Z"}], "meta": {}, "annotation_approver": null, "comment_count": 0}
{"id": 671, "text": "The outward set of teeth drill the hole large enough to permit the drilling apparatus to descend freely, and the teeth set inwardly pare down the core to such a diameter as will admit of the body of the cutter passing over it without seizing.", "annotations": [{"label": 18, "start_offset": 67, "end_offset": 76, "user": 1, "created_at": "2021-06-22T13:49:51.006714Z", "updated_at": "2021-06-22T13:49:51.006714Z"}], "meta": {}, "annotation_approver": null, "comment_count": 0}
{"id": 672, "text": "In conclusion it may be stated that the two systems of drilling for petroleum with which by far the largest amount of work has been, and is being done, are the American or rope Comparison system, and the Canadian or rod system.", "annotations": [{"label": 18, "start_offset": 55, "end_offset": 64, "user": 1, "created_at": "2021-06-22T13:49:57.981701Z", "updated_at": "2021-06-22T13:49:57.981701Z"}], "meta": {}, "annotation_approver": null, "comment_count": 0}
{"id": 673, "text": "At this period the supply of the raw material was insufficient to admit of any important development in the industry, and before the drilling of artesian wells for petroleum was initiated by Drake the coal-oil or shale-oil industry had assumed considerable proportions in the United States.", "annotations": [{"label": 18, "start_offset": 133, "end_offset": 142, "user": 1, "created_at": "2021-06-22T13:50:06.875707Z", "updated_at": "2021-06-22T13:50:06.875707Z"}], "meta": {}, "annotation_approver": null, "comment_count": 0}
{"id": 674, "text": "The increased mortality seems to be due to the general tendency toward forced speed in development work, which is secured by rapid drilling, and by an increase in the number of machine drills used in a single working-place.", "annotations": [{"label": 18, "start_offset": 131, "end_offset": 139, "user": 1, "created_at": "2021-06-22T13:50:12.199662Z", "updated_at": "2021-06-22T13:50:12.199662Z"}], "meta": {}, "annotation_approver": null, "comment_count": 0}
{"id": 675, "text": "The horse-drawn hoe is steered by means of handles in the rear, but its successful working depends on accurate drilling of the seed, because unless the rows are parallel the roots of the plants are liable to be cut and the foliage injured.", "annotations": [{"label": 18, "start_offset": 111, "end_offset": 120, "user": 1, "created_at": "2021-06-22T13:50:18.822715Z", "updated_at": "2021-06-22T13:50:18.822715Z"}], "meta": {}, "annotation_approver": null, "comment_count": 0}
adrianeboyd commented 3 years ago

Hi, this isn't the right conversion script for your data. You're setting doc.cats instead of doc.ents in the line doc.cats = line["annotations"]. doc.cats is for text categories, so the evaluation docs end up with no entity annotations.

See an example for entities here (you'll have to adjust it to read JSONL instead of JSON and to use the correct dict keys for the spans/offsets):

https://github.com/explosion/projects/blob/6e2a4ff98c2cfcda93431ffc9361470795609592/pipelines/ner_demo/scripts/convert.py
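A rough sketch of that adjustment for the Doccano JSONL shown above might look like the following. It is not the linked script. The label_names mapping (for example 18 to "PROCESS") is hypothetical, since the export only contains numeric label IDs, and the helper names are assumptions. Several offsets in the sample data include a trailing space (28 to 37 covers "drilling "), so the sketch trims whitespace before calling doc.char_span, which returns None when offsets do not line up with token boundaries.

```python
import warnings
from pathlib import Path
from typing import Dict, List, Tuple


def trim_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Shrink (start, end) so the span has no leading or trailing whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


def doccano_spans(record: dict, label_names: Dict[int, str]) -> List[Tuple[int, int, str]]:
    """Turn one Doccano record into (start, end, label) tuples."""
    spans = []
    for ann in record["annotations"]:
        start, end = trim_span(record["text"], ann["start_offset"], ann["end_offset"])
        # Fall back to the numeric ID as a string if no name is known.
        label = label_names.get(ann["label"], str(ann["label"]))
        spans.append((start, end, label))
    return spans


def convert(lang: str, input_path: Path, output_path: Path, label_names: Dict[int, str]):
    # Imported here so the offset helpers above can be used without spaCy.
    import spacy
    import srsly
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)
    db = DocBin()
    for record in srsly.read_jsonl(input_path):
        doc = nlp.make_doc(record["text"])
        ents = []
        for start, end, label in doccano_spans(record, label_names):
            span = doc.char_span(start, end, label=label)
            if span is None:
                # Offsets do not align with token boundaries; skip this one.
                warnings.warn(f"Skipping misaligned span {start}-{end} in record {record.get('id')}")
            else:
                ents.append(span)
        doc.ents = ents  # set entities, not doc.cats
        db.add(doc)
    db.to_disk(output_path)
```

The key change from the original script is that the annotations end up in doc.ents as Span objects rather than in doc.cats.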

adrianeboyd commented 3 years ago

Let me convert this to a discussion. This issue will be locked but you can follow the link to the new thread.