explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
30.2k stars, 4.4k forks

"invalid whitespace entity spans" error while validating training and test data for NER #13689

Open abrarsharif66 opened 2 days ago

abrarsharif66 commented 2 days ago

How to reproduce the behaviour

I used the code below to convert my JSON annotations to spaCy's binary format. When I validate the result with spaCy's `debug data` command, I get an "invalid whitespace entity spans" error:

[screenshot of the debug output]

Please help me resolve this.

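    import spacy
    from spacy.tokens import DocBin
    from tqdm import tqdm

    # Assumed setup (not shown in the original post): a blank English pipeline
    # and an empty DocBin. TRAIN_DATA is the parsed JSON annotation file.
    nlp = spacy.blank("en")
    db = DocBin()

    for text, annot in tqdm(TRAIN_DATA['annotations']):
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            # In "contract" mode, char_span returns None if no token fits inside the range
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping entity")
            else:
                ents.append(span)
        doc.ents = ents
        db.add(doc)

    db.to_disk("train_data.spacy")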
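One possible cause, offered as an assumption because the failing records are not shown: some annotated character ranges begin or end on whitespace (spaces, tabs, newlines), and the resulting entity spans are what the validation step flags. A minimal sketch of trimming the offsets before they reach `doc.char_span`; `trim_entity_offsets` is a hypothetical helper name, not part of spaCy, and the sample string is made up:

```python
def trim_entity_offsets(text, start, end):
    """Shrink the range [start, end) so it neither begins nor ends on whitespace.

    Returns a (start, end) tuple, or None if the range is empty or
    contains only whitespace.
    """
    # Clamp the offsets to the bounds of the text.
    start = max(start, 0)
    end = min(end, len(text))
    # Move the start forward past leading whitespace.
    while start < end and text[start].isspace():
        start += 1
    # Move the end backward past trailing whitespace.
    while end > start and text[end - 1].isspace():
        end -= 1
    return (start, end) if start < end else None


# Made-up example: the annotated range covers extra spaces and a newline.
sample = "Name:  Zixuan Wu \n"
trimmed = trim_entity_offsets(sample, 6, 18)
print(trimmed, repr(sample[trimmed[0]:trimmed[1]]))
```

In the conversion loop, the trimmed offsets would be passed to `doc.char_span` in place of the raw `start` and `end`, and entities for which the helper returns `None` could be skipped.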

Info about spaCy

abrarsharif66 commented 2 days ago

Here is a sample of my training data JSON, to show the schema:

{"classes":["SOFTWARE_NAME","JOB_TYPE","EDUCATION","UNIVERSITY","DEGREE","YEARS_OF_EXPERIENCE","STATE","CITY","COUNTRY","PROGRAMING_CONCEPT","COMPANY_NAME","PROGRAMMING_LANGUAGE","FRAMEWORKS","SOFT_SKILLS","JOB_TITLE","NAME","EMAIL","PH.NO"],"annotations":[["Zixuan Wu zixwu@ucdavis.edu",{"entities":[[0,9,"NAME"],[10,27,"EMAIL"]]}],["1363 Briones Ct | Pleasanton, CA 94588 | (510) 676-7461",{"entities":[[41,55,"PH.NO"]]}]]}
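To find which records in a file with this schema might trigger the error, a quick scan along these lines may help. It is only a sketch: `find_whitespace_spans` is a hypothetical helper, and the second record in the demo data is made up to produce a hit.

```python
def find_whitespace_spans(annotations):
    """Yield (record_index, start, end, label, span_text) for every entity
    whose character range is empty or begins/ends with whitespace."""
    for i, (text, annot) in enumerate(annotations):
        for start, end, label in annot["entities"]:
            span_text = text[start:end]
            # strip() removes leading/trailing whitespace; a difference means
            # the annotated range touches whitespace at one of its edges.
            if not span_text or span_text != span_text.strip():
                yield i, start, end, label, span_text


data = {
    "annotations": [
        ["Zixuan Wu zixwu@ucdavis.edu",
         {"entities": [[0, 9, "NAME"], [10, 27, "EMAIL"]]}],
        # Made-up record: the range [0, 7) includes the trailing space.
        ["Python developer",
         {"entities": [[0, 7, "PROGRAMMING_LANGUAGE"]]}],
    ]
}

for hit in find_whitespace_spans(data["annotations"]):
    print(hit)
```

This checks raw character offsets only. Spans that are clean at the character level can still be shifted by token alignment, so an empty result does not rule out every case.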