explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

train_ner.py train data format to spaCy's json #5604

Closed fcggamou closed 4 years ago

fcggamou commented 4 years ago

Hi, I'm trying to use the CLI train command to train an NER model. I was able to train one by following the train_ner.py example, where the data needs to be formatted like this:

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

I now want to use the more powerful CLI.train command, but all my data is in the format above. Is there an existing script for this conversion? As far as I can see, this isn't supported by CLI.convert.

Thanks.


adrianeboyd commented 4 years ago

Here's my Stack Overflow answer on how to do this: https://stackoverflow.com/a/59209377/461847

It would probably make sense to add an example script to do this, since this is the main missing step for people who want to move from the super simple example training scripts to real training with the train CLI.
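For readers who want to see the shape of the conversion, here is a minimal sketch. It is not the code from the linked answer and does not call spaCy at all. It assumes the target is spaCy v2's JSON training layout (a list of documents with `paragraphs`, `sentences`, and `tokens`, each token carrying a BILUO `ner` tag) and uses a hypothetical regex tokenizer as a stand-in for spaCy's own. The helper names `tokenize`, `biluo_tags`, and `to_spacy_json` are made up for illustration. For real training data, the tokens should come from the same spaCy tokenizer the model will use, and the field names should be checked against the documentation for your spaCy version.

```python
import json
import re

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


def tokenize(text):
    """Hypothetical stand-in tokenizer: returns (start, end, token) triples
    for runs of word characters and for single punctuation marks."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def biluo_tags(tokens, entities):
    """Assign one BILUO tag per token from character-offset entity spans."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        # Indices of tokens that lie entirely inside the entity span.
        inside = [i for i, (s, e, _) in enumerate(tokens) if s >= start and e <= end]
        # Skip spans that do not line up with token boundaries.
        if not inside or tokens[inside[0]][0] != start or tokens[inside[-1]][1] != end:
            continue
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label
        else:
            tags[inside[0]] = "B-" + label
            for i in inside[1:-1]:
                tags[i] = "I-" + label
            tags[inside[-1]] = "L-" + label
    return tags


def to_spacy_json(train_data):
    """Build a list of documents in the assumed spaCy v2 JSON layout,
    treating each text as one paragraph containing one sentence."""
    docs = []
    for doc_id, (text, annotations) in enumerate(train_data):
        tokens = tokenize(text)
        tags = biluo_tags(tokens, annotations["entities"])
        sentence = {
            "tokens": [
                {"id": i, "orth": tok, "ner": tag}
                for i, ((_, _, tok), tag) in enumerate(zip(tokens, tags))
            ]
        }
        docs.append({"id": doc_id, "paragraphs": [{"raw": text, "sentences": [sentence]}]})
    return docs


converted = to_spacy_json(TRAIN_DATA)
print(json.dumps(converted[0], indent=2))
```

Writing `converted` to a file with `json.dump` would give a file in the assumed layout. Entity spans that do not align with token boundaries are silently skipped in this sketch, so it is worth counting them before training.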

fcggamou commented 4 years ago

Thanks a lot Adriane, sorry I missed the Stack Overflow answer. I agree this would be a good example to add, since moving from the simple example to real training is probably a very common case. Wouldn't it make sense to add it to the CLI convert script as another supported format?

svlandeg commented 4 years ago

> Wouldn't it make sense to add it into the CLI convert script as another supported format?

You're right that this has been lacking. For spaCy v3, we're working on an overhaul of the convert function and the data formats in general, which should hopefully make all of this more intuitive!

github-actions[bot] commented 4 years ago

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.