explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Doc.from_json ? #5935

Closed Mistobaan closed 4 years ago

Mistobaan commented 4 years ago

Feature description

I have been trying the whole night to understand how to use spaCy to evaluate existing trained spaCy and transformers models.

I converted my CoNLL NER data to JSONL, but I could not figure out how to reconstruct a Doc from the serialized string. The GoldParse format is also very confusing, particularly in how it interacts with all these unnamed tuples. Doc has a to_json but does not have a from_json.

Ideal scenario:

        python -m spacy convert \
            ${INPUT}/data/${DATASET}/${SPLIT}.iob \
            ${OUTPUT_TASK} \
            --file-type jsonl \
            --converter ner \
            --n 1

import spacy

for doc in spacy.docs_from_jsonl(jsonl_dataset_location):
    visualize_ner(doc)
    break

Am I missing something? Inspecting the training set before starting any sort of training seems like a pretty common scenario. Any advice is welcome.
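For a first look at the converted file, before any Doc is reconstructed, a plain JSONL reader may already be enough. The sketch below uses only the Python standard library. The helper name iter_jsonl is hypothetical and not part of spaCy, and it makes no assumption about the schema that spacy convert writes: it simply yields one parsed JSON value per line.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file.

    Hypothetical helper for inspection only; it is not a spaCy API.
    """
    with Path(path).open(encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

Calling next(iter_jsonl(jsonl_dataset_location)) would then return the first record for manual inspection, mirroring the loop-and-break pattern above.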

svlandeg commented 4 years ago

You're absolutely right that the unnamed tuples are very confusing and the classes for storing gold data are not always intuitive. To get to your docs from a JSON file, you'd need to do something like this:

from spacy.gold import GoldCorpus  # spaCy v2.x import path

corpus = GoldCorpus(str(json_path), str(json_path))  # train and dev locations
docs = list(corpus.train_docs(nlp))

The good news is, we've been refactoring the gold format a lot for the upcoming spaCy v3 release. The main format for training, storing docs, etc. will be .spacy files created by DocBin, and there will be a specific json2docs converter. You can have a peek at the develop branch if you're interested ;-)
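To make the DocBin route concrete, here is a minimal sketch of a round trip through a .spacy file. It assumes the API as it was later released in spaCy v3 (DocBin.to_disk, DocBin.from_disk, DocBin.get_docs), which may differ from the develop branch at the time of this thread. The sample sentence, entity labels and file name are invented for illustration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

# Blank pipeline: tokenizer only, so no trained model has to be downloaded.
nlp = spacy.blank("en")

# Build a Doc with two hand-set entities (illustrative data, not from the thread).
doc = nlp("Apple is based in Cupertino")
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 4, 5, label="GPE")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.spacy"  # hypothetical file name
    # Serialize the Doc to disk, then load it back using the same vocab.
    DocBin(docs=[doc]).to_disk(path)
    restored = list(DocBin().from_disk(path).get_docs(nlp.vocab))

# Entities survive the round trip and can be inspected or visualized.
ents = [(ent.text, ent.label_) for ent in restored[0].ents]
print(ents)
```

The restored objects are ordinary Doc instances, so they can be passed to a visualizer or an evaluation loop without going through the gold tuples.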

Mistobaan commented 4 years ago

Yes, I don't mind working from unstable code. Some questions to get me more involved:

svlandeg commented 4 years ago

It may be a bit tricky to contribute to the develop branch right now, as things are still changing slightly. We're also still working on the documentation, but we're not quite there yet. If you want to play with it, you can check out the CLI command spacy init config to get a basic config file, and then call spacy train with that config file. It might be worthwhile to wait until we get a proper RC out though, I'm not sure.

In the meantime I'll close this specific issue if that's alright, as I think it's addressed and there's not really an action point for us any more.
