Mistobaan closed 4 years ago
You're absolutely right that the unnamed tuples are very confusing and the classes for storing gold data are not always intuitive. To get to your docs from a JSON file, you'd need to do something like this:
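from spacy.gold import GoldCorpus  # spaCy v2 import path
corpus = GoldCorpus(str(json_path), str(json_path))
# In spaCy v2, train_docs yields (Doc, GoldParse) pairs
docs = list(corpus.train_docs(nlp))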
The good news is, we've been refactoring the gold format a lot for the upcoming spaCy v3 release. The main format for training, storing docs etc. will be .spacy files created by DocBin, and there will be a specific json2docs converter. You can have a peek on the develop branch if you're interested ;-)
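To make the DocBin idea concrete, here is a minimal sketch of a round trip: build a Doc with entity annotations, serialize it, and restore it. It assumes a spaCy version that ships DocBin and uses a blank English pipeline so no trained model is needed; the example sentence and labels are made up, and this is not the json2docs converter itself.

```python
import spacy
from spacy.tokens import Doc, DocBin, Span

# Blank pipeline: only a tokenizer and a vocab, no trained components.
nlp = spacy.blank("en")

# Build a Doc by hand from pre-tokenized words (illustrative data).
doc = Doc(nlp.vocab, words=["Apple", "is", "based", "in", "Cupertino"])
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 4, 5, label="GPE")]

# Explicitly request the entity attributes so they are serialized
# regardless of the version's default attribute list.
doc_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE"])
doc_bin.add(doc)
data = doc_bin.to_bytes()

# Restore the docs; the vocab is needed to map hashes back to strings.
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
entities = [(ent.text, ent.label_) for ent in restored[0].ents]
print(entities)
```

In spaCy v3, doc_bin.to_disk(path) and DocBin().from_disk(path) do the same thing against a .spacy file on disk instead of a bytes object.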
Yes, I don't mind working from unstable code. Some questions to get me more involved:
It may be a bit tricky to contribute to the develop branch right now, as things are still changing slightly. We're also still working on the documentation but we're not quite there yet. If you want to play with it, you can check out the CLI commands: spacy init config to get a basic config file, and spacy train called with that config file. It might be worthwhile to wait until we get a proper RC out though, I'm not sure.
In the meantime I'll close this specific issue if that's alright, as I think it's addressed and there's not really an action point for us any more.
Feature description
I have been trying the whole night to understand how to use spaCy to evaluate existing trained models for spaCy and transformers.
I transformed my CoNLL NER data to JSONL, but I could not figure out how to reconstruct a Doc from the serialized string. Even the GoldParse format is very confusing in how it interacts with all these unnamed tuples. Doc has a to_json but does not have a from_json.
Ideal scenario:
Am I missing something? This seems like a pretty common scenario: inspecting the training set before even starting to do any sort of training. Any advice is welcome.
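For the inspection step on its own, a plain-Python pass over the CoNLL data can already surface entity strings and label counts without building a Doc. The following is a rough sketch under stated assumptions: whitespace-separated lines with the token first and a BIO tag last, blank lines between sentences, and tokens rejoined with single spaces. The function names (read_conll, to_record) and the JSONL field names ("text", "entities") are hypothetical and not a spaCy format.

```python
import json
from collections import Counter


def read_conll(lines):
    """Yield each sentence as a list of (token, tag) pairs."""
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        parts = line.split()
        sentence.append((parts[0], parts[-1]))
    if sentence:
        yield sentence


def to_record(sentence):
    """Join tokens with single spaces and turn BIO tags into character spans."""
    tokens, entities = [], []
    offset = 0
    current = None  # [start, end, label] of the entity being built
    for token, tag in sentence:
        start, end = offset, offset + len(token)
        label = tag[2:] if tag[:2] in ("B-", "I-") else None
        continues = tag.startswith("I-") and current and current[2] == label
        if continues:
            current[1] = end  # extend the open entity
        else:
            if current:
                entities.append(current)
            # A B- tag (or a stray I- tag) opens a new entity; O opens none.
            current = [start, end, label] if label else None
        tokens.append(token)
        offset = end + 1  # account for the joining space
    if current:
        entities.append(current)
    return {"text": " ".join(tokens), "entities": entities}


sample = """Apple B-ORG
is O
based O
in O
New B-GPE
York I-GPE

Tim B-PER
Cook I-PER
visited O
"""

records = [to_record(s) for s in read_conll(sample.splitlines())]
label_counts = Counter(label for r in records for _, _, label in r["entities"])

for record in records:
    print(json.dumps(record))  # one JSON object per line, i.e. JSONL
print(label_counts)
```

Records in this shape carry character offsets, so they can later be turned into a Doc by tokenizing the text and mapping each span onto tokens, for example with Doc.char_span.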