Closed Abhijit-2592 closed 5 years ago
Thanks for your feedback and I agree, there's definitely room for improvement here 👍
JSON doesn't support lazy writing or dumping via Python (I think, but I'm not sure). So how about a binary file instead of JSON? Or is there another way that I'm missing?
Yeah, using JSON wasn't the best choice and spaCy actually does a bunch of stuff under the hood to stream in JSON line by line and avoid loading it all into memory at once (see here for the code). However, going forward, we want to change the training data format over to JSONL (newline-delimited JSON) and also use a more straightforward, consistent structure to provide the annotations. See #2928 for the proposal and examples.
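The appeal of newline-delimited JSON is that each record sits on its own line, so it can be read lazily with a generator. A minimal stdlib sketch of that idea (this is an illustration, not spaCy's actual reader; the function name `stream_jsonl` is my own):

```python
import json

def stream_jsonl(path):
    """Yield one parsed record per line, never holding the whole file in memory."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because it returns a generator, you can iterate over millions of records with constant memory, which is exactly what a plain JSON array can't give you without a streaming parser.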
We didn't want to rush things so that change won't make it into v2.1.0. But it'll probably be the next thing on our list once the new stable version is out.
Btw, if you want to test new CLI stuff, you could also give the debug-data command a try (implementation here). It'd be nice to hear how it works for other people's data. It's already included in the nightly but still very experimental and not officially documented yet. The JSON validation also isn't implemented yet, since that needs the new JSONL data format described in #2928.
Since displaCy can produce HTML, I guess there are ways to visualize these in TensorBoard, as tf.summary.text supports Markdown. Moreover, streaming the loss and metrics to TensorBoard would also be useful, I guess.
Yes, that'd be super cool! If you end up with a good solution for this, we'd definitely appreciate a pull request!
@ines Thanks, I will definitely give the new one a try. But my original question still remains: how will I create a JSON file if my entire dataset does not fit into memory? There is lazy loading during training, but the problem is creating the JSON data in the first place.
Ah, I just realised this should probably be clearer in the docs: the train_path and dev_path arguments can take either a single file or a directory of files. So for large datasets, you should be able to split your data into separate files.
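The directory-of-files approach keeps memory bounded by the chunk size. A rough sketch of what that splitting could look like (the helper name `split_into_files` and the file naming scheme are my own, not part of spaCy):

```python
import json
import os
from itertools import islice

def split_into_files(records, out_dir, chunk_size=1000):
    """Write an iterable of training examples into numbered JSON files,
    keeping at most chunk_size records in memory at a time."""
    records = iter(records)
    paths = []
    # iter(callable, sentinel) keeps pulling chunks until islice returns []
    for i, chunk in enumerate(iter(lambda: list(islice(records, chunk_size)), [])):
        path = os.path.join(out_dir, "train_%03d.json" % i)
        with open(path, "w", encoding="utf8") as f:
            json.dump(chunk, f)
        paths.append(path)
    return paths
```

You'd then point train_path at the output directory instead of a single file.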
If you do want one file: it's not very elegant, but I guess you could also hack it together with strings. Load part of the data, dump it as JSON and write it to a file. Then load more records, dump them as objects, add commas in between and insert them before the final closing ] in your file. Repeat until all the data is in there.
@ines Thanks! this is awesome!
Hi @ines
When I tried the new CLI training I got the following exception. In summary, the convert tooling produces a JSONL document, but train does not recognize JSONL.
If the new CLI won't make it into 2.1, I assume the expected behavior should be the same as in 2.0. Or did I misunderstand something above?
python -m spacy train es models ancora-json/es_ancora-ud-train.jsonl ancora-json/es_ancora-ud-dev.jsonl
Training pipeline: ['tagger', 'parser', 'ner']
Starting with blank model 'es'
Counting training words (limit=0)
Traceback (most recent call last):
File "C:\Users\imete\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\imete\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\_Dev\_dev\temp\spatest\_env\lib\site-packages\spacy\__main__.py", line 38, in <module>
plac.call(commands[command], sys.argv[1:])
File "C:\_Dev\_dev\temp\spatest\_env\lib\site-packages\plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "C:\_Dev\_dev\temp\spatest\_env\lib\site-packages\plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "C:\_Dev\_dev\temp\spatest\_env\lib\site-packages\spacy\cli\train.py", line 185, in train
corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
File "gold.pyx", line 108, in spacy.gold.GoldCorpus.__init__
File "gold.pyx", line 119, in spacy.gold.GoldCorpus.write_msgpack
File "gold.pyx", line 156, in read_tuples
ValueError: Cannot read from file: ancora-json\es_ancora-ud-train.jsonl. Supported formats: .json, .msg
After having problems training a model with 2.0.18 (detailed here), I wanted to try 2.1: https://github.com/explosion/spaCy/issues/3056#issuecomment-470068902
In summary, the convert tooling produces a JSONL document, but train does not recognize JSONL.
Thanks for the report and yes, it looks like we made the convert command default to JSONL too soon, before actually implementing that in spacy.gold. Just had a look and it should be pretty easy to add – so even if v2.1.0 still uses the old format, you'll at least be able to stream it in from JSONL.
In the meantime, try setting --file-type json on spacy convert.
(Btw, I totally forgot that the gold corpus can read from .msg, i.e. msgpack files. So we can also support this as an option in spacy convert. The new version will use our own library srsly for serialization stuff, so reading and writing those files is a lot more straightforward now.)
Thanks @ines for your quick reply. --file-type json did the job, but I see the same problem with 2.1 as well: I cannot get pos and tag attributes after training a model with the UD Turkish treebank, as detailed here: https://github.com/explosion/spaCy/issues/3056#issuecomment-470068902
Should I open a new issue, or is it something simple that I am missing? I will try the same procedure with the Spanish treebank to see whether the Turkish ISTM treebank is the problem, but training hasn't finished yet.
I'll close this issue, since #3374 addresses most of the training data format enhancements discussed here.
@Abhijit-2592 Definitely keep us updated on the progress regarding the TensorBoard stuff and feel free to open a PR if there's something you want to suggest for this 👍
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I have been using the CLI training on the spacy-nightly versions. It's extremely powerful. There are two suggestions from my side:
1. The choice of JSON as the dataset format doesn't scale well when the dataset is larger than RAM: while creating a custom dataset (spaCy JSON), I have to keep all the data in memory before I can dump it as JSON, since JSON doesn't support lazy writing or dumping via Python (I think, but I'm not sure). So how about a binary file instead of JSON? Or is there another way that I'm missing?
2. Since displaCy can produce HTML, I guess there are ways to visualize these in TensorBoard, as tf.summary.text supports Markdown. Moreover, streaming the loss and metrics to TensorBoard would also be useful.
I am currently working on the second task. Will update if something comes of it.