explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Improvements on CLI training #3366

Closed: Abhijit-2592 closed this issue 5 years ago

Abhijit-2592 commented 5 years ago

I have been using the CLI training on the spacy-nightly versions. It's extremely powerful. I have two suggestions:

  1. The choice of JSON as the dataset format doesn't scale well when the dataset is larger than RAM: while creating a custom dataset (spaCy JSON), I have to keep all the data in memory before I can dump it to a JSON file, since JSON doesn't support lazy writing from Python (as far as I can tell). How about a binary file instead of JSON? Or is there another way that I'm missing?

  2. Since displaCy can produce HTML, I guess there are ways to visualize parses in TensorBoard, since tf.summary.text supports Markdown. Streaming the loss and metrics to TensorBoard would also be useful.

I am currently working on the second task and will update if something comes of it.

ines commented 5 years ago

Thanks for your feedback and I agree, there's definitely room for improvement here 👍

JSON doesn't support lazy writing from Python (as far as I can tell). How about a binary file instead of JSON? Or is there another way that I'm missing?

Yeah, using JSON wasn't the best choice and spaCy actually does a bunch of stuff under the hood to stream in JSON line by line and avoid loading it all into memory at once (see here for the code). However, going forward, we want to change the training data format over to JSONL (newline-delimited JSON) and also use a more straightforward, consistent structure to provide the annotations. See #2928 for the proposal and examples.
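As a rough illustration of why the proposed JSONL format streams so naturally: each record lives on its own line, so it can be written and read one record at a time with nothing but the standard library. This is just a sketch with made-up function names (srsly provides equivalents), not spaCy's actual implementation:

```python
import json

def write_jsonl(path, records):
    """Write newline-delimited JSON: one record per line, so only a
    single record ever needs to be held in memory at a time."""
    with open(path, "w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_jsonl(path):
    """Lazily yield one parsed record per line."""
    with open(path, encoding="utf8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Because `read_jsonl` is a generator, a consumer can iterate over an arbitrarily large file without ever materializing the full dataset.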

We didn't want to rush things so that change won't make it into v2.1.0. But it'll probably be the next thing on our list once the new stable version is out.

Btw, if you want to test new CLI stuff, you could also give the debug-data command a try (implementation here). It'd be nice to hear how it works for other people's data. It's already included in the nightly but still very experimental and not officially documented yet. The JSON validation also isn't implemented yet, since that needs the new JSONL data format described in #2928.

Since displaCy can produce HTML, I guess there are ways to visualize parses in TensorBoard, since tf.summary.text supports Markdown. Streaming the loss and metrics to TensorBoard would also be useful.

Yes, that'd be super cool! If you end up with a good solution for this, we'd definitely appreciate a pull request!
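One possible shape for this, purely as a sketch assuming TensorFlow 2.x (the `log_step` helper is made up, not part of spaCy or TensorFlow): scalar summaries for the losses plus a text summary for displaCy's output. Note that tf.summary.text renders Markdown, so raw HTML may only partially display in TensorBoard.

```python
import tensorflow as tf

# A file writer pointed at a hypothetical log directory.
writer = tf.summary.create_file_writer("logs/spacy-train")

def log_step(step, losses, html=None):
    """Stream per-step losses (a dict of name -> float) to TensorBoard,
    and optionally attach a displaCy HTML/Markdown snippet."""
    with writer.as_default():
        for name, value in losses.items():
            tf.summary.scalar(f"loss/{name}", value, step=step)
        if html is not None:
            tf.summary.text("displacy", html, step=step)
    writer.flush()

# e.g. inside a training loop, something like:
# log_step(i, {"ner": losses["ner"]}, displacy.render(doc, style="ent"))
```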

Abhijit-2592 commented 5 years ago

@ines Thanks, I will definitely give the new one a try. But my original question still remains: how do I create a JSON file if my entire dataset does not fit into memory? During training there is lazy loading, but the problem is creating the JSON data in the first place.

ines commented 5 years ago

Ah, I just realised this should probably be clearer in the docs: train_path and dev_path can take either a single file or a directory of files. So for large datasets, you should be able to split your data into separate files.
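To illustrate the directory option, here is a hypothetical helper (the name and chunking scheme are made up, not part of spaCy) that writes a record stream into numbered JSON chunk files, so neither the whole corpus nor any one file has to fit in RAM:

```python
import json
from pathlib import Path

def split_corpus(records, out_dir, chunk_size=1000):
    """Dump records into numbered JSON files of at most chunk_size
    records each, inside out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk, n_files = [], 0
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            (out_dir / f"train_{n_files}.json").write_text(json.dumps(chunk))
            chunk, n_files = [], n_files + 1
    if chunk:  # flush the final partial chunk
        (out_dir / f"train_{n_files}.json").write_text(json.dumps(chunk))
```

The resulting directory can then be passed as train_path in place of a single file.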

If you do want one file: it's not very elegant, but I guess you could also hack it together with strings? Load part of the data, dump it as JSON and write it to a file. Then load more records, dump them as objects, add commas in between, and insert them before the final closing ] in your file. Repeat until all the data is in there.
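The string hack above could be sketched like this; `append_json_array` is a hypothetical helper (not part of spaCy), and it assumes every batch is non-empty:

```python
import json
import os

def append_json_array(path, records):
    """Splice a batch of records into a JSON array file just before the
    final "]", creating the file on the first call. Only the current
    batch is ever held in memory."""
    payload = ",".join(json.dumps(r) for r in records)
    if not os.path.exists(path):
        with open(path, "w", encoding="utf8") as f:
            f.write("[" + payload + "]")
        return
    with open(path, "r+", encoding="utf8") as f:
        f.seek(0, os.SEEK_END)
        f.seek(f.tell() - 1)  # step back over the closing "]"
        f.write("," + payload + "]")
```

Each call leaves a valid JSON array on disk, so the process can safely be repeated batch by batch.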

Abhijit-2592 commented 5 years ago

@ines Thanks! this is awesome!

mehmetilker commented 5 years ago

Hi @ines

When I tried the new CLI training, I got the following exception. In summary, the convert tooling produces a JSONL document, but train does not recognize JSONL.

If the new CLI won't make it into 2.1, I assume the behavior should be the same as 2.0. Or did I misunderstand something above?

python -m spacy train es models ancora-json/es_ancora-ud-train.jsonl ancora-json/es_ancora-ud-dev.jsonl
Training pipeline: ['tagger', 'parser', 'ner']
Starting with blank model 'es'
Counting training words (limit=0)
Traceback (most recent call last):
  File "C:\Users\imete\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\imete\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\_Dev\_dev\temp\spatest\_env\lib\site-packages\spacy\__main__.py", line 38, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "C:\_Dev\_dev\temp\spatest\_env\lib\site-packages\plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "C:\_Dev\_dev\temp\spatest\_env\lib\site-packages\plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\_Dev\_dev\temp\spatest\_env\lib\site-packages\spacy\cli\train.py", line 185, in train
    corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
  File "gold.pyx", line 108, in spacy.gold.GoldCorpus.__init__
  File "gold.pyx", line 119, in spacy.gold.GoldCorpus.write_msgpack
  File "gold.pyx", line 156, in read_tuples
ValueError: Cannot read from file: ancora-json\es_ancora-ud-train.jsonl. Supported formats: .json, .msg

After having problems training a model with 2.0.18 (detailed here), I wanted to try 2.1: https://github.com/explosion/spaCy/issues/3056#issuecomment-470068902

ines commented 5 years ago

In summary, the convert tooling produces a JSONL document, but train does not recognize JSONL.

Thanks for the report, and yes, it looks like we made the convert command default to JSONL too soon, before actually implementing that in spacy.gold. Just had a look and it should be pretty easy to add – so even if v2.1.0 still uses the old format, you'll at least be able to stream it in from JSONL.

In the meantime, try setting --file-type json on spacy convert.

(Btw, I totally forgot that the gold corpus can read from .msg, i.e. msgpack files. So we can also support this as an option in spacy convert. The new version will use our own library srsly for serialization, so reading and writing those files is a lot more straightforward now.)

mehmetilker commented 5 years ago

Thanks @ines for your quick reply. --file-type json did the job. But I have the same problem with 2.1 as well: I cannot get pos and tag attributes after training a model with the UD Turkish Treebank, as detailed here: https://github.com/explosion/spaCy/issues/3056#issuecomment-470068902

Should I open a new issue, or is it something simple that I'm missing? I will try the same procedure with the Spanish treebank to see if the Turkish ISTM has a problem, but training has not finished yet.

ines commented 5 years ago

I'll close this issue, since #3374 addresses most of the training data format enhancements discussed here.

@Abhijit-2592 Definitely keep us updated on the progress regarding the TensorBoard stuff and feel free to open a PR if there's something you want to suggest for this 👍

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.