Closed ryszardtuora closed 5 years ago
The CoNLL-U evaluation script should be able to handle differences in sentence segmentation, so there's probably something wrong with the output formatting. How are you exporting the parses to CoNLL-U after parsing with the model? Is there a blank line between the two spaCy sentences in the case where spaCy has split one original sentence into two?
When training a general-purpose model, it's usually better to have paragraph-like data, where you set `--n-sents` to something like 10.
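For example, assuming a UD-style `.conllu` training file, the conversion step might look like this (flag name per the spaCy v2 `convert` CLI; the file and directory names are illustrative):

```shell
# Group 10 sentences per output document so the parser
# sees multi-sentence docs and can learn sentence boundaries.
python -m spacy convert train.conllu ./converted --n-sents 10
```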
Alternatively (and I don't really recommend this, but you can do it if you need it for a particular type of evaluation), you can force the parser to preserve the existing sentence segmentation by setting the property `is_sent_start` on each token: set it to `True` for the first token in the sentence and `False` for all other tokens. It's probably easiest to create a tiny custom pipeline component that you add to the pipeline before the parser.
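A minimal sketch of such a component, in the spaCy v2 style of a plain callable added to the pipeline (the function name is illustrative):

```python
def force_single_sentence(doc):
    """Mark the first token as a sentence start and forbid the
    parser from introducing any further sentence boundaries."""
    for i, token in enumerate(doc):
        token.is_sent_start = (i == 0)
    return doc

# Add it before the parser so the parser respects the boundaries:
# nlp.add_pipe(force_single_sentence, before="parser")
```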
Ok, the problem no longer appears once I train the parser the "orthodox" way, I have satisfactory parsing results now. The problem is that I've been trying to avoid this, because I need to train the parser on one corpus (which sadly does not contain any documents longer than one sentence), and the tagger on another, much bigger corpus. I'm still not really sure how to do this, because as far as I understand, pipeline components may alter vector representations for their purposes, so training them in sequence is not really an option. Can you give me some advice on how to proceed now?
You should be able to train the tagger on one corpus:

```
spacy train -p tagger pl tagger-model train.json dev.json
```
And then use the tagger model as the base model when training the parser on different data:

```
spacy train -b tagger-model -p parser pl parser-model train.json dev.json
```
I would still recommend something like `--n-sents 10` for the parser, otherwise it won't learn how to set sentence boundaries very well. It would obviously be better if the data had real paragraphs, but this is also exactly how the distributed spaCy models are trained when the corpora don't provide paragraph information.
Thank you for your quick reply!
That's what I've been doing; I'm trying to train the tagger now, based on an existing model (vocab + parser).
This is my exact command:

```
spacy train pl outdir fullcorp_train.json fullcorp_test.json --base-model fasttextmodel --pipeline tagger --n-iter 100 --n-early-stopping 5 --gold-preproc -G --verbose -VV
```
Where `fasttextmodel` is simply the `model-best` folder generated while training the parser.
The error I get is this:
```
Traceback (most recent call last):
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 257, in train
    losses=losses,
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/language.py", line 475, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "pipes.pyx", line 449, in spacy.pipeline.pipes.Tagger.update
  File "pipes.pyx", line 80, in spacy.pipeline.pipes.Pipe.require_model
ValueError: [E109] Model for component 'tagger' not initialized. Did you forget to load a model, or forget to call begin_training()?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/__main__.py", line 35, in <module>
```
BTW, I've tried using `--n-sents 10` for the parser, but the results I get are slightly worse (though maybe that's because the evaluation also uses corpora with one-sentence documents only).
Hmm, sorry, my first answer was wrong, since `-b` is just for extending existing pipeline components.
Even though it's the exact same situation for many distributed models (often the NER models aren't trained on the same data as the tagger/parser), I think the train CLI doesn't handle this case very well.
So this is not a great answer, but I think you'll currently have to train the tagger and parser separately and combine them by hand. As long as you use the same vectors, you should be able to copy the parser `model-best/parser` folder into the tagger `model-best` folder. Double-check that the `vocab/` folders in the two models are identical. Then update `meta.json` to add `parser` to the `pipeline` so it knows to load it. You can also add all the accuracy info, but I think that's just metadata and not required for the model to work. This is basically how the distributed models are trained and assembled.
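A rough sketch of that manual combination step (the function name and directory paths are hypothetical; it assumes `meta.json` has a top-level `"pipeline"` list, as in spaCy v2 model packages):

```python
import json
import shutil
from pathlib import Path


def merge_parser_into_tagger(parser_dir, tagger_dir, component="parser"):
    """Copy the trained parser component folder into the tagger model
    directory and register it in the tagger's meta.json pipeline."""
    parser_dir, tagger_dir = Path(parser_dir), Path(tagger_dir)
    # Copy e.g. parser-model/model-best/parser -> tagger-model/model-best/parser
    shutil.copytree(parser_dir / component, tagger_dir / component)
    # Register the component so the combined model loads it.
    meta_path = tagger_dir / "meta.json"
    meta = json.loads(meta_path.read_text())
    if component not in meta.setdefault("pipeline", []):
        meta["pipeline"].append(component)
    meta_path.write_text(json.dumps(meta, indent=2))


# Usage (hypothetical layout):
# merge_parser_into_tagger("parser-model/model-best", "tagger-model/model-best")
```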
The tagger and parser models are independent and shouldn't modify the vocab/vectors during training.
Try it and see if you get the same results as with the independent tagger and parser models in your evaluations?
We should think about how/whether the train CLI could be improved to handle this, since it's a pretty common case.
Wait, I overlooked that it will update `meta.json` for you with `-m`. I still think you'll have to combine the model directories by hand, though. Because of how it saves the intermediate model results while training, you wouldn't want it to write to the same directory as your existing model. I'll try out a full example later today to be sure...
The vocab folders are not identical; all files except for the vectors themselves differ in size. Nevertheless, everything seems to work fine now. Thank you for your advice!
I do think that the train CLI should handle this case better, but I'll open that as a separate issue, since we've moved away from the original multiple roots problem.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
I'm trying to train the parser for Polish. I'm using the PDB treebank in the CoNLL-U format (because it contains one sentence per paragraph, I've used the option `--n-sents=1` while converting). The results are close to 84% UAS, somewhat weaker than what I would expect (having used different parsers before), so I'm not sure if I'm doing everything right. The treebank contains a few percent of nonprojective trees, so I apply the gold preprocessing. When I try to evaluate the results using the CoNLL 2018 UD evaluation script, I get an error about multiple roots in a sentence. Indeed, a few out of the roughly 2,000 sentences are parsed to include two roots. From my understanding, spaCy mistakenly treats these sentences as documents composed of two separate sentences (despite the lack of '.'). This is somewhat undesirable, as I've used someone else's models for Polish, and they do not have this problem. Would you have any ideas regarding how to fix this?
Your Environment