explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Multiple roots per sentence #4306

Closed ryszardtuora closed 5 years ago

ryszardtuora commented 5 years ago

How to reproduce the behaviour

I'm trying to train the parser for Polish. I'm using the PDB treebank in CoNLL-U format (because it contains one sentence per paragraph, I used the option --n-sents=1 while converting). The results are close to 84% UAS, somewhat weaker than I would expect (having used other parsers before), so I'm not sure I'm doing everything right. The treebank contains a few percent of non-projective trees, so I apply gold preprocessing. When I try to evaluate the results using the CoNLL 2018 UD evaluation script, I get an error about multiple roots in a sentence. Indeed, a few out of two thousand sentences are parsed to include two roots. From my understanding, spaCy mistakenly treats these sentences as documents composed of two separate sentences (despite the lack of a '.'). This is undesirable, as I've used other people's models for Polish, and they do not have this problem. Would you have any ideas on how to fix this?

Your Environment

adrianeboyd commented 5 years ago

The conllu evaluation script should be able to handle differences in sentence segmentation, so there's probably something wrong with the output formatting. How are you exporting the parses to conllu after parsing with the model? Is there a blank line in between the two spacy sentences in the case that spacy has split one original sentence into two sentences?
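To illustrate the formatting detail in question: each sentence block in a CoNLL-U file must be terminated by a blank line, or evaluation scripts cannot tell where one sentence ends. A minimal, hypothetical sketch (generic token tuples rather than the real spaCy API, most columns left as "_"):

```python
def sentences_to_conllu(sentences):
    """Format tokenized sentences as CoNLL-U text (hypothetical helper).
    Each sentence is a list of (form, head, deprel) tuples, with head
    given as a 1-based index into the sentence (0 = root).  The key
    detail is the blank line after every sentence block."""
    blocks = []
    for sent in sentences:
        lines = []
        for i, (form, head, deprel) in enumerate(sent, start=1):
            # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            lines.append("\t".join([str(i), form, "_", "_", "_", "_",
                                    str(head), deprel, "_", "_"]))
        blocks.append("\n".join(lines))
    # Blank line between (and after) sentence blocks
    return "\n\n".join(blocks) + "\n\n"
```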

When training a general-purpose model, it's usually better to have paragraph-like data where you set --n-sents to something like 10.

Alternatively (and I don't really recommend this but you can do it if you need it for a particular type of evaluation), you can force the parser to preserve the existing sentence segmentation by setting the property is_sent_start on each token. Set it to True for the first token in the sentence and False for any other tokens. It's probably easiest to create a tiny custom pipeline component that you add to the pipeline before the parser.
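A minimal sketch of such a component, assuming each incoming doc is exactly one sentence (the component name and model loading are illustrative, spaCy v2-era API):

```python
def prevent_sentence_segmentation(doc):
    """Mark only the first token as a sentence start so the parser
    cannot introduce extra sentence boundaries within the doc."""
    for token in doc:
        token.is_sent_start = (token.i == 0)
    return doc

# Hypothetical usage (spaCy v2 API), added before the parser:
# nlp = spacy.load("pl_model")
# nlp.add_pipe(prevent_sentence_segmentation, name="prevent-sbd", before="parser")
```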

ryszardtuora commented 5 years ago

OK, the problem no longer appears once I train the parser the "orthodox" way, and I now have satisfactory parsing results. The problem is that I've been trying to avoid this, because I need to train the parser on one corpus (which sadly does not contain any documents longer than one sentence) and the tagger on another, much bigger corpus. I'm still not sure how to do this, because as far as I understand, pipeline components may alter vector representations for their own purposes, so training them in sequence is not really an option. Can you give me some advice on how to proceed?

adrianeboyd commented 5 years ago

You should be able to train the tagger on one corpus:

```
spacy train -p tagger pl tagger-model train.json dev.json
```

And then use the tagger model as the base model when training the parser on different data:

```
spacy train -b tagger-model -p parser pl parser-model train.json dev.json
```

I would still recommend something like --n-sents 10 for the parser; otherwise it won't learn how to set sentence boundaries very well. It would obviously be better if the data had real paragraphs, but this is also exactly how the distributed spaCy models are trained when the corpora don't provide paragraph information.

ryszardtuora commented 5 years ago

Thank you for your quick reply!

That's what I've been doing: I'm trying to train the tagger now, based on an existing model (vocab + parser).

This is my exact command:

```
spacy train pl outdir fullcorp_train.json fullcorp_test.json --base-model fasttextmodel --pipeline tagger --n-iter 100 --n-early-stopping 5 --gold-preproc -G --verbose -VV
```

Where fasttextmodel is simply the model-best folder generated while training the parser.

The error I get is this:

```
Traceback (most recent call last):
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 257, in train
    losses=losses,
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/language.py", line 475, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "pipes.pyx", line 449, in spacy.pipeline.pipes.Tagger.update
  File "pipes.pyx", line 80, in spacy.pipeline.pipes.Pipe.require_model
ValueError: [E109] Model for component 'tagger' not initialized. Did you forget to load a model, or forget to call begin_training()?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/rtuora/.local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/rtuora/.local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 363, in train
    with nlp.use_params(optimizer.averages):
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/language.py", line 675, in use_params
    next(context)
  File "pipes.pyx", line 557, in use_params
AttributeError: 'bool' object has no attribute 'use_params'
```

By the way, I've tried using --n-sents 10 for the parser, but the results I get are slightly worse (though maybe that's because the evaluation also uses corpora with one-sentence documents only).

adrianeboyd commented 5 years ago

Hmm, sorry, my first answer was wrong, since -b is just for extending existing pipeline components.

Even though it's the exact same situation for many distributed models (often the NER models aren't trained on the same data as the tagger/parser), I think the train CLI doesn't handle this case very well.

So this is not a great answer, but I think you'll currently have to train the tagger and parser separately and combine them by hand. As long as you use the same vectors, you should be able to copy the parser's model-best/parser folder into the tagger's model-best folder. Double-check that the vocab/ folders in the two models are identical.

Then update meta.json to add parser to the pipeline so it knows to load it. You can also add all the accuracy info, but I think that's just metadata and not required for the model to work. This is basically how the distributed models are trained and assembled.
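The meta.json step above can be sketched as a small helper. This is a hypothetical illustration (the function name and directory layout are assumptions, not spaCy API); it appends a component name to the "pipeline" list in the combined model's meta.json:

```python
import json
from pathlib import Path

def add_component_to_meta(model_dir, component):
    """Add a component name to a model's meta.json pipeline list
    (hypothetical helper; model_dir is the combined model folder)."""
    meta_path = Path(model_dir) / "meta.json"
    meta = json.loads(meta_path.read_text())
    pipeline = meta.setdefault("pipeline", [])
    if component not in pipeline:
        pipeline.append(component)
    meta_path.write_text(json.dumps(meta, indent=2))

# e.g. add_component_to_meta("tagger-model/model-best", "parser")
```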

The tagger and parser models are independent and shouldn't modify the vocab/vectors during training.

Try it and see if you get the same results as with the independent tagger and parser models in your evaluations?

We should think about how/whether the train CLI could be improved to handle this, since it's a pretty common case.

adrianeboyd commented 5 years ago

Wait, I overlooked that it will update meta.json for you with -m. I still think you'll have to combine the model directories by hand, though. Because of how it saves the intermediate model results while training, you wouldn't want it to write to the same directory as your existing model. I'll try out a full example later today to be sure...

ryszardtuora commented 5 years ago

The vocab folders are not identical, as all files except for the vectors themselves differ in size. Nevertheless, everything seems to work fine now. Thank you for your advice!

adrianeboyd commented 5 years ago

I do think that the train CLI should handle this case better, but I'll open that as a separate issue, since we've moved away from the original multiple roots problem.

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.