Questions and Directives for Pretraining

pvcastro commented 4 years ago

My Environment

spaCy version: 2.2.2
Platform: Linux-5.0.0-25-generic-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.8
Machine: AMD Ryzen Threadripper 2950X 16-Core Processor with 128 Gb RAM
GPU: 3 x RTX 2080ti

Hi there!

I'm trying to perform a pretraining for Portuguese, but I'm having a hard time getting the proper settings for my training. I have several corpora available I could use, ranging from a Wikipedia with around 230 million words to a domain specific corpus with 6 billion words. I first started with a general domain common crawl corpus with 3.5 billion words, but it was taking too long, even using GPU I was getting only around 3k wps. Increasing the batch size didn't help much. For every 5 rows of progress, I'm getting something like this:

0 163883838 33337653 10823 2269 0 163944872 33348885 11231 2304 0 164004651 33359701 10816 2258 0 164065755 33370757 11055 2297 0 164125760 33381669 10911 36745

Not sure what it means to get 4 logging steps with 2k wps each and one with 36k. This sample is using the default batch size. This ran for almost 24 hours and I couldn't even finish the first iteration.

Then I realized that I should start smaller, and got a small sample from this corpus, getting only the first 100.000 sentences from the corpus (1.1 million tokens). The pretrain ran smoothly, with an average of 30k wps at every progress log, and the default 1000 iterations were finished after about 15 hours.

This small sample of 100k sentences was enough to boost a NER benchmark we have for Portuguese from 61% using word embeddings to 65% using only the pretrained model (which I still find very low, since CNN architectures similar to the SENNA model from Collobert were already getting 71% in this benchmark, but this is a subject for another issue). For another domain specific NER corpus I have, I got a smaller boost from 90.88% using static word embeddings to 91.50% using the pretrained model alone.

Since I got relatively good results wich such a small corpus, I'm eager to train an optimal one. So I'm wondering how could I get the most out of this pretraining using larger corpora:

How big should be the corpus, at least?
Should I expect that a domain specific (legal) corpus would give me the best results for downstream tasks?
What should I do to get the best wps for the pretraining? I saw some reported 50k wps for every step of the iterations, how can I get this?
How many iterations are necessary to give the best results for downstream tasks, such as NER?
How do I pick the final model, from all the models available per iteration?

Thanks!

pvcastro commented 4 years ago

Anyone? :sweat_smile: @honnibal @adrianeboyd

I did some additional training on this and performing an extrinsic evaluation on the NER task, I'm not seeing any improvement on pretrained models obtained from later epochs. For the current model I'm training (which is currently on epoch 45), the best one for NER is still the model0.bin, from the first epoch.

For my current benchmark, I'm currently getting these results for NER F from spacy ner training:

No vectors / No language model: 87.74%
Pretrained Vectors / No language model: 88.27%
No vectors / Pretrained language model (wikipedia, first iteration, best current model): 88.63%

svlandeg commented 4 years ago

Hi @pvcastro! These kind of project-specific discussions are sometimes better held at a different forum, e.g. with a larger NLP/ML community. On this tracker we tend to focus more on bug reports and feature requests, and unfortunately we don't always have the bandwidth to discuss specific projects, use-cases and outcomes in detail.

It's also often difficult to form an opinion without having gone through the data. For instance, I would indeed expect a domain-specific corpus to be more helpful than a general one, but in any project I would probably try out both and perform a quick evaluation to make that final decision. Likewise, there is no universally valid answer to questions such as "how big should a corpus be" or "how many iterations should the training run" - I'm afraid you'll really have to determine these empirically.

Sorry to not be of much more help... It does look like you're obtaining some first promising results so I hope you'll be able to build on those :-)

If you do run into more specific technical issues with the code / library - feel free to open a new issue!

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

explosion / spaCy

Questions and Directives for Pretraining #4605

My Environment