Anyone? :sweat_smile: @honnibal @adrianeboyd
I did some additional training on this, and when performing an extrinsic evaluation on the NER task I'm not seeing any improvement from the pretrained models saved at later epochs. For the model I'm currently training (now at epoch 45), the best checkpoint for NER is still model0.bin, from the first epoch.
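(For context: the extrinsic evaluation just means re-running the NER training with each pretraining checkpoint as the tok2vec initialization, roughly along the lines of the sketch below. Paths and corpus files are placeholders, and this assumes the spaCy v2 CLI.)

```bash
# Illustrative sketch (spaCy v2 CLI); corpus paths and checkpoint names are placeholders.
# Train the NER component, initializing its tok2vec layer from a pretraining checkpoint.
python -m spacy train pt ./ner_from_model0 ./train.json ./dev.json \
    --pipeline ner \
    --vectors ./pt_vectors_model \
    --init-tok2vec ./pretrain_output/model0.bin

# Repeat with later checkpoints (model10.bin, model45.bin, ...) and compare the NER F-scores.
```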
For my current benchmark, these are the NER F results I'm getting from the spacy NER training:
Hi @pvcastro! These kinds of project-specific discussions are sometimes better held in a different forum, e.g. with a larger NLP/ML community. On this tracker we tend to focus more on bug reports and feature requests, and unfortunately we don't always have the bandwidth to discuss specific projects, use cases and outcomes in detail.
It's also often difficult to form an opinion without having gone through the data. For instance, I would indeed expect a domain-specific corpus to be more helpful than a general one, but in any project I would probably try out both and perform a quick evaluation to make that final decision. Likewise, there is no universally valid answer to questions such as "how big should a corpus be" or "how many iterations should the training run" - I'm afraid you'll really have to determine these empirically.
Sorry not to be of more help... It does look like you're getting some promising first results, so I hope you'll be able to build on those :-)
If you do run into more specific technical issues with the code / library - feel free to open a new issue!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hi there!
I'm trying to run pretraining for Portuguese, but I'm having a hard time finding the right settings. I have several corpora available, ranging from a Wikipedia dump with around 230 million words to a domain-specific corpus with 6 billion words. I first started with a general-domain common crawl corpus with 3.5 billion words, but it was taking too long: even on a GPU I was only getting around 3k wps, and increasing the batch size didn't help much. For every 5 rows of progress, I'm getting something like this:
I'm not sure what it means to get 4 logging steps with 2k wps each and one with 36k. This sample is using the default batch size. It ran for almost 24 hours and I couldn't even finish the first iteration.
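(For reference, these runs are plain spacy pretrain invocations roughly like the one below; file names are placeholders, and the batch-size flag just shows where I experimented with larger values, since the log above used the default. This assumes the spaCy v2 CLI.)

```bash
# Illustrative sketch (spaCy v2 CLI); file names are placeholders.
# texts.jsonl holds one {"text": "..."} object per line; the vectors model
# supplies the static embeddings that the pretraining objective predicts.
python -m spacy pretrain ./texts.jsonl ./pt_vectors_model ./pretrain_output \
    --batch-size 4000 \
    --n-iter 1000
```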
Then I realized I should start smaller and took a small sample from this corpus: just the first 100,000 sentences (1.1 million tokens). Pretraining ran smoothly, averaging 30k wps at every progress log, and the default 1000 iterations finished after about 15 hours.
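(The sampling itself is trivial; something along these lines produces the JSONL input that spacy pretrain expects. File names are placeholders, and this assumes the raw corpus has one sentence per line.)

```bash
# Take the first 100,000 sentences (one per line) and wrap each one as a
# {"text": ...} JSON object, the JSONL format that spacy pretrain reads.
head -n 100000 corpus_pt.txt \
    | jq -R -c '{text: .}' \
    > sample_100k.jsonl
```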
This small 100k-sentence sample was enough to boost a Portuguese NER benchmark we have from 61% using word embeddings to 65% using only the pretrained model (which I still find very low, since CNN architectures similar to Collobert's SENNA model were already getting 71% on this benchmark, but that's a subject for another issue). On another domain-specific NER corpus I have, I got a smaller boost, from 90.88% with static word embeddings to 91.50% with the pretrained model alone.
Since I got relatively good results with such a small corpus, I'm eager to train an optimal one. So I'm wondering how I could get the most out of this pretraining using larger corpora:
Thanks!