XuezheMax / NeuroNLP2

Deep neural models for core NLP tasks (Pytorch version)
GNU General Public License v3.0

Training for Portuguese #16

Closed pvcastro closed 3 years ago

pvcastro commented 6 years ago

Hi @XuezheMax,

In his LSTM-CRF architecture, Lample has a parameter to lowercase words before they are looked up in the embeddings table. This is particularly useful for my Portuguese training because I'm using pre-trained embeddings that contain only lowercase words. So, if a word is not found in the embeddings table because it starts with an uppercase letter, it ends up hitting the UNK vector.

Do you have support for this in your model? If not, can you indicate where I should make changes to account for this?

Thanks!

XuezheMax commented 6 years ago

Actually, I have considered this, but in a slightly different way. When I create the vocab, the capitalization info is kept. But when the model looks up the embedding for a word, it first looks up the original form; if that is not in the pre-trained embedding table, it lower-cases the word and looks it up again. Does this pattern work for you?
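For illustration, here is a minimal sketch of that case-fallback lookup (the names get_embedding, embedd_dict, and unk_vector are hypothetical, not the actual NeuroNLP2 code):

```python
import numpy as np

def get_embedding(word, embedd_dict, unk_vector):
    """Look up the original word first; if it is missing from the
    pre-trained table, retry with the lower-cased form; otherwise
    fall back to the UNK vector."""
    if word in embedd_dict:
        return embedd_dict[word]
    if word.lower() in embedd_dict:
        return embedd_dict[word.lower()]
    return unk_vector

# Toy example: an embedding table that only contains lower-cased words.
embedd_dict = {"castro": np.random.randn(100), "pedro": np.random.randn(100)}
unk_vector = np.zeros(100)
vec = get_embedding("Castro", embedd_dict, unk_vector)  # hits "castro", not UNK
```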


pvcastro commented 6 years ago

Using Lample's LSTM-CRF I got an F1 score of around 76% on my Portuguese corpus, but only 71% using yours. One of the parameters in his model that made quite a difference was lowercasing before looking up the embeddings. If your model already accounts for this, so embeddings aren't missed because of case, then I should try tuning other parameters for Portuguese. I suspect the score is much lower than for English because the training corpus is much smaller (4,600 sentences for training and around 2,000 for testing, with no dev set) and the character vocabulary is much bigger, at 129 characters; Portuguese has many more characters than English. The Portuguese embeddings are also much more extensive, with over 900,000 entries. Do you have any suggestions as to what I could tune in your network to improve the score?

XuezheMax commented 6 years ago

Sorry for the late reply.

Yes, for NER on "low-resource" languages, we have to tune the model because of the small datasets. One suggestion is to look at the script 'run_ner_ger.sh' in the same folder. It contains the hyper-parameters I used for German, and it outperformed Lample's model.

Another thing you can try is to (manually) lower-case the words to reduce the size of the vocab. It might work for certain languages and/or datasets.
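As a rough sketch of that preprocessing step, something like the following could lower-case only the word column of a whitespace-separated CoNLL-style file (the column index and the output file name are assumptions about the data layout, not part of the repo):

```python
def lowercase_conll(src_path, dst_path, word_column=0):
    """Copy a whitespace-separated CoNLL-style file, lower-casing only the
    word column so the label columns stay untouched."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip("\n").split()
            if not fields:  # keep sentence-separating blank lines
                dst.write("\n")
                continue
            fields[word_column] = fields[word_column].lower()
            dst.write(" ".join(fields) + "\n")

# Adjust word_column to match your file's layout; the output path is illustrative.
lowercase_conll("data/conll2003/portuguese/filtered_train.txt",
                "data/conll2003/portuguese/filtered_train_lower.txt")
```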


pvcastro commented 6 years ago

I ran a couple of times with the hyperparameters from run_ner_ger and couldn't even get to 71%; the default parameters were actually slightly better, getting past 71.

FYI, here are the alphabet details:

2018-06-11 09:31:00,488 - NERCRF - INFO - Creating Alphabets
2018-06-11 09:31:00,521 - Create Alphabets - INFO - Word Alphabet Size (Singleton): 20970 (8360)
2018-06-11 09:31:00,521 - Create Alphabets - INFO - Character Alphabet Size: 129
2018-06-11 09:31:00,521 - Create Alphabets - INFO - POS Alphabet Size: 2
2018-06-11 09:31:00,521 - Create Alphabets - INFO - Chunk Alphabet Size: 2
2018-06-11 09:31:00,521 - Create Alphabets - INFO - NER Alphabet Size: 12
2018-06-11 09:31:00,521 - NERCRF - INFO - Word Alphabet Size: 20970
2018-06-11 09:31:00,521 - NERCRF - INFO - Character Alphabet Size: 129
2018-06-11 09:31:00,521 - NERCRF - INFO - POS Alphabet Size: 2
2018-06-11 09:31:00,521 - NERCRF - INFO - Chunk Alphabet Size: 2
2018-06-11 09:31:00,521 - NERCRF - INFO - NER Alphabet Size: 12
2018-06-11 09:31:00,521 - NERCRF - INFO - Reading Data
Reading data from data/conll2003/portuguese/filtered_train.txt
Total number of data: 4749
Reading data from data/conll2003/portuguese/filtered_test.txt
Total number of data: 2087
oov: 67