codertimo / BERT-pytorch

Google AI 2018 BERT pytorch implementation
Apache License 2.0

pred_loss decrease fast while avg_acc stay at 50% #32

Open jiqiujia opened 5 years ago

jiqiujia commented 5 years ago

I tried running the code on a small dataset and found that pred_loss decreases quickly while avg_acc stays at 50%. This is strange to me, since a decrease in pred_loss should indicate an increase in accuracy. [screenshot]

wenhaozheng-nju commented 5 years ago

I also ran into the same problem on a small dataset.

NiHaoUCAS commented 5 years ago

me too

codertimo commented 5 years ago

Hmm, interesting... Is this the result of the 0.0.1a4 version? And how did you guys print out that result?

NiHaoUCAS commented 5 years ago

> Hmm, interesting... Is this the result of the 0.0.1a4 version? And how did you guys print out that result?

It was the 0.0.1a3 version; the result was printed by the bert command, without any modification.

jiqiujia commented 5 years ago

> Hmm, interesting... Is this the result of the 0.0.1a4 version? And how did you guys print out that result?

I tried 0.0.1a4 and the result is the same.

codertimo commented 5 years ago

Hmmm... anyone have any clues?

yangze01 commented 5 years ago

I tried different data: continuous sentence pairs from the same document, continuous sentences concatenated into longer sentences, and query/document pairs; the result is the same. I also found that there is a big gap between next_loss and mask_loss, although they use the same loss function. [screenshot]

cairoHy commented 5 years ago

Probably the criterion loss function is the problem.

import torch
import torch.nn as nn

# shape [10, 2]: log-probabilities that always favour class 1
out = torch.tensor([[ -8.4014,  -0.0002],
        [-10.3151,  -0.0000],
        [ -8.8440,  -0.0001],
        [ -7.5148,  -0.0005],
        [-11.0145,  -0.0000],
        [-10.9770,  -0.0000],
        [-13.3770,  -0.0000],
        [ -9.5733,  -0.0001],
        [ -9.5957,  -0.0001],
        [ -9.0712,  -0.0001]])
# shape [10], next sentence label
label = torch.tensor([1,1,0,1,0,0,1,0,0,1])
original_criterion = nn.NLLLoss(ignore_index=0)
criterion = nn.NLLLoss()
original_loss = original_criterion(out, label)
loss = criterion(out, label)

With the above code snippet, original_loss is 0.0002, while loss is 5.0005.
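
A quick illustrative sketch, reusing the tensors above, shows why avg_acc stays at 50% even as that loss goes to nearly zero: the outputs always predict class 1, which NLLLoss(ignore_index=0) never penalises, while half of the labels are 0.

# The model above always picks class 1, so with ignore_index=0 its loss is
# near zero, yet only the 5 of 10 examples whose label is 1 are correct.
pred = out.argmax(dim=-1)                    # tensor of all ones
acc = (pred == label).float().mean().item()  # 0.5 -> the reported 50% avg_acc
print(acc)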

I changed the following code in trainer/pretrain.py:

self.criterion = nn.NLLLoss(ignore_index=0)

to:

self.criterion = nn.NLLLoss()

And since the magnitude of next_loss is smaller than mask_loss, I also over-weighted next_loss, and got 58% next-sentence accuracy after training on my corpus for one epoch.

jiqiujia commented 5 years ago

> Probably the criterion loss function is the problem. […]

That's right, I just figured it out. Also note that for the masked LM we still need ignore_index=0, since we only want to predict the masked words.
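
Putting both points together, a minimal sketch of such a criterion setup could look like the following (variable names and tensor shapes are illustrative, not the repo's exact code):

import torch
import torch.nn as nn

# Next-sentence prediction uses labels 0/1, so class 0 must NOT be ignored;
# masked-LM targets use 0 for padding/unmasked positions, so there it should be.
next_criterion = nn.NLLLoss()
mask_criterion = nn.NLLLoss(ignore_index=0)

batch, seq_len, vocab = 4, 16, 1000
next_logp = torch.log_softmax(torch.randn(batch, 2), dim=-1)               # [B, 2]
mask_logp = torch.log_softmax(torch.randn(batch, seq_len, vocab), dim=-1)  # [B, S, V]
is_next = torch.randint(0, 2, (batch,))               # 0 = not next, 1 = is next
lm_label = torch.randint(0, vocab, (batch, seq_len))  # 0 marks unmasked positions

next_loss = next_criterion(next_logp, is_next)
mask_loss = mask_criterion(mask_logp.transpose(1, 2), lm_label)
loss = mask_loss + next_loss  # next_loss can be up-weighted as described above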

codertimo commented 5 years ago

@cairoHy Wow, thank you for your smart analysis.

I just fixed this issue on the 0.0.1a5 version branch. The changes are here:

https://github.com/codertimo/BERT-pytorch/blob/2a0b28218f4fde216cbb7750eb584c2ada0d487b/bert_pytorch/trainer/pretrain.py#L61-L62

https://github.com/codertimo/BERT-pytorch/blob/2a0b28218f4fde216cbb7750eb584c2ada0d487b/bert_pytorch/trainer/pretrain.py#L98-L102

codertimo commented 5 years ago

Thanks to everyone who joined this investigation :) It was totally my fault, and I'm sorry for the inconvenience during the bug fixing.

Additionally, can anyone test the new code with their own corpus? Any feedback would be welcome; you can reinstall the new version using the commands below.

git clone https://github.com/codertimo/BERT-pytorch.git
git checkout 0.0.1a5
pip install -U .

Special thanks to @jiqiujia @cairoHy @NiHaoUCAS @wenhaozheng-nju

jiqiujia commented 5 years ago

@cairoHy After the modification, the model doesn't converge. Any suggestions?

codertimo commented 5 years ago

@jiqiujia Can you share the details, like figures or logs?

jiqiujia commented 5 years ago

@codertimo The loss just doesn't converge. [screenshot]

codertimo commented 5 years ago

bert-small-25-logs.txt This is the result on my 1M-line corpus after 1 epoch; anyone is welcome to review it.

yangze01 commented 5 years ago

@codertimo Could you please show your parameter settings?

codertimo commented 5 years ago

@yangze01 Just the default params, with batch size 128.

yangze01 commented 5 years ago

@codertimo I think this code has a bug: if len(t1) is longer than seq_len, bert_input will contain only t1, and segment_label will likewise contain only the segment labels of t1. [screenshot]
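
For reference, a common way to handle this (similar to the original BERT preprocessing; only a sketch, not this repo's code) is to trim the longer of the two sentences until the pair fits within the sequence length:

def truncate_seq_pair(tokens_a, tokens_b, max_len):
    # Pop tokens from the longer sequence until the combined pair fits.
    while len(tokens_a) + len(tokens_b) > max_len:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()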

codertimo commented 5 years ago

I know, but in my corpus each line is usually fewer than 10 tokens, and seq_len should be set appropriately by the user. I don't think it's a bug, and it doesn't belong in this thread.

wenhaozheng-nju commented 5 years ago

@codertimo I think the next-sentence sampling has a serious bug. Suppose 'B' is the next sentence of 'A'; you may never sample a negative instance with 'A'.

codertimo commented 5 years ago

@wenhaozheng-nju I did negative sampling

https://github.com/codertimo/BERT-pytorch/blob/0d076e09fd5aef1601654fa0abfc2c7f0d57e5d9/bert_pytorch/dataset/dataset.py#L92-L99

https://github.com/codertimo/BERT-pytorch/blob/0d076e09fd5aef1601654fa0abfc2c7f0d57e5d9/bert_pytorch/dataset/dataset.py#L114-L125

wenhaozheng-nju commented 5 years ago

@codertimo Suppose the dataset is: A \t B; B \t C; C \t D; D \t E. After your preprocessing it becomes: A \t B; B \t Random; C \t D; D \t Random. The negative instance "A \t Random" may never be sampled.
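
For context, a minimal sketch of per-example on-the-fly sampling (hypothetical helper, not the repo's random_sent code), in which every first sentence can eventually appear in both positive and negative pairs:

import random

def sample_pair(t1, t2, corpus_sentences):
    # With probability 0.5 keep the true next sentence (is_next label 1),
    # otherwise swap in a random sentence from the corpus (label 0).
    if random.random() < 0.5:
        return t1, t2, 1
    return t1, random.choice(corpus_sentences), 0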

codertimo commented 5 years ago

@wenhaozheng-nju Hmm, but do you think that's the main problem behind this issue? I'd guess it's a model problem.

wenhaozheng-nju commented 5 years ago

@codertimo Yes, the model should sample a positive and a negative instance for each sentence, as in a sentence-pair classification problem. I think the two tasks are the same.

codertimo commented 5 years ago

@wenhaozheng-nju Then do you think that if I change the negative sampling code as you suggest, this issue would be resolved?

yangze01 commented 5 years ago

@codertimo I think everyone here wants to solve the problem; calm down, let's focus on the issue. @wenhaozheng-nju If you think that's the problem, you can try modifying the code and running it. (But I don't think it's the main problem; random negative sampling is a commonly used strategy.)

jiqiujia commented 5 years ago

I removed dropout in all layers and now my model converges. Maybe dropout in every layer is too strong a regularization for small datasets? Or there is something wrong with the dropout in this implementation. After 900 epochs, my training set reaches an accuracy of 81%. [screenshot]

@wenhaozheng-nju If you have any other problems, please open another issue.

yangze01 commented 5 years ago

@jiqiujia Wow, that's cool. How long are the sentences in your corpus?

jiqiujia commented 5 years ago

I set the --seq_len parameter to 32.

codertimo commented 5 years ago

@jiqiujia Looks pretty awesome!! Can you share the full training logs as a file? And how big is your corpus? I would like to know the details. Thank you for your effort; it's really helpful to us.

codertimo commented 5 years ago

@jiqiujia I trained on my dataset for 10 hours last night, with dropout rate 0.0 (i.e. no dropout) and dropout rate 0.1. Unfortunately, neither test loss converged.

[screenshot]

yangze01 commented 5 years ago

@jiqiujia Could you share more details? I trained with 1,000,000 samples, seq_len 64, vocab_size 100,000, and dropout = 0, but the result is the same as before.

jiqiujia commented 5 years ago

My parameter settings are as follows, and I set the next_sentence loss weight to 5 (it should probably be annealed, or just set to 1). I only have about 10,000 sentence pairs and the vocab size is about 4,000. [screenshot] By the way, I also tried a test based on OpenNMT-py's transformer implementation, but it failed to converge. I noticed some differences between the implementations; the Transformer seems to be tricky.

jiqiujia commented 5 years ago

I've tried varying some parameters and it seems that on my dataset they don't have much impact; only dropout is critical. But my dataset is rather small; I chose a small dataset just for debugging, and I will try some larger datasets. Hope this is helpful. You're welcome to share your experiments.

jiqiujia commented 5 years ago

And this is roughly the whole training log. The accuracy seems to get stuck at 81% in the end. [Uploading _gaiastack_log_stdout (3).log…]()

Kosuke-Szk commented 5 years ago

It works well in my code. Accuracy got over 90%.

The code is based on version 0.0.1a3. I've changed 3 parts of it.

First, turn off dropout in every layer: dropout = 0.0

Second, fix the NLLLoss setting: change self.criterion = nn.NLLLoss(ignore_index=0) to self.criterion = nn.NLLLoss()

Third, fix how the prob variable is handled:

prob = random.random()
if prob < 0.15:
    # rescale prob to [0, 1) within the 15% of positions selected for masking
    prob /= 0.15

    # 80% randomly change token to mask token
    if prob < 0.8:
        tokens[i] = self.vocab.mask_index

    # 10% randomly change token to random token
    elif prob < 0.9:
        tokens[i] = random.randrange(len(self.vocab))

After 999 epochs, the result is as below: [screenshot]

The parameter settings are:

hidden=256
layers=8
attn_heads=8
seq_len=32
batch_size=256
epochs=1000
num_workers=5
with_cuda=True
log_freq=50
corpus_lines=None
lr=1e-4
adam_weight_decay=0.01
adam_beta1=0.9
adam_beta2=0.999
dropout=0.0

The dataset is as follows:

Language : Japanese
Vocab size : 4670
Sentences amount : 1000

Of course, the changes I described above have already been made in the latest version. But if you haven't changed some parts of the code, it may not work well, so please check.

codertimo commented 5 years ago

@Kosuke-Szk Thank you for sharing your result with us. After I saw @Kosuke-Szk's result, I thought, "Isn't our model pretty small to train?" As you know, we reduced the model size to make it trainable on our GPUs, and the training result was bad. However, similar code (almost the same as 0.0.1a4) works with a smaller vocab size and dataset. So if we make our model bigger, will it work? I think it's a kind of underfitting, not just a problem with the model. Does anyone have ideas about this issue?

wangwei7175878 commented 5 years ago

Hi there, I trained the model on a big dataset (wiki 2500M + BooksCorpus 800M words, same as the BERT paper) for 200,000 steps and achieved an accuracy of 91%.

[screenshot]

I set weight decay = 0; I think using one of (dropout, weight decay) is enough.

codertimo commented 5 years ago

@wangwei7175878 Wow, this is brilliant; it's a really huge step for us. Thank you for your effort and computation resources. Is there any result that used the default weight_decay? And can you share the full log as a file?

Original corpus

How did you get the original corpus? I tried very hard to get it, but I failed... I even emailed the authors to ask for the original corpus, but failed. If possible, can you share it, so that I can test the real performance?

briandw commented 5 years ago

> Hi there, I trained the model on a big dataset (wiki 2500M + BooksCorpus 800M words, same as the BERT paper) for 200,000 steps and achieved an accuracy of 91%.

@wangwei7175878 Can you share your pre-trained model? I'm really looking forward to trying this out, but I don't have that kind of processing power.

Thank you for your efforts.

wangwei7175878 commented 5 years ago

@codertimo The model can't converge with weight_decay = 0.01. My dataset is not exactly the original corpus, but I think it is almost the same. The wiki data can easily be downloaded from https://dumps.wikimedia.org/enwiki/, and you need a web spider to get BooksCorpus from https://www.smashwords.com/

wangwei7175878 commented 5 years ago

@briandw My pre-trained model failed on downstream tasks (the fine-tuned model can't converge). I will share the pre-trained model once it works.

wangwei7175878 commented 5 years ago

@codertimo Here is the whole log. It took me almost one week to train for about 250,000 steps. The accuracy seems to be stuck at 91%, which is reported as 98% in the original paper. log_run2_hhh_all_data_next_weight_1_no_decay.txt

codertimo commented 5 years ago

@wangwei7175878 Can you share your crawling and preprocessing code on the issue above? Or, if possible, can you share the full corpus via a shared drive (Dropbox, Google Drive, etc.)? This would be really helpful to us.

codertimo commented 5 years ago

@wangwei7175878 Very interesting; the authors said 0.01 weight decay is the default parameter they used. What are your parameter settings? Are they the same as our code's defaults except for weight_decay?

wangwei7175878 commented 5 years ago

Hi there, I believe I figured out why the model can't converge with weight_decay = 0.01. Following OpenAI's code here: I think BERT used AdamW instead of Adam. After rewriting the Adam code in PyTorch, my model can now converge with the default settings.
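
As a rough sketch of the decoupled weight decay being described (recent PyTorch versions ship torch.optim.AdamW; the tiny module below is only a stand-in for the BERT language model):

import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for the BERT LM
# AdamW applies weight decay directly to the weights instead of folding an
# L2 term into the gradients, which is the difference described above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)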

codertimo commented 5 years ago

@wangwei7175878 Sounds great! Can you make a pull request with your AdamW implementation? I'll test it on my corpus too 👍

waynedane commented 5 years ago

I used my corpus; after three epochs, the accuracy is 73.54%. I set weight_decay = 0; the other parameters are the defaults. Training continues.

shionhonda commented 5 years ago

Just for your reference: I also confirmed the accuracy increase following @Kosuke-Szk's suggestion. [loss and accuracy plots]

Though the model was scaled down to a really small one due to the memory limitation (< 12 GB), it still worked. The hyperparameters were:

hidden=240 #768
layers=3 #12
attn_heads=3 #12
seq_len=30 # 60
batch_size=8 #32
epochs=10
num_workers=4#5
with_cuda=True
log_freq=20
corpus_lines=None
lr=1e-3
adam_weight_decay=0.00
adam_beta1=0.9
adam_beta2=0.999
dropout=0.0
min_freq=20 #7

I used a 13 GB English Wikipedia corpus with a vocabulary size of 775k, but I stopped the job at just 2% of the first epoch because it said it would take thousands of hours.

zheolong commented 5 years ago

> Hi there, I trained the model on a big dataset (wiki 2500M + BooksCorpus 800M words, same as the BERT paper) for 200,000 steps and achieved an accuracy of 91%. […]

I need your machine, system, and GPU configuration, thanks.

And I've also built the wiki + BooksCorpus dataset; I will publish docs to help with reconstructing it.