Open jiqiujia opened 5 years ago
I also ran into the same problem on a small dataset.
me too
Hmm, interesting. Is this the result of the 0.0.1a4 version? And how did you print out that result?
Version 0.0.1a3.
The result is printed by the bert command, without any modification.
I tried 0.0.1a4 and the result is the same.
Hmmm... does anyone have any clues?
I tried different data: consecutive sentence pairs from the same document, consecutive sentences concatenated into longer sentences, and query-document pairs. The result is the same. I also found a big gap between next_loss and mask_loss, although they use the same loss function.
Probably the criterion loss function is the problem.
import torch
import torch.nn as nn

# shape [10, 2]: log-probabilities from a not very accurate model
out = torch.tensor([[ -8.4014, -0.0002],
                    [-10.3151, -0.0000],
                    [ -8.8440, -0.0001],
                    [ -7.5148, -0.0005],
                    [-11.0145, -0.0000],
                    [-10.9770, -0.0000],
                    [-13.3770, -0.0000],
                    [ -9.5733, -0.0001],
                    [ -9.5957, -0.0001],
                    [ -9.0712, -0.0001]])
# shape [10]: next-sentence labels
label = torch.tensor([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])
original_criterion = nn.NLLLoss(ignore_index=0)  # skips every sample labeled 0
criterion = nn.NLLLoss()                         # counts both classes
original_loss = original_criterion(out, label)
loss = criterion(out, label)
With the above code snippet, original_loss is 0.0002 and loss is 5.0005, because ignore_index=0 drops every sample whose label is 0.
I changed the following code in trainer/pretrain.py:
self.criterion = nn.NLLLoss(ignore_index=0)
to:
self.criterion = nn.NLLLoss()
And since the magnitude of next_loss is smaller than that of mask_loss, I also up-weighted next_loss, and got 58% next-sentence accuracy after training on my corpus for one epoch.
That's right. I just figured it out. Also note that for the masked LM we still need ignore_index=0, since we only want to predict the masked words.
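To make the two numbers above easy to check by hand, here is a minimal plain-Python sketch of a mean negative log-likelihood with an optional ignored label. The helper `nll_mean` and the small masked-LM example values are hypothetical and only meant to mirror what the snippet in this thread computes; it is not the project's code.

```python
def nll_mean(log_probs, labels, ignore_index=None):
    """Mean negative log-likelihood over rows whose label is not ignore_index."""
    kept = [(row, y) for row, y in zip(log_probs, labels) if y != ignore_index]
    return sum(-row[y] for row, y in kept) / len(kept)

# Log-probabilities and next-sentence labels copied from the snippet in this thread.
out = [[-8.4014, -0.0002], [-10.3151, -0.0000], [-8.8440, -0.0001],
       [-7.5148, -0.0005], [-11.0145, -0.0000], [-10.9770, -0.0000],
       [-13.3770, -0.0000], [-9.5733, -0.0001], [-9.5957, -0.0001],
       [-9.0712, -0.0001]]
label = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]

# Next sentence: ignoring label 0 means only the "is next" rows are scored.
ignoring_zero = nll_mean(out, label, ignore_index=0)
all_rows = nll_mean(out, label)

# Masked LM (made-up values): label 0 marks "not masked", so skipping it is intended.
mlm_log_probs = [[-0.1, -2.5, -3.0], [-0.2, -1.9, -2.8], [-3.1, -0.3, -1.5]]
mlm_labels = [0, 2, 0]  # only position 1 was masked; its true token id is 2
mlm_loss = nll_mean(mlm_log_probs, mlm_labels, ignore_index=0)

print(round(ignoring_zero, 4), round(all_rows, 4), mlm_loss)
```

With label 0 ignored, the next-sentence loss never sees a "not next" example, so a model that always predicts "is next" looks nearly perfect. For the masked LM the same setting is what we want, because 0 there only marks positions that were not selected for prediction.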
@cairoHy Wow thank you for your smart analysis.
I just fixed this issue on the 0.0.1a5 branch. The changes are here.
Thanks to everyone who joined this investigation :) It was totally my fault, and sorry for the inconvenience during the bug fixing.
Additionally, can anyone here test the new code on their own corpus? Any feedback would be welcome. You can reinstall the new version using the commands below.
git clone https://github.com/codertimo/BERT-pytorch.git
cd BERT-pytorch
git checkout 0.0.1a5
pip install -U .
Special thanks to @jiqiujia @cairoHy @NiHaoUCAS @wenhaozheng-nju
@cairoHy after the modification, the model can't converge. Any suggestions?
@jiqiujia Can you share the details, such as figures or logs?
@codertimo The loss just doesn't converge.
bert-small-25-logs.txt This is the result on my 1M corpus after 1 epoch. Anyone is welcome to review it.
@codertimo Could you please share your parameter settings?
@yangze01 just default params with batch size 128
@codertimo I think this code has an error: if len(t1) is longer than seq_len, bert_input will contain only t1, and segment_label will likewise contain only the segment labels of t1.
I know, but the lines in my corpus are usually shorter than 10 per sentence, and seq_len should be set properly by the user. I don't think it's a bug, and it doesn't belong in this thread.
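The truncation behavior described here can be illustrated with a small sketch. Both functions are hypothetical: `naive_truncate` assumes the input is built by concatenating and then cutting to seq_len, and `pair_truncate` is just one possible alternative that trims the longer sentence first. Segment ids 1 and 2 are an assumption, and special tokens are left out for brevity.

```python
def naive_truncate(t1, t2, seq_len):
    """Concatenate, then cut: t2 vanishes entirely if t1 alone fills seq_len."""
    tokens = (t1 + t2)[:seq_len]
    segments = ([1] * len(t1) + [2] * len(t2))[:seq_len]
    return tokens, segments

def pair_truncate(t1, t2, seq_len):
    """Trim the longer sentence one token at a time until the pair fits."""
    t1, t2 = list(t1), list(t2)
    while len(t1) + len(t2) > seq_len:
        longer = t1 if len(t1) > len(t2) else t2
        longer.pop()
    return t1 + t2, [1] * len(t1) + [2] * len(t2)

t1 = ["a1", "a2", "a3", "a4", "a5", "a6"]
t2 = ["b1", "b2"]
naive_tokens, naive_segments = naive_truncate(t1, t2, seq_len=4)
pair_tokens, pair_segments = pair_truncate(t1, t2, seq_len=4)

print(naive_tokens, naive_segments)
print(pair_tokens, pair_segments)
```

Whether this matters in practice depends on the corpus: when sentences are much shorter than seq_len, as in the reply above, neither function ever truncates.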
@codertimo I think the next-sentence sampling has a serious bug. Suppose 'B' is the next sentence of 'A': you may never sample a negative instance for 'A'.
@codertimo Suppose the dataset is: A \t B; B \t C; C \t D; D \t E; After your preprocessing: A \t B; B \t Random; C \t D; D \t Random; The negative instance "A \t Random" may never be sampled.
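One way to read this suggestion is to draw the positive or negative label each time a pair is requested, instead of fixing it once during preprocessing. The sketch below is only an illustration under that assumption; `sample_pair` and the toy `pairs` list are hypothetical and are not taken from the repository.

```python
import random

def sample_pair(pairs, index, rng):
    """Return (first, second, is_next), drawing the label anew on every call."""
    first, true_next = pairs[index]
    if rng.random() < 0.5:
        return first, true_next, 1
    # Negative example: take the second sentence of some other pair.
    other = rng.choice([i for i in range(len(pairs)) if i != index])
    return first, pairs[other][1], 0

pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
rng = random.Random(0)

# Over many draws (for example, across epochs) "A" shows up with both labels.
labels_seen_for_A = {sample_pair(pairs, 0, rng)[2] for _ in range(200)}
print(labels_seen_for_A)
```

With per-call sampling, every first sentence eventually appears in both a positive and a negative pair, which addresses the "A \t Random is never sampled" concern without doubling the dataset.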
@wenhaozheng-nju Hmmm, but do you think that's the main cause of this issue? I guess it's a model problem.
@codertimo Yes, the model should sample a positive and a negative instance for each sentence, as in a sentence pair classification problem. I think the two tasks are the same.
@wenhaozheng-nju Then do you think that if I change the negative sampling code as you requested, this issue will be resolved?
@codertimo I think everyone here wants to solve the problem. Calm down, let's focus on the issue. @wenhaozheng-nju If you think that's the problem, you can try modifying the code and running it (but I don't think it's the main problem; random negative sampling is a commonly used strategy).
I removed dropout from all layers and now my model converges. Maybe dropout in every layer is too strong a regularization for small datasets? Or there is something wrong with dropout in this model implementation. After 900 epochs, accuracy on my training dataset reached 81%.
@wenhaozheng-nju if you have any other problem, please open another issue.
@jiqiujia Wow, that's cool. How long are the sentences in your corpus?
I set the parameter --seq_len to 32.
@jiqiujia Looks pretty awesome!! Can you share the full training logs as a file? And how big is your corpus? I would like to know the details. Thank you for your effort, it's really helpful to us.
@jiqiujia I trained on my dataset for 10 hours last night, once with dropout rate 0.0 (the same as no dropout) and once with dropout rate 0.1. Unfortunately, the test loss did not converge in either case.
@jiqiujia Could you share more details? I trained with 1000000 samples, seq_len: 64, vocab_size: 100000, dropout = 0, but the result is the same as before.
My parameter settings are as follows, and I set the weight of the next_sentence loss to 5 (I think it should be annealed, or set to 1). I only have about 10000 sentence pairs and the vocab_size is about 4000. By the way, I also tried a test based on opennmt-py's transformer implementation, but it failed to converge. I noticed some differences between the implementations. Transformer seems to be tricky.
I've tried varying some parameters, and on my dataset these parameters don't seem to have much impact. Only dropout is critical. But my dataset is rather small; I chose a small dataset just to debug. I will try some larger datasets. Hope this is helpful. You're welcome to share your experiments.
And this is roughly the whole training log. The accuracy seems to end up stuck at 81%. [Uploading _gaiastack_log_stdout (3).log…]()
It works well with my code. Accuracy went above 90.0.
The code is based on version 0.0.1a3.
I changed 3 parts of this version of the code.
First, turn dropout off in every layer.
dropout = 0.0
Second, fix the NLLLoss setting.
self.criterion = nn.NLLLoss(ignore_index=0)
to
self.criterion = nn.NLLLoss()
Third, fix how the prob variable is set.
prob = random.random()
if prob < 0.15:
    prob /= 0.15
    # 80% randomly change token to mask token
    if prob < 0.8:
        tokens[i] = self.vocab.mask_index
    # 10% randomly change token to random token
    elif prob < 0.9:
        tokens[i] = random.randrange(len(self.vocab))
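The point of the third change is the rescaling `prob /= 0.15`: without it, any prob below 0.15 is also below 0.8, so every selected token would be replaced by the mask token and the 10% / 10% branches would never run. A self-contained sketch of the same idea follows. The function `mask_tokens`, its arguments, and the use of label 0 for unselected positions are assumptions made for illustration, not the repository's exact code.

```python
import random

def mask_tokens(token_ids, mask_index, vocab_size, rng):
    """BERT-style masking; returns (inputs, labels) where label 0 = not selected."""
    inputs, labels = list(token_ids), []
    for i, token in enumerate(token_ids):
        prob = rng.random()
        if prob < 0.15:
            prob /= 0.15  # rescale to [0, 1) so the 80/10/10 split below applies
            if prob < 0.8:
                inputs[i] = mask_index                 # 80%: mask token
            elif prob < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token unchanged
            labels.append(token)
        else:
            labels.append(0)
    return inputs, labels

rng = random.Random(0)
tokens = [rng.randrange(5, 100) for _ in range(10_000)]  # ids 0-4 kept for specials
inputs, labels = mask_tokens(tokens, mask_index=4, vocab_size=100, rng=rng)

selected = sum(1 for y in labels if y != 0)
masked = sum(1 for x in inputs if x == 4)
print(selected / len(tokens), masked / selected)
```

On a long enough sequence, roughly 15% of positions are selected and roughly 80% of those end up as the mask token, which is the split the comments in the original snippet describe.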
After 999 epochs, the result is as below.
The parameter settings are:
hidden=256
layers=8
attn_heads=8
seq_len=32
batch_size=256
epochs=1000
num_workers=5
with_cuda=True
log_freq=50
corpus_lines=None
lr=1e-4
adam_weight_decay=0.01
adam_beta1=0.9
adam_beta2=0.999
dropout=0.0
The dataset is as follows:
Language : Japanese
Vocab size : 4670
Number of sentences : 1000
Of course, the changes I described above have already been made in the latest version. But if you have not changed these parts of the code, it may not work well, so please check.
@Kosuke-Szk Thank you for sharing your result with us. After seeing @Kosuke-Szk's result, I wondered: isn't our model too small to train? As you know, we shrank our model so that it could be trained on our GPUs, and the training result was bad. However, similar code (almost the same as 0.0.1a4) works with a smaller vocab size and dataset. So if we make our model bigger, will it work? I think it's a kind of underfitting, not just a problem with the model. Does anyone have ideas about this issue?
Hi there, I trained the model on a big dataset (wiki 2500M + bookscorpus 800M, same as the BERT paper) for 200000 steps and achieved an accuracy of 91%.
I set weight decay = 0. I think using one of dropout or weight decay is enough.
@wangwei7175878 Wow, this is brilliant, a really huge step for us. Thank you for your effort and computing resources. Is there any result that used the default weight_decay? And can you share the full log as a file?
How did you get the original corpus? I tried very hard to get it, but I failed, even after emailing the authors. If possible, can you share the original corpus so that I can test the real performance?
@wangwei7175878 Can you share your pre-trained model? I'm really looking forward to trying this out, but I don't have that kind of processing power.
Thank you for your efforts.
@codertimo The model can't converge with weight_decay = 0.01. My dataset is not exactly the original corpus, but I think it is almost the same. Wiki data can easily be downloaded from https://dumps.wikimedia.org/enwiki/ and you need a web spider to get bookscorpus from https://www.smashwords.com/
@briandw My pre-trained model failed on downstream tasks (the fine-tuned model can't converge). I will share the pre-trained model once it works.
@codertimo Here is the whole log. It took me almost one week to train about 250000 steps. The accuracy seems to be stuck at 91%, whereas the original paper reports 98%. log_run2_hhh_all_data_next_weight_1_no_decay.txt
@wangwei7175878 Can you share your crawling and preprocessing code for the corpus above? Or, if possible, can you share the full corpus via a shared drive (Dropbox, Google Drive, etc.)? This would be really helpful to us.
@wangwei7175878 Very interesting. The authors said 0.01 weight decay is the default parameter they used. What are your parameter settings? Are they the same as our code's defaults except for weight_decay?
Hi there, I believe I found why the model can't converge with weight_decay = 0.01. Following OpenAI's code here: I think BERT used AdamW instead of Adam. After rewriting the Adam code in PyTorch accordingly, my model now converges with the default settings.
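For readers unfamiliar with the distinction: Adam with an L2 penalty adds weight_decay * w to the gradient, so the decay passes through Adam's adaptive scaling, while AdamW shrinks the weight directly. The sketch below is a scalar, plain-Python illustration of a single first step under that textbook description. The function `first_step` is hypothetical, and this is neither the OpenAI code referred to above nor the implementation used in this thread.

```python
import math

def first_step(w, grad, lr, weight_decay, decoupled,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step from zero optimizer state on a single scalar weight."""
    if decoupled:
        w = w * (1.0 - lr * weight_decay)  # AdamW: shrink the weight directly
    else:
        grad = grad + weight_decay * w     # Adam + L2: fold decay into the gradient
    m = (1.0 - beta1) * grad
    v = (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1)              # bias correction at step 1
    v_hat = v / (1.0 - beta2)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

# With a zero loss gradient, only the decay term moves the weight.
adam_l2 = first_step(2.0, 0.0, lr=0.1, weight_decay=0.01, decoupled=False)
adamw = first_step(2.0, 0.0, lr=0.1, weight_decay=0.01, decoupled=True)
print(adam_l2, adamw)
```

In this toy case the L2 variant moves the weight by almost a full learning-rate step (2.0 to about 1.9) regardless of how small weight_decay is, while the decoupled variant shrinks it by the factor 1 - lr * weight_decay (2.0 to 1.998). That difference in effective regularization strength is one plausible reason the same weight_decay = 0.01 behaves so differently under the two optimizers.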
@wangwei7175878 Sounds great! Can you make a pull request with your AdamW implementation? I'll test it on my corpus too 👍
On my corpus, after three epochs, the accuracy is 73.54%. I set weight_decay = 0; the other parameters are the defaults. Training continues.
Just for your reference: I also confirmed the accuracy increase after following @Kosuke-Szk's suggestion.
Though the model was shrunk to a really small one because of the memory limit (< 12 GB), it still worked. The hyperparameters were:
hidden=240 #768
layers=3 #12
attn_heads=3 #12
seq_len=30 # 60
batch_size=8 #32
epochs=10
num_workers=4#5
with_cuda=True
log_freq=20
corpus_lines=None
lr=1e-3
adam_weight_decay=0.00
adam_beta1=0.9
adam_beta2=0.999
dropout=0.0
min_freq=20 #7
I used 13 GB of English Wikipedia corpus with a vocabulary size of 775k. But I stopped the job at just 2% progress through the first epoch, because the estimate said it would take thousands of hours.
@wangwei7175878 Could you share your machine, OS, and GPU configuration? Thanks.
I've also built the wiki + bookcorpus dataset, and will publish docs to help others reconstruct it.
I tried running the code on a small dataset and found that pred_loss decreases quickly while avg_acc stays at 50%. That seems strange to me, since a decrease in pred_loss should indicate an increase in accuracy.