dd-center / Bert_Danmaku_Embedding

Pre-trained BERT model for generating Danmaku embeddings

Confirm how to use the retrained BERT #1

Open bayou3 opened 5 years ago

bayou3 commented 5 years ago

Hello, and thank you for sharing this! Following your tutorial, I ran create_pretraining_data.py to produce a tf_example.tfrecord file and then ran run_pretraining.py. After these two steps I get several files, of which I believe three matter for later use: model.ckpt-10000.data-00000-of-00001, model.ckpt-10000.index, and model.ckpt-10000.meta ("10000" because I trained with num_train_steps=10000). (My training continues from the original BERT model, not from scratch.)
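
The three files above are addressed together through their common prefix rather than individually. As a quick sanity check (a minimal sketch; the gs://xxx path and step number are placeholders for your own setup), you can list the variables stored in the checkpoint with TensorFlow 1.x, which the original BERT scripts use:

```python
import tensorflow as tf  # TF 1.x, as used by the google-research/bert scripts

# Placeholder checkpoint prefix: point this at your own bucket and step number.
ckpt_prefix = "gs://xxx/model.ckpt-10000"

# List every variable name and shape stored in the checkpoint; BERT layer
# names such as "bert/encoder/layer_0/..." should appear if run_pretraining.py
# produced a usable checkpoint.
for name, shape in tf.train.list_variables(ckpt_prefix):
    print(name, shape)
```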

My question is: when I want to use this retrained BERT model (say I keep it in a directory gs://xxx), I should set INIT_CHECKPOINT to gs://xxx in my code. My code also requires bert_config.json, so I use the one provided by Google, namely "gs://cloud-tpu-checkpoints/bert/uncased_L-24_H-1024_A-16/bert_config.json". Is this correct? Also, what value is appropriate for num_train_steps? (My input txt file contains 42,854 articles; with one sentence per line and blank lines separating the articles, it has 524,966 lines.) I ask because when I use the retrained BERT, my experimental results show no improvement.
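
For num_train_steps, one common rule of thumb is steps ≈ epochs × number_of_training_instances ÷ batch_size. The sketch below is only a back-of-the-envelope estimate under assumed values (batch size 32, roughly one instance per input line, three epochs); the real instance count depends on max_seq_length, dupe_factor, and how create_pretraining_data.py packs sentences:

```python
# Rough heuristic only: actual instance counts differ once sentences are
# packed into max_seq_length sequences and duplicated by dupe_factor.
num_instances = 524_966      # assumed: roughly one instance per input line
train_batch_size = 32        # assumed batch size
target_epochs = 3            # assumed number of passes over the data

num_train_steps = num_instances * target_epochs // train_batch_size
print(num_train_steps)       # ~49,000 steps, noticeably more than 10,000
```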

pren1 commented 5 years ago

Hello, do you speak Chinese? If so, please join our QQ group: 663606562, and maybe we can figure out where the problem is together. If not, I would suggest using t-SNE and PCA to map the 768-dimensional embedding vectors into 2-dimensional space and checking whether the result makes sense (just as I did in Danmaku_similarity.py). Also, have you looked at this tutorial: https://github.com/pren1/A_Pipeline_Of_Pretraining_Bert_On_Google_TPU.git? It might help. I don't have time to go over your problem in detail right now, but I definitely will when I do. Please let me know if you have more questions :D
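
For reference, here is a minimal sketch of the PCA + t-SNE sanity check described above. It is not the code from Danmaku_similarity.py; the embeddings array is placeholder data standing in for the vectors your BERT model produces:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder (N, 768) embedding matrix; replace with your real vectors.
embeddings = np.random.rand(200, 768).astype(np.float32)

# PCA to a moderate dimensionality first, then t-SNE down to 2-D for plotting.
reduced = PCA(n_components=50).fit_transform(embeddings)
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(reduced)

plt.scatter(points[:, 0], points[:, 1], s=5)
plt.title("2-D projection of danmaku embeddings")
plt.show()
```

If nearby points correspond to semantically similar danmaku, the retrained embeddings are at least behaving sensibly.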