dhlee347 / pytorchic-bert

Pytorch Implementation of Google BERT
Apache License 2.0

Can you give me some details about files? #6

Closed: hufflepoohpooh closed this issue 5 years ago

hufflepoohpooh commented 5 years ago

Thank you for your great code. I'm a student and a beginner in data analysis. I want to run your code, but I have some questions. It may be a silly question, but can you give me some details about the files used in this command?


python pretrain.py \
    --train_cfg config/pretrain.json \
    --model_cfg config/bert_base.json \
    --data_file $DATA_FILE \
    --vocab $BERT_PRETRAIN/vocab.txt \
    --save_dir $SAVE_DIR \
    --max_len 512 \
    --max_pred 20 \
    --mask_prob 0.15

  1. config/pretrain.json
  2. config/bert_base.json
  3. $DATA_FILE
  4. $BERT_PRETRAIN/vocab.txt

We need $DATA_FILE as the training set, but what is vocab.txt? I can get a vocab.txt file from Google's GitHub repository. Should I just use it as-is, or can I customize it? (I want to build a BERT with fewer parameters than BERT-Base.) Also, is the output file model_steps_xxxx.pt compatible with the BERT code in Google's GitHub repository?

Sorry, I am not an expert, so my questions may be silly. Thank you.

dhlee347 commented 5 years ago

Sorry for the late response.

  1. You can use vocab.txt from Google BERT's repo. It's risky to modify vocab.txt because its tokens were learned from a corpus.
  2. The output file is not compatible with Google's code.
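
To illustrate the first point, here is a minimal sketch, assuming vocab.txt follows the one-token-per-line layout of Google's BERT release, where a token's ID is its zero-based line number. The `load_vocab` helper and the seven-token vocabulary are hypothetical stand-ins for illustration, not code or data from this repo. It shows that deleting a single line shifts the ID of every token after it, which is one reason hand-editing the file is risky.

```python
import io


def load_vocab(lines):
    """Map each token to its zero-based line index (hypothetical helper)."""
    vocab = {}
    for index, line in enumerate(lines):
        token = line.rstrip("\n")
        vocab[token] = index
    return vocab


# Toy stand-in for vocab.txt; the real file is far longer.
original = io.StringIO("[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\nhello\nworld\n")
vocab = load_vocab(original)

# The same file with the "hello" line deleted.
edited = io.StringIO("[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\nworld\n")
edited_vocab = load_vocab(edited)

# "world" moves from ID 6 to ID 5 after the deletion.
print(vocab["world"], edited_vocab["world"])  # prints: 6 5
```

If the vocabulary is changed anyway, the same edited file would have to be used consistently for both pretraining and any later fine-tuning, since the IDs stored in a trained model only make sense relative to the file it was trained with.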