mixed_large_24_model.bin shows bad results when converted into TensorFlow version

nlp4869 commented 4 years ago

Hi, thank you for providing mixed_large_24_model.bin. I tried to convert this model with scripts/convert_bert_from_uer_to_google.py, and the conversion finished without any warnings. However, when I fine-tuned the converted model on a simple binary text classification task, the results were random (i.e. acc=50%). I used Google's vocabulary and the BERT-large config file and did not change any other settings. I suspect the conversion was not successful and that this causes the poor performance. Any ideas?

ZhaoxinRuc commented 4 years ago

Maybe you didn't specify the correct parameters. You should add --layers_num 24 to the command, because the default value of layers_num is 12. If you have any questions, please tell me.
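
One way to rule out a layer-count mismatch before converting is to count the distinct encoder layers in the checkpoint's parameter names and compare that number with the value passed to --layers_num. The following is only a minimal sketch: the key pattern `encoder.transformer.<N>.` and the sample names are assumptions made for illustration, so the regular expression should be adjusted to whatever the keys of the loaded state dict actually look like.

```python
import re


def count_layers(param_names, pattern=r"encoder\.transformer\.(\d+)\."):
    """Return the number of distinct layer indices found in parameter names."""
    indices = set()
    for name in param_names:
        match = re.search(pattern, name)
        if match:
            indices.add(int(match.group(1)))
    return len(indices)


# Hypothetical names standing in for the keys of a loaded state dict.
example_names = [
    "embedding.word_embedding.weight",
    "encoder.transformer.0.self_attn.final_linear.weight",
    "encoder.transformer.0.feed_forward.linear_1.weight",
    "encoder.transformer.1.self_attn.final_linear.weight",
]
print(count_layers(example_names))
```

If the count for the real checkpoint is 24 while the script runs with its default of 12, half of the layers would never be written to the TensorFlow checkpoint.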

nlp4869 commented 4 years ago

Hi @ZhaoxinRuc, here is the command that I used to convert the model:

python convert_bert_from_uer_to_google.py --layers_num=24 --input_model_path=./mixed_large_24_model.bin --output_model_path=./output/model.ckpt

Also, as I am using a CPU-only machine, I added map_location='cpu' to the torch.load call: https://github.com/dbiir/UER-py/blob/master/scripts/convert_bert_from_uer_to_google.py#L21 I am not sure whether this could result in a bad conversion. I am using tensorflow==1.14.0 and Python 3.7.3. Thank you for your help.

nlp4869 commented 4 years ago

Update: Using a GPU with the original script (without map_location='cpu') under TensorFlow 1.13.1 and Python 3 still gives random results (tried with several classification tasks).

zhezhaoa commented 4 years ago

We have tested the BERT-large model with Google's TensorFlow BERT and it performs normally. Could you give us more details, such as the vocabulary (do you use the Chinese version?) and the config file? Here are our commands:

python .\scripts\convert_bert_from_uer_to_google.py --layers_num 24 --input_model_path .\models\mixed_large_24_model.bin --output_model_path .\models\google_model_24.ckpt

TensorFlow:

python run_classifier.py --task_name=BKRV --do_train=true --do_eval=true --data_dir=./datasets/book_review --vocab_file=./model/google_zh_vocab.txt --bert_config_file=./model/bert_config.json --init_checkpoint=./output_new/google_model_24.ckpt --max_seq_length=128 --train_batch_size=4 --learning_rate=2e-5 --num_train_epochs=3.0 --output_dir=/run/user/1010/tmp

Config file is as follows:

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "type_vocab_size": 2,
  "vocab_size": 21128
}
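
A config like this can be sanity-checked against the vocabulary size and the layer count used for conversion before any fine-tuning run. The sketch below is not part of UER-py or google-research/bert; the helper name `check_bert_config` and its arguments are hypothetical, and it only covers three simple consistency checks.

```python
import json


def check_bert_config(config, vocab_entries, layers_num):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if config["hidden_size"] % config["num_attention_heads"] != 0:
        problems.append("hidden_size is not divisible by num_attention_heads")
    if config["vocab_size"] != vocab_entries:
        problems.append(
            "vocab_size is %d but the vocabulary has %d entries"
            % (config["vocab_size"], vocab_entries))
    if config["num_hidden_layers"] != layers_num:
        problems.append(
            "num_hidden_layers is %d but the model was converted with %d layers"
            % (config["num_hidden_layers"], layers_num))
    return problems


# The config quoted in this thread.
config = json.loads("""
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "type_vocab_size": 2,
  "vocab_size": 21128
}
""")
print(check_bert_config(config, vocab_entries=21128, layers_num=24))
```

In practice `vocab_entries` would be the number of lines in the vocabulary file passed to --vocab_file.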

BKRV is a Chinese dataset in which the label and the text are separated by a tab (\t). Perhaps we can communicate in private. QQ number: 1152543959

nlp4869 commented 4 years ago

I used the original Chinese vocabulary (21,128 entries) and the config file officially provided by Google. I have been using the original roberta-wwm-ext-large and get an accuracy of ~95.7 on the ChnSentiCorp test set. But when I change ONLY the weight file to your model, the performance becomes bad (almost random output). I checked the logs and found that the weights are correctly initialized, so I think the problem must be in the conversion process. I noticed that you've updated the conversion script, so I converted your model to the TensorFlow version again, but it didn't help.
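
To narrow down a swap like this, the variable names and shapes of the working checkpoint can be compared with those of the converted one. The sketch below works on plain {name: shape} dictionaries; with TensorFlow, such pairs can probably be collected through `tf.train.list_variables(checkpoint_path)`, but that step is left out here and the shapes shown are illustrative, not taken from either checkpoint.

```python
def diff_variables(reference, candidate):
    """Compare two {variable_name: shape} mappings.

    Returns (missing, extra, mismatched): names only in reference, names only
    in candidate, and names present in both with different shapes.
    """
    missing = sorted(set(reference) - set(candidate))
    extra = sorted(set(candidate) - set(reference))
    mismatched = sorted(
        name for name in set(reference) & set(candidate)
        if tuple(reference[name]) != tuple(candidate[name]))
    return missing, extra, mismatched


# Illustrative shapes only; real values would come from the two checkpoints.
reference = {
    "bert/embeddings/word_embeddings": [21128, 1024],
    "bert/pooler/dense/kernel": [1024, 1024],
    "bert/pooler/dense/bias": [1024],
}
candidate = {
    "bert/embeddings/word_embeddings": [21128, 1024],
    "bert/pooler/dense/kernel": [1024, 1024],
}
print(diff_variables(reference, candidate))
```

Matching names and shapes do not prove that the values are right (a transposed square matrix would pass, for example), so an empty diff only rules out the simplest failure modes.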

Here are the generated files:

ZhaoxinRuc commented 4 years ago

Did you change the run_classifier.py file in google-research/bert as follows:

class BkrvProcessor(DataProcessor):
  """Processor for the BKRV (book review) data set."""

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    """See base class."""
    return ["0", "1"]

  def _create_examples(self, lines, set_type):
    """Creates examples for the training, dev and test sets."""
    examples = []
    for (i, line) in enumerate(lines):
      # Skip the header row.
      if i == 0:
        continue
      guid = "%s-%s" % (set_type, i)
      # Column 1 holds the text, column 0 holds the label.
      text_a = tokenization.convert_to_unicode(line[1])
      if set_type == "test":
        # Test examples get a placeholder label.
        label = "0"
      else:
        label = tokenization.convert_to_unicode(line[0])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)

  processors = {
      "cola": ColaProcessor,
      "mnli": MnliProcessor,
      "mrpc": MrpcProcessor,
      "xnli": XnliProcessor,
      "bkrv": BkrvProcessor,
  }
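
The processor above expects a header row followed by rows with the label in the first column and the text in the second, separated by a tab. A standalone sketch of that parsing, without TensorFlow, can help confirm that a data file has this layout; the function name `read_examples` and the sample rows are made up for illustration.

```python
import csv
import io


def read_examples(handle, is_test=False):
    """Parse label<TAB>text rows, skipping the header, into (label, text) pairs."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    examples = []
    for i, row in enumerate(reader):
        if i == 0:
            continue  # header row
        # Mirror the processor: test rows get the placeholder label "0".
        label = "0" if is_test else row[0]
        examples.append((label, row[1]))
    return examples


# Made-up rows in the assumed layout.
sample = io.StringIO("label\ttext_a\n1\tA good book.\n0\tNot worth reading.\n")
print(read_examples(sample))
```

If the columns in a file are swapped, the labels returned here will be sentences rather than "0" or "1", which is easy to spot before training.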

And this is my result on the book_review dataset:

INFO:tensorflow: Eval results
INFO:tensorflow: eval_accuracy = 0.8954
INFO:tensorflow: eval_loss = 0.5452643
INFO:tensorflow: global_step = 15000
INFO:tensorflow: loss = 0.5452643

ZhaoxinRuc commented 4 years ago

Can you give me the config file you used in the experiment? This is my result on the ChnSentiCorp dataset using google_model_24.ckpt converted from mixed_large_24_model.bin.

INFO:tensorflow: Eval results
INFO:tensorflow: eval_accuracy = 0.95
INFO:tensorflow: eval_loss = 0.29745975
INFO:tensorflow: global_step = 7200
INFO:tensorflow: loss = 0.29745975

nlp4869 commented 4 years ago

The config file is the same as the one @zhezhaoa provided:

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "type_vocab_size": 2,
  "vocab_size": 21128
}

I assumed your model is identical to roberta-wwm-ext-large, so I just swapped in your model (converted by your script) and DID NOT change any other configuration (such as the config file, vocabulary, hyper-parameters, etc.).

zhezhaoa commented 4 years ago

It seems that you use the correct vocabulary and config file. We downloaded the latest Google-BERT-tf and achieved normal results when loading our model. Perhaps we can communicate via email, since we don't think this is a general problem. 1152543959@qq.com

dbiir / UER-py

mixed_large_24_model.bin shows bad results when converted into TensorFlow version #37