MeRajat / SolvingAlmostAnythingWithBert

BioBert Pytorch

Unable to load 'weights/pytorch_weight' #4

dkarmon closed this issue 5 years ago

dkarmon commented 5 years ago

Hi @MeRajat, first, thanks for the great repository. I followed the Preparation instructions in the README file and converted the BioBERT weight file (specifically, the pre-trained weight of BioBERT (Wiki+Books+PubMed+PMC)) to be PyTorch-compatible using the script.

I keep getting the following error while trying to train the model on any valid dataset:

Traceback (most recent call last):
  File "new_train.py", line 13, in <module>
    tmp_d = torch.load(parameters.BERT_CONFIG_FILE, map_location='cpu')
  File "/data/anaconda/envs/py35/lib/python3.6/site-packages/torch/serialization.py", line 367, in load
    return _load(f, map_location, pickle_module)
  File "/data/anaconda/envs/py35/lib/python3.6/site-packages/torch/serialization.py", line 528, in _load
    magic_number = pickle_module.load(f)
_pickle.UnpicklingError: invalid load key, '{'.

It seems that neither version of the converted weight file is serialized correctly. Please advise.

MeRajat commented 5 years ago

Sure, @dkarmon, let me look into that and get back to you.

MeRajat commented 5 years ago

Hi @dkarmon, please ignore the single method in the convert_tf_checkpoint_to_pytorch notebook.

I was able to convert these weights successfully for PyTorch. It seems to be an issue with writing the weights to file.

dkarmon commented 5 years ago

@MeRajat, apologies for that. I continued investigating, and the problem seems to lie with the bert_config.json file rather than the weights file. Are you able to run the following command without any errors? tmp_d = torch.load(parameters.BERT_CONFIG_FILE, map_location='cpu')

dkarmon commented 5 years ago

I think I found the problem: tmp_d = torch.load(parameters.BERT_CONFIG_FILE, map_location='cpu') is supposed to load the converted weight file, not the config file. Just replace parameters.BERT_CONFIG_FILE with parameters.BERT_WEIGHTS and it should work.
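A minimal sketch of why the traceback ends in "invalid load key, '{'": the traceback shows torch.load passing the file to pickle_module.load, and a JSON config file begins with '{', which is not a valid pickle opcode. The example uses only the standard library's pickle and json modules rather than torch, and the config contents and the helper name unpickle_error are made up for illustration.

```python
import json
import pickle


def unpickle_error(raw: bytes) -> str:
    """Try to unpickle raw bytes; return the error message, or '' on success."""
    try:
        pickle.loads(raw)
    except pickle.UnpicklingError as exc:
        return str(exc)
    return ""


# Hypothetical stand-in for the contents of bert_config.json.
config_bytes = json.dumps({"hidden_size": 768}).encode("utf-8")

# Unpickling JSON text fails on the very first byte, '{'.
message = unpickle_error(config_bytes)
print(message)

# The same bytes parse without trouble as JSON, which is what a config file is.
config = json.loads(config_bytes)
print(config["hidden_size"])
```

Under this reading, the weights file is the only one that should go through torch.load, while the config file is read as JSON.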

Also, note that the label set in data_load.py is missing a few labeling options ('S-Chemical', 'S-Disease', 'E-Disease', 'E-Chemical'), which makes the new_train.py file fail:

class HParams:
    def __init__(self, vocab_type):
        self.VOCAB_DICT = {
            'bc5cdr': ('<PAD>', 'B-Chemical', 'O', 'B-Disease', 'I-Disease', 'I-Chemical', 'S-Chemical', 'S-Disease',
                       'E-Disease', 'E-Chemical'),
            'bionlp3g': ('<PAD>', 'B-Amino_acid', 'B-Anatomical_system', 'B-Cancer', 'B-Cell',
                         'B-Cellular_component', 'B-Developing_anatomical_structure', 'B-Gene_or_gene_product',
                         'B-Immaterial_anatomical_entity', 'B-Multi-tissue_structure', 'B-Organ', 'B-Organism',
                         'B-Organism_subdivision', 'B-Organism_substance', 'B-Pathological_formation',
                         'B-Simple_chemical', 'B-Tissue', 'I-Amino_acid', 'I-Anatomical_system', 'I-Cancer',
                         'I-Cell', 'I-Cellular_component', 'I-Developing_anatomical_structure',
                         'I-Gene_or_gene_product',
                         'I-Immaterial_anatomical_entity', 'I-Multi-tissue_structure', 'I-Organ', 'I-Organism',
                         'I-Organism_subdivision', 'I-Organism_substance', 'I-Pathological_formation',
                         'I-Simple_chemical',
                         'I-Tissue', 'O')
        }

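One way to catch a gap like this before training is to compare the tags that occur in the data against the label set. The sketch below is only an illustration: missing_labels is a hypothetical helper, old_vocab is inferred by removing the four tags listed above from the 'bc5cdr' tuple, and the observed tags are invented sample data rather than taken from the real dataset.

```python
from typing import Iterable, Set, Tuple


def missing_labels(vocab: Tuple[str, ...], tags: Iterable[str]) -> Set[str]:
    """Return the tags that occur in the data but are absent from the label set."""
    return set(tags) - set(vocab)


# Assumed earlier 'bc5cdr' label set, without the S-/E- tags.
old_vocab = ('<PAD>', 'B-Chemical', 'O', 'B-Disease', 'I-Disease', 'I-Chemical')

# Invented sample of tags as they might appear in a tagged file.
observed = ['O', 'B-Disease', 'E-Disease', 'S-Chemical', 'O']

gaps = missing_labels(old_vocab, observed)
print(sorted(gaps))  # ['E-Disease', 'S-Chemical']
```

An empty result would mean every tag in the data has an entry in the label set.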
MeRajat commented 5 years ago

@dkarmon, in my case I didn't use the E-Disease tags, which is why they are missing from there. 👍