UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

bert word embedding model from a local bin file #155

Open aaa29 opened 4 years ago

aaa29 commented 4 years ago

When I try to load the model from a local file using this instruction:

word_embedding_model = models.BERT('models/bert-base-uncased-pytorch_model.bin')

I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

nreimers commented 4 years ago

Hi, models.BERT must point to a folder that contains the files (the BERT model as well as the tokenizer information) stored by huggingface transformers. You cannot load a TensorFlow BERT model directly; it must first be converted to PyTorch. Then you must add the right config + tokenizer files.

Have a look at the zips here: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/

They contain a folder 0_BERT. Your folder must have the same files.

Best Nils Reimers
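
As an illustration, here is a minimal sketch (not from the thread) of how such a folder can be built with huggingface transformers and then pointed to with models.BERT; the folder name models/bert-base-uncased is just a placeholder:

# Build a folder containing both the BERT weights and the tokenizer files
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model.save_pretrained('models/bert-base-uncased')      # writes pytorch_model.bin + config.json
tokenizer.save_pretrained('models/bert-base-uncased')  # writes vocab.txt + tokenizer config

# Point models.BERT at the folder, not at the .bin file inside it
from sentence_transformers import models
word_embedding_model = models.BERT('models/bert-base-uncased')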

ayanyuegupta commented 4 years ago

Hi,

I'm having a similar problem with an ALBERT model I fine-tuned.

I fine-tuned it using PyTorch and huggingface and saved the model using torch.save. When I try to load it with models.ALBERT I get the same UnicodeDecodeError.

I looked at the zip file you linked, but I'm not sure how to produce those config files for an ALBERT model fine-tuned on a custom corpus (see the sketch after the model details below).

Here are details of my fine-tuned model, if it helps:

  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(30000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm(torch.Size([128]), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0)
                (dense): Linear(in_features=768, out_features=768, bias=True)
                (LayerNorm): LayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
              )
              (ffn): Linear(in_features=768, out_features=3072, bias=True)
              (ffn_output): Linear(in_features=3072, out_features=768, bias=True)
            )
          )
        )
      )
    )
    (pooler): Linear(in_features=768, out_features=768, bias=True)
    (pooler_activation): Tanh()
  )
  (predictions): AlbertMLMHead(
    (LayerNorm): LayerNorm(torch.Size([128]), eps=1e-05, elementwise_affine=True)
    (dense): Linear(in_features=768, out_features=128, bias=True)
    (decoder): Linear(in_features=128, out_features=30000, bias=True)
  )
)
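
A minimal sketch (not from the thread) of one way to produce those config and tokenizer files, assuming the fine-tuned model and tokenizer are ordinary huggingface transformers objects; save_pretrained writes the config and tokenizer files next to the weights, which is essentially where the follow-up below ends up:

# Assumption: `model` is the fine-tuned ALBERT (e.g. an AlbertForMaskedLM) and
# `tokenizer` is the AlbertTokenizer used for fine-tuning; the folder name is a placeholder.
output_dir = 'albert-finetuned'
model.save_pretrained(output_dir)      # writes pytorch_model.bin + config.json
tokenizer.save_pretrained(output_dir)  # writes the SentencePiece model + tokenizer config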
ayanyuegupta commented 4 years ago

FINAL EDIT:

All of the above problems go away if you save all the required files of your model (assuming you're using PyTorch huggingface):

torch.save(model.state_dict(), PATH + '/pytorch_model.bin')  # weights must live inside the folder, named pytorch_model.bin
config = model.config
config.save_pretrained(PATH)       # writes config.json
tokenizer.save_vocabulary(PATH)    # writes the vocabulary file
tokenizer.save_pretrained(PATH)    # writes the tokenizer config (also re-saves the vocabulary)
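
For completeness, a hedged sketch of loading the saved folder back into sentence-transformers, using the models.ALBERT wrapper mentioned above; PATH is the folder written by the snippet above, and mean pooling is just an example choice:

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.ALBERT(PATH)  # PATH must be the folder, not a .bin file
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

embeddings = model.encode(['This is an example sentence.'])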
Akshayextreme commented 4 years ago

Hi @goggoloid, what accuracy did you get with the ALBERT model? Did you train on ALBERT v1 or v2?