cwszz / XPR

Cross-lingual Phrase Retriever
MIT License

Cannot load model from the Hugging Face checkpoint #8

Open karynaur opened 8 months ago

karynaur commented 8 months ago

Hey, I am unable to load the model from the huggingface checkpoint. Here is the code and the error:

```python
from DictMatching.moco import MoCo
from utilsWord.test_args import getArgs
from transformers import AutoConfig, AutoTokenizer
import torch

args = getArgs()

tokenizer = AutoTokenizer.from_pretrained('cwszz/XPR')
config = AutoConfig.from_pretrained("cwszz/XPR")
model = MoCo(config=config, args=args, K=0, T=0.06)
model.load_state_dict(torch.load('model/pytorch_model.bin'))
```

The error I'm getting:

```
Traceback (most recent call last):
  File "analysis.py", line 19, in <module>
    model.load_state_dict(torch.load('model/pytorch_model.bin'))
  File "/home/adityas/.local/lib/python3.8/site-packages/torch/serialization.py", line 1026, in load
    return _load(opened_zipfile,
  File "/home/adityas/.local/lib/python3.8/site-packages/torch/serialization.py", line 1438, in _load
    result = unpickler.load()
  File "/home/adityas/.local/lib/python3.8/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 198, in __setstate__
    self.sp_model.LoadFromSerializedProto(self.sp_model_proto)
AttributeError: 'XLMRobertaTokenizer' object has no attribute 'sp_model_proto'
```

@cwszz can you help me with this?

cwszz commented 8 months ago

@karynaur Our model cannot be loaded in this manner. Please refer to predict.py for specific loading instructions. In Hugging Face, only the parameters (.bin file) are saved, while the tokenizer and config need to be loaded using from_pretrained('xlm-roberta-base').
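
For reference, here is a minimal sketch of the loading path this comment describes. It is not the exact contents of predict.py; it assumes the `MoCo(config, args, K, T)` constructor from the snippet above and a locally downloaded `pytorch_model.bin`:

```python
# Minimal sketch, assuming the MoCo constructor shown earlier in this issue.
# See the repo's predict.py for the authoritative loading code.
from DictMatching.moco import MoCo
from utilsWord.test_args import getArgs
from transformers import AutoConfig, AutoTokenizer
import torch

args = getArgs()

# Tokenizer and config come from the base XLM-R checkpoint, not from cwszz/XPR.
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
config = AutoConfig.from_pretrained('xlm-roberta-base')

# The Hub repo ships the fine-tuned parameters in pytorch_model.bin.
model = MoCo(config=config, args=args, K=0, T=0.06)
checkpoint = torch.load('model/pytorch_model.bin', map_location='cpu')
# Depending on how the file was saved, it may be a plain state_dict or a
# fully pickled model object; handle both to stay on the safe side.
state_dict = checkpoint if isinstance(checkpoint, dict) else checkpoint.state_dict()
model.load_state_dict(state_dict)
model.eval()
```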

karynaur commented 8 months ago

Downloading the model and running the predict.py file gives the same error @cwszz


```
Traceback (most recent call last):
  File "predict.py", line 115, in <module>
    model = torch.load("model/pytorch_model.bin")
  File "/home/adityas/.local/lib/python3.8/site-packages/torch/serialization.py", line 1026, in load
    return _load(opened_zipfile,
  File "/home/adityas/.local/lib/python3.8/site-packages/torch/serialization.py", line 1438, in _load
    result = unpickler.load()
  File "/home/adityas/.local/lib/python3.8/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 198, in __setstate__
    self.sp_model.LoadFromSerializedProto(self.sp_model_proto)
AttributeError: 'XLMRobertaTokenizer' object has no attribute 'sp_model_proto'
```

cwszz commented 8 months ago

@karynaur It seems you're not using our predict.py; there isn't a line 115 in it. Also, please check whether your tokenizer is loaded directly with from_pretrained('xlm-roberta-base').

karynaur commented 8 months ago

Thanks for pointing that out @cwszz. I did make a few modifications to the file, but even running predict.py in a fresh Colab environment gives me the same error:


```
Loading tsv from /content/drive/MyDrive/Honours Project/code/XPR/data/sentences/en-ro-phrase-sentences.32.tsv ...
Loading tsv from /content/drive/MyDrive/Honours Project/code/XPR/data/sentences/ro-phrase-sentences.32.tsv ...
[!] collect 67 samples
没找到共0
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Traceback (most recent call last):
  File "/content/XPR/predict.py", line 97, in <module>
    model = torch.load(args.load_model_path)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 198, in __setstate__
    self.sp_model.LoadFromSerializedProto(self.sp_model_proto)
AttributeError: 'XLMRobertaTokenizer' object has no attribute 'sp_model_proto'
```

cwszz commented 8 months ago

@karynaur It seems that the tokenizer is being loaded during the load process. My current speculation is that this is a transformers version issue. Could you please check why the tokenizer's init method is being invoked during the load process?
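
One hedged reading of the tracebacks in this thread: the checkpoint is loaded with `torch.load`, and the pickled file appears to contain an `XLMRobertaTokenizer` object, so unpickling invokes the tokenizer's `__setstate__`; that call fails when the installed transformers version stores the SentencePiece model differently than the version used to save the checkpoint. A first step is simply comparing the installed versions:

```python
# Quick environment check (illustrative). A mismatch between the transformers
# version used to save the checkpoint and the one installed at load time would
# explain the failing __setstate__ call in the tracebacks above.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
```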

karynaur commented 8 months ago

Can you let me know which transformers and torch versions were used when the model was trained? I'll try downgrading the transformers version and recheck it @cwszz

cwszz commented 8 months ago

@karynaur You can try version 4.17.0 first. Additionally, the problem lies with the loading of the tokenizer. If you could investigate this issue, changing the version might not be necessary.
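
A hedged way to act on this suggestion: pin transformers to 4.17.0 (for example, `pip install transformers==4.17.0` in the Colab environment) and retry the load that predict.py performs. The path below is illustrative, and the script should be run from the XPR repo root so the pickled model's classes are importable:

```python
# Retry sketch after pinning transformers to 4.17.0 (the version suggested above).
# With a compatible transformers install, the tokenizer pickled inside the
# checkpoint should deserialize and torch.load should return the full model.
import transformers
assert transformers.__version__.startswith("4.17"), transformers.__version__

import torch

model = torch.load("model/pytorch_model.bin", map_location="cpu")
model.eval()
```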

karynaur commented 8 months ago

Gotcha. I'll update after I find a fix, and if it's helpful I'll send a PR.