Closed tobyperrett closed 1 year ago
Could this be because the tokenizer file is not downloaded?
Thanks for the quick reply! This is the contents of transformers_cache/deberta-v2-xlarge:
config.json pytorch_model.bin README.md spm.model tf_model.h5 tokenizer_config.json
I also have the MSRVTT vocab file downloaded at data/MSRVTT-QA/vocab.json (and I've tried vocab1000.json as well).
If these files are all properly downloaded (not corrupted), another thing to check is that your transformers library version matches the one used here. The error seems specific to loading the tokenizer.
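One way to catch a corrupted download (which turned out to be the cause here) is to compare checksums of the copied files against the originals. A minimal sketch, assuming the cache directory layout listed above (the directory path and file names are taken from this thread, not from the repo):

```python
import hashlib
import os

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large checkpoints don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Run this on both the source and destination machines and compare digests;
# any mismatch means the copy was corrupted in transfer.
cache_dir = "transformers_cache/deberta-v2-xlarge"  # adjust to your layout
if os.path.isdir(cache_dir):
    for name in ("spm.model", "pytorch_model.bin", "tokenizer_config.json"):
        print(name, sha256_of(os.path.join(cache_dir, name)))
```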
Some of them were corrupted - I downloaded on a different computer, copied them over and it works now. Thanks!
Hello! Could I ask where to download the vocab file (vocab.json/vocab1000.json)? Would it be possible to kindly provide a link to it? Thanks!
Hi, as stated in the downloading instructions of the readme, you can find vocab here: https://drive.google.com/drive/u/3/folders/1ED2VcFSxRW9aFIP2WdGDgLddNTyEVrE5.
Thanks! And thanks for this great work!
Hi. Thanks for providing code! I'm having the same issue as #3 on the VQA demo. I have the Microsoft deberta-v2-xlarge ( https://huggingface.co/microsoft/deberta-v2-xlarge ) downloaded from huggingface in a folder called transformers_cache. I've set the TRANSFORMERS_CACHE environment variable to point at it (if I remove this, it complains that deberta is missing, so I assume this part is correct). Do you have any idea why it might be failing?
The command I'm running is:
python demo_videoqa.py --combine_datasets msrvtt --combine_datasets_val msrvtt \
  --suffix="." --max_tokens=256 --ds_factor_ff=8 --ds_factor_attn=8 \
  --load=models/frozenbilm.pth --msrvtt_vocab_path=data/MSRVTT-QA/vocab.json \
  --question_example question --video_example test.mp4 --device='cpu'
And the error is:
Traceback (most recent call last):
  File "demo_videoqa.py", line 170, in <module>
    main(args)
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "demo_videoqa.py", line 32, in main
    tokenizer = get_tokenizer(args)
  File "/user/work/tp8961/FrozenBiLM/model/__init__.py", line 96, in get_tokenizer
    tokenizer = DebertaV2Tokenizer.from_pretrained(
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
    return cls._from_pretrained(
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1923, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/transformers/models/deberta_v2/tokenization_deberta_v2.py", line 145, in __init__
    self._tokenizer = SPMTokenizer(vocab_file, split_by_punct=split_by_punct, sp_model_kwargs=self.sp_model_kwargs)
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/transformers/models/deberta_v2/tokenization_deberta_v2.py", line 296, in __init__
    spm.load(vocab_file)
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(890) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
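That ParseFromArray failure means sentencepiece could not deserialize spm.model, which usually points to a truncated or corrupted file rather than a code bug. A quick, hypothetical pre-check before handing the file to the tokenizer (the size threshold is a rough guess, not an official number):

```python
import os

def spm_file_status(path, min_size=100_000):
    """Cheap sanity check for a copied spm.model: a file truncated by an
    interrupted transfer is usually far smaller than the full model
    (the deberta-v2-xlarge one is a few MB). Not a substitute for a
    checksum comparison, just a fast first pass."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) < min_size:
        return "suspiciously small"
    return "looks plausible"

# The definitive check is to let sentencepiece parse the file directly:
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor()
#   sp.Load("transformers_cache/deberta-v2-xlarge/spm.model")
# A corrupted file raises the same "Internal: ... ParseFromArray" RuntimeError
# seen in the traceback above.
```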