IronBeliever / CaR

Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation

Error when running split_IQS.py #2

Closed: Chloe-YibaiLiu closed this issue 2 months ago

Chloe-YibaiLiu commented 6 months ago

I downloaded the checkpoints from HuggingFace and put them in the correct folders, but I get an error saying the tokenizer for "xlm-roberta-large" could not be loaded. Are any other model files needed to run this command?

IronBeliever commented 5 months ago

Thanks for your interest!

We reproduced the whole process and did not encounter this issue. Could you provide more specific error logs? You may also need to make sure your device is connected to the internet.
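As a quick diagnostic (a sketch, not part of CaR), trying to fetch the tokenizer directly from the Hub will surface the underlying error:

```python
# Minimal reproduction: attempt to download the xlm-roberta-large tokenizer.
# If this raises an OSError / connection error, the machine cannot reach
# huggingface.co and the offline workaround described below applies.
from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")
print("Tokenizer loaded successfully")
```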

AlienKevin commented 1 month ago

@Chloe-YibaiLiu I encountered the same error. It's likely that you are not connected to the internet or that HuggingFace is blocked in your region. If the connection cannot easily be established, the workaround is to manually download the tokenizer and config.json of XLMRoberta. Note that you do not need to download the large model weights, as CaR seems to need only the tokenizer and config files.

Do the following steps in an environment that can connect to HuggingFace.

1. Download the tokenizer into a folder called `xlm-roberta-large`. Use the same version of `transformers` as CaR.

   ```python
   from transformers import XLMRobertaTokenizerFast

   # Specify the directory where you want to save the tokenizer files
   save_directory = "xlm-roberta-large"

   # Download and save only the tokenizer
   tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")
   tokenizer.save_pretrained(save_directory)
   ```

2. Download `config.json` (note: use the `resolve` URL so `wget` fetches the raw JSON file rather than the HTML page):

   ```sh
   cd xlm-roberta-large
   wget https://huggingface.co/FacebookAI/xlm-roberta-large/resolve/main/config.json
   cd ..
   ```

3. Lastly, move the entire `xlm-roberta-large` folder to the root of your CaR project.
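
To verify the workaround (a minimal check, assuming the folder layout above), loading the tokenizer from the local directory should now succeed without any network access:

```python
# Load from the local folder rather than the Hub; no download should occur.
from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained("./xlm-roberta-large")
print(tokenizer.tokenize("sanity check"))
```
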
AlienKevin commented 1 month ago

If you encounter an error when running the above script to download the tokenizer, it is probably because CaR pins a fairly old version of `transformers`. You can do the following to upgrade CaR's environment to be compatible with `transformers==4.44.1`:

[Screenshot of the upgrade steps, posted 2024-08-23]
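
The screenshot is not reproduced here; as a rough sketch (assuming a pip-managed environment, and noting that the exact steps in the screenshot may differ, e.g. if other CaR dependencies pin `transformers`), the upgrade amounts to:

```sh
# Upgrade transformers inside the CaR environment (hypothetical command;
# resolve any dependency-pin conflicts that pip reports afterwards).
pip install --upgrade transformers==4.44.1
```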