facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License
30.38k stars 6.4k forks

Where to get data2vec 2.0 token dictionary? #5037

Closed DanielFLevine closed 1 year ago

DanielFLevine commented 1 year ago

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

Is the token dictionary for the data2vec 2.0 text model available anywhere? The 'data' field in the checkpoint's 'task' dictionary points to '/fsx-wav2vec/abaevski/data/nlp/bookwiki_aml-full-mmap2-bin', and I'm unable to load the checkpoint without the dict.txt file. Is the dictionary identical to the RoBERTa 50k BPE dictionary, or is yours different because the model was trained only on BooksCorpus and English Wikipedia? Any help here is appreciated, thanks! @alexeib

Code

import fairseq.checkpoint_utils
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([path_to_d2v2_text_cp], strict=False)  # returns (list of models, config, task)

FileNotFoundError: [Errno 2] No such file or directory: '/fsx-wav2vec/abaevski/data/nlp/bookwiki_aml-full-mmap2-bin/dict.txt'

What have you tried?

I read the README and looked through the checkpoint for an embedded dictionary, but did not find one.

What's your environment?

wnhsu commented 1 year ago

Fixed in #5045
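
For anyone who hits the same FileNotFoundError with a checkpoint that still embeds the original data path, one possible workaround is to point the task's 'data' field at a local directory that contains a dict.txt. This is a minimal sketch and not necessarily what #5045 changed. It assumes that `arg_overrides={"data": ...}` is applied to the task config when the checkpoint is loaded, and that you already have a dict.txt compatible with the checkpoint (this thread does not confirm which dictionary that is). The helper names `build_data_override` and `load_with_local_dict` are hypothetical.

```python
import os


def build_data_override(local_data_dir, dict_name="dict.txt"):
    """Return an override dict pointing the task's 'data' field at local_data_dir.

    Fails early if the dictionary file is missing, so the problem surfaces
    before the checkpoint is loaded. (Hypothetical helper, for illustration.)
    """
    dict_path = os.path.join(local_data_dir, dict_name)
    if not os.path.isfile(dict_path):
        raise FileNotFoundError(f"expected a dictionary at {dict_path}")
    return {"data": local_data_dir}


def load_with_local_dict(checkpoint_path, local_data_dir):
    """Load the checkpoint with the data path overridden.

    Assumption: the 'data' key in arg_overrides replaces the path stored
    in the checkpoint's task config.
    """
    # Imported lazily so the helper above can be used without fairseq installed.
    from fairseq import checkpoint_utils

    return checkpoint_utils.load_model_ensemble_and_task(
        [checkpoint_path],
        arg_overrides=build_data_override(local_data_dir),
        strict=False,
    )
```

Usage would look like `models, cfg, task = load_with_local_dict(path_to_d2v2_text_cp, "/path/to/local/data")`, where the local directory is a placeholder for wherever your dict.txt lives.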