Hi,
I am loading this model via fairseq in Python, more or less copying the code from compute_score.py, lines 79-81. Here is what I am calling:

bart = BARTModel.from_pretrained(
    model,
    checkpoint_file=chkpt_path,
    bpe="sentencepiece",
    sentencepiece_model=f"{root}/BARTSmiles/chemical/tokenizer/chem.model",
)
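(For completeness, my understanding is that fairseq looks for dict.txt in the data directory that from_pretrained resolves; a minimal variant of the call above that makes that directory explicit via the data_name_or_path argument, with placeholder paths, would be:)

from fairseq.models.bart import BARTModel

# Placeholder paths; data_name_or_path points at the folder whose dict.txt
# fairseq should load alongside the checkpoint weights.
bart = BARTModel.from_pretrained(
    "chemical/checkpoints/bart.large",                    # model_name_or_path
    checkpoint_file="model.pt",                           # placeholder file name
    data_name_or_path="chemical/checkpoints/bart.large",  # directory containing dict.txt
    bpe="sentencepiece",
    sentencepiece_model=f"{root}/BARTSmiles/chemical/tokenizer/chem.model",
)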
When I download the BARTModel that is stored in 'chemical/checkpoints/bart.large', it comes with a dict.txt that is ~52k lines. To my understanding, this is the BPE vocabulary for all of the words that BART was trained on.
Now, when I run my code to load the model while keeping the default dict.txt, I get an error saying, in essence, that the number of tokens in the model does not match the checkpoint I am trying to load.
But if I instead use the dict.txt that preprocessing.py generates (the same as chem.vocab.fs), the model loads fine.
My question is: is this valid? Do I need a separate dict.txt? I'm concerned because the original dict.txt from BART-large contains {token, count} pairs, whereas the dict.txt we are using contains {str, token} pairs.
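For reference, this is the quick sanity check I would run to see which dictionary actually matches the checkpoint; the file paths and the state-dict key are assumptions on my part, but the comparison itself is just the vocabulary size against the number of token-embedding rows that the size-mismatch error refers to:

import torch
from fairseq.data import Dictionary

# Placeholder paths for the checkpoint and the two candidate dictionaries.
ckpt_path = "chemical/checkpoints/bart.large/model.pt"
original_dict = Dictionary.load("chemical/checkpoints/bart.large/dict.txt")
regenerated_dict = Dictionary.load("processed/dict.txt")  # the one preprocessing.py wrote

# fairseq checkpoints keep their weights under the "model" key; the embedding
# key below assumes the standard fairseq BART parameter naming.
state = torch.load(ckpt_path, map_location="cpu")
num_embeddings = state["model"]["encoder.embed_tokens.weight"].shape[0]

print("embedding rows in checkpoint:", num_embeddings)
print("len(original dict.txt):", len(original_dict))
print("len(regenerated dict.txt):", len(regenerated_dict))
# Whichever dictionary length matches the embedding row count is the one the
# checkpoint expects (fairseq may also pad the vocab with madeupword tokens).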