YerevaNN / BARTSmiles

BARTSmiles, generative masked language model for molecular representations

When loading BART via fairseq, which dict.txt should be in the model directory? #8

Open kosonocky opened 1 year ago

kosonocky commented 1 year ago

Hi,

I am loading this model via fairseq in Python, more or less copying the code in compute_score.py, lines 79-81:

bart = BARTModel.from_pretrained(model, checkpoint_file = chkpt_path, bpe="sentencepiece", sentencepiece_model=f"{root}/BARTSmiles/chemical/tokenizer/chem.model")

And here is what I am calling:

from fairseq.models.bart import BARTModel

bart = BARTModel.from_pretrained('chemical/checkpoints/bart.large',
                                 checkpoint_file='pretrained.pt',
                                 bpe='sentencepiece',
                                 sentencepiece_model='chemical/tokenizer/chem.model')

When I download the BARTModel stored in 'chemical/checkpoints/bart.large', it comes with a dict.txt that is ~52k lines. To my understanding, this is the BPE vocabulary for all of the words that BART was trained on.

Now, when I run my code to load the model while keeping the default dict.txt, I get an error basically saying that the number of tokens in the model does not match the checkpoint I am trying to load.
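
For reference, here is a quick sanity check I can run to compare the vocabulary implied by dict.txt with the embedding table stored in the checkpoint. This is just a sketch: the paths and the checkpoint key name are assumptions on my part, and fairseq adds a few special tokens (<s>, <pad>, </s>, <unk>, and <mask> for BART) on top of dict.txt, so I'd expect a small offset rather than an exact match.

import torch

# Paths are assumptions based on my local layout.
dict_path = "chemical/checkpoints/bart.large/dict.txt"
ckpt_path = "chemical/checkpoints/bart.large/pretrained.pt"

with open(dict_path) as f:
    n_dict = sum(1 for _ in f)

state = torch.load(ckpt_path, map_location="cpu")
# Key name may differ slightly across fairseq versions.
n_emb = state["model"]["encoder.embed_tokens.weight"].shape[0]

# With the right dict.txt the gap should be just the handful of specials,
# not tens of thousands of tokens.
print(f"dict.txt lines: {n_dict}, checkpoint embedding rows: {n_emb}")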

But if I instead use the dict.txt generated by preprocessing.py (the same as chem.vocab.fs), the model loads fine.

My question is: is this valid? Do I need a separate dict.txt? I'm concerned because the original dict.txt from bart.large contains {token, count} pairs, whereas the dict.txt we are using contains {str, token} pairs.
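
If I read fairseq's Dictionary.add_from_file correctly, each line of dict.txt is split from the right into a symbol and a second field that must parse as an integer, and token indices are simply assigned in line order after the built-in specials, so the meaning of the second column (count vs. id) shouldn't affect indexing as long as the symbols and their order match what the checkpoint was trained with. A minimal sketch of how I'd inspect the loaded dictionary (the path is an assumption):

from fairseq.data import Dictionary

d = Dictionary.load("chemical/checkpoints/bart.large/dict.txt")
print(len(d))          # total vocab size, including the built-in specials
print(d.symbols[:10])  # the first few symbols, in index order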