Tokenizer issue in pretrain_dataset.py for the evaluation of Molecule-Text Retrieval for PCDes

I run the script python stage1.py --root 'data/kv_data' --gtm --lm --devices '[0]' --filename pcdes_evaluation --init_checkpoint "all_checkpoints/share/stage1.ckpt" --rerank_cand_num 128 --num_query_token 8 --match_batch_size 64 --mode eval for Molecule-Text Retrieval for PCDes and encountered the following error: MolCA/data_provider/pretrain_dataset.py, line 78, in tokenizer_text sentence_token = self.tokenizer(text=text, TypeError: 'NoneType' object is not callable

To resolve this error, I imported the Blip2Base class from the model.blip2 and made it the tokenizer in the pretrain_dataset.py script (see the two screenshots below). However, the accuracy I obtained after running your checkpoint for the evaluation of Molecule-Text Retrieval for PCDes is much lower than what you posted on the paper. I am not sure if I used the same tokenizer you used when evaluating the accuracy. But the tokenizer error indeed existed, if some tokenizer is not added to the pretrain_dataset.py script.

add tokenizer to the pretrain_dataset script tokenizer initialization is incorrect in blip2 script

acharkq / MolCA

Tokenizer issue in pretrain_dataset.py for the evaluation of Molecule-Text Retrieval for PCDes #13