acharkq / MolCA

Code for EMNLP2023 paper "MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter".

Tokenizer issue in pretrain_dataset.py for the evaluation of Molecule-Text Retrieval for PCDes #13

Open Koaladopamine opened 1 month ago

Koaladopamine commented 1 month ago

I ran the script

`python stage1.py --root 'data/kv_data' --gtm --lm --devices '[0]' --filename pcdes_evaluation --init_checkpoint "all_checkpoints/share/stage1.ckpt" --rerank_cand_num 128 --num_query_token 8 --match_batch_size 64 --mode eval`

for Molecule-Text Retrieval on PCDes and encountered the following error:

`MolCA/data_provider/pretrain_dataset.py, line 78, in tokenizer_text: sentence_token = self.tokenizer(text=text, ... TypeError: 'NoneType' object is not callable`

To resolve this error, I imported the `Blip2Base` class from `model.blip2` and used its tokenizer in the `pretrain_dataset.py` script (see the two screenshots and the sketch below). However, the accuracy I obtained when running your checkpoint on Molecule-Text Retrieval for PCDes is much lower than the numbers reported in the paper. I am not sure whether I used the same tokenizer you used in your evaluation. In any case, the tokenizer error does exist whenever no tokenizer is added to the `pretrain_dataset.py` script.

[Screenshot 1: add tokenizer to the pretrain_dataset script]
[Screenshot 2: tokenizer initialization is incorrect in blip2 script]
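Concretely, my patch looks roughly like the sketch below (a trimmed-down version of what the screenshots show; details such as `text_max_len` and the tokenizer keyword arguments may differ slightly, and `Blip2Base.init_tokenizer()` loads whatever pretrained tokenizer the repo's `blip2.py` specifies):

```python
# Sketch of my patch to data_provider/pretrain_dataset.py: initialize the
# tokenizer via Blip2Base instead of leaving self.tokenizer as None.
from model.blip2 import Blip2Base


class PretrainDataset:
    def __init__(self, text_max_len=128):
        # Previously self.tokenizer was never set, which caused the
        # "'NoneType' object is not callable" error at line 78.
        self.tokenizer = Blip2Base.init_tokenizer()
        self.text_max_len = text_max_len

    def tokenizer_text(self, text):
        # Tokenize one caption into fixed-length input_ids / attention_mask.
        sentence_token = self.tokenizer(
            text=text,
            truncation=True,
            padding="max_length",
            add_special_tokens=True,
            max_length=self.text_max_len,
            return_tensors="pt",
            return_attention_mask=True,
        )
        return sentence_token["input_ids"], sentence_token["attention_mask"]
```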

acharkq commented 3 weeks ago

Hi,

I think you are using the right tokenizer, so the lower performance on the PCDes dataset is probably caused by something else. Have you figured it out by now?
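If it helps to rule the tokenizer out, one quick check (just a sketch, not code from the repo) is to compare the tokenizer's vocabulary size against the word-embedding matrix saved in the stage-1 checkpoint; the checkpoint path below is the one from your command, and the counts may differ by the few special tokens `init_tokenizer()` adds:

```python
# Sanity check (sketch): compare tokenizer vocab size with the text
# encoder's word-embedding rows stored in the stage-1 checkpoint.
import torch

from model.blip2 import Blip2Base

tokenizer = Blip2Base.init_tokenizer()
ckpt = torch.load("all_checkpoints/share/stage1.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)

for name, tensor in state_dict.items():
    if name.endswith("word_embeddings.weight"):
        # A large mismatch here would point to a wrong tokenizer;
        # a difference of 1-2 is expected from added special tokens.
        print(f"{name}: {tensor.shape[0]} rows vs len(tokenizer)={len(tokenizer)}")
```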