Response on the data leakage in PubChem324k dataset

@ZwormZ

Hi,

This is an update on our new experiments regarding the data leakage in the PubChem324k dataset. I am posting it as a new GitHub issue because I do not have your email address.

We have filtered the pretrain subset of PubChem324k and rerun the main MolCA experiments on it. Screenshots of the results are attached below:

Molecule captioning on ChEBI-20:

Molecule-Text Retrieval on the PCDes dataset:

Molecule-Text Retrieval on the MoMu dataset:

Our observation is that MolCA still outperforms the baselines, although its performance is lower when it is pretrained on the new filtered dataset.

Note that some baselines are also affected by data leakage, and we have not finished retraining them yet. We plan to release the new filtered dataset and update our PDF once we finish reproducing these baselines.
