acharkq / test

Response on the data leakage in PubChem324k dataset #1

Open acharkq opened 6 months ago

acharkq commented 6 months ago

@ZwormZ

Hi,

This is an update on our new experiments regarding the data leakage in the PubChem324k dataset. I am opening a new GitHub issue to reach you because I do not have your email address.

We have filtered the pretraining subset of PubChem324k and rerun the main experiments of MolCA. Screenshots of the results are attached below:

Molecule captioning on ChEBI-20:

Screenshot 2023-12-20 at 14 21 43

Molecule-Text Retrieval on PCDEs dataset:

Screenshot 2023-12-20 at 14 23 01

Molecule-Text Retrieval on MoMu dataset:

Screenshot 2023-12-20 at 14 23 43

We observe that MolCA still outperforms the baselines, although its performance is lower when using the new filtered dataset.

Note that some baselines are also affected by data leakage, and we have not finished retraining them yet. We plan to release the new filtered dataset and update our PDF once we finish reproducing these baselines.
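
For illustration only, here is a minimal sketch of how overlap between a pretraining subset and downstream test sets could be removed. It is not the actual MolCA filtering procedure. It assumes that leakage means a pretraining molecule also appears in a test set, and that each molecule is identified by a SMILES string that has already been canonicalized. The function name `filter_leaked`, the record layout, and the toy data are all hypothetical.

```python
def filter_leaked(pretrain, test_smiles):
    """Drop pretraining records whose SMILES also appears in a test set.

    Assumption: SMILES strings are already canonicalized, so exact string
    matching is enough to detect the same molecule.
    """
    held_out = set(test_smiles)
    return [rec for rec in pretrain if rec["smiles"] not in held_out]


# Toy data, hypothetical; not taken from PubChem324k or any benchmark.
pretrain = [
    {"smiles": "CCO", "text": "ethanol"},
    {"smiles": "C", "text": "methane"},
    {"smiles": "O", "text": "water"},
]
test_smiles = ["CCO", "CC(=O)O"]

filtered = filter_leaked(pretrain, test_smiles)
print(len(pretrain) - len(filtered), "record(s) removed")
```

In practice the SMILES strings would need to be canonicalized with a cheminformatics toolkit before matching, since the same molecule can be written in several ways.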

acharkq commented 6 months ago

@ZwormZ

Thank you again for raising this issue with us. Could you let us know your thoughts on our updated results? We would very much appreciate it.

Thanks!