DeepGraphLearning / ProtST

[ICML-23 ORAL] ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
Apache License 2.0

About data leakage on zero-shot classification? #9

Open LTEnjoy opened 10 months ago

LTEnjoy commented 10 months ago

Hello!

Thanks for your great work! I tested zero-shot classification with your released checkpoint and it performs well. However, I am wondering whether there is a data leakage problem. Your model was fine-tuned on the Swiss-Prot database, and the DeepLoc dataset was also constructed from the UniProt database. Did you do any filtering when you tested zero-shot performance?

Looking forward to your reply! Thanks in advance!

KatarinaYuan commented 10 months ago

Hi, thank you for your interest in our work!

Please see the pre-training dataset: https://github.com/DeepGraphLearning/ProtST/blob/db53a76ed2430eb66dd9c8134ace99fd60980fb3/protst/dataset.py#L22. It does not expose the labeled test data of any benchmark dataset. That data is not observed during multimodal pre-training or downstream fine-tuning.
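
For readers who want to check for overlap themselves, here is a minimal sketch of an exact-sequence overlap test between a pre-training set and a benchmark test set. It is not part of the ProtST codebase. The helper names `parse_fasta` and `exact_overlap` and the toy records are assumptions made for illustration, and exact matching only catches identical sequences, not near-duplicates or homologs.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Flush the previous record before starting a new one.
            if current_id is not None:
                records[current_id] = "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            current_id = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line.upper())
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


def exact_overlap(pretrain: Dict[str, str], test: Dict[str, str]) -> List[str]:
    """Return sorted IDs of test records whose sequence also occurs in pretrain."""
    seen = set(pretrain.values())
    return sorted(tid for tid, seq in test.items() if seq in seen)


# Toy data, made up for illustration (not real ProtST or DeepLoc records).
pretrain_fasta = """>P1 toy pre-training protein
MKTAYIAKQR
QISFVKSHFS
>P2
MGSSHHHHHH
""".splitlines()

test_fasta = """>T1
MKTAYIAKQRQISFVKSHFS
>T2
MALWMRLLPL
""".splitlines()

leaked = exact_overlap(parse_fasta(pretrain_fasta), parse_fasta(test_fasta))
print(leaked)  # ['T1']
```

In practice the two inputs would be read from the actual pre-training and benchmark files, and a sequence-identity clustering tool would be needed to detect similar but non-identical proteins.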

LTEnjoy commented 10 months ago

Hi,

Thank you for the reply and I'll check it out!