A4Bio / ProteinInvBench

The official implementation of the NeurIPS'23 paper ProteinInvBench: Benchmarking Protein Design on Diverse Tasks, Models, and Metrics

CATH4.3 split - leaking topology? #9

Closed by KoubaPetr 9 months ago

KoubaPetr commented 9 months ago

Hi! In the paper, you mention that you treat CATH4.3 the same way ESM-IF does. ESM-IF uses a topology split plus some extra filtering, but the main paper relies only on the topology split and reports results on more stringent test sets in the appendix. I therefore assume that your CATH4.3 splits are topology splits.

In the splits you released online, I may have found some leaks; please correct me if I misunderstand the topology splitting. I looked up the mapping between topology classes and PDB chain codes officially released by CATH (ftp://orengoftp.biochem.ucl.ac.uk:21/cath/releases/all-releases/v4_3_0/cath-classification-data/cath-names-v4_3_0.txt), which gives a set of CATH IDs for each PDB chain code. For the CATH IDs that include the topology level, i.e. those of the form C.A.T.XXX, I extracted the T as the topology class.

For your test and train splits, I then constructed the sets of topology classes. I found 55 distinct topology classes in the test set and 329 in the train set, with 34 classes appearing in both.
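
A minimal sketch of the check described above, assuming the CATH mapping has already been parsed into a dictionary from PDB chain code to a set of CATH IDs. The names `topology_number`, `topology_classes`, and `chain_to_cath`, as well as the chain codes and CATH IDs in the toy data, are hypothetical placeholders and are not taken from the released splits.

```python
from typing import Dict, Iterable, Set


def topology_number(cath_id: str) -> str:
    """Return only the T field of a C.A.T.XXX-style ID, as in the check above."""
    parts = cath_id.split(".")
    if len(parts) < 3:
        raise ValueError(f"CATH ID has no topology level: {cath_id!r}")
    return parts[2]


def topology_classes(chains: Iterable[str],
                     chain_to_cath: Dict[str, Set[str]]) -> Set[str]:
    """Collect the topology classes of all chains in one split."""
    classes: Set[str] = set()
    for chain in chains:
        # Chains missing from the mapping contribute nothing.
        for cath_id in chain_to_cath.get(chain, ()):
            classes.add(topology_number(cath_id))
    return classes


# Toy data (hypothetical): chain code -> set of CATH IDs.
chain_to_cath = {
    "1abc.A": {"1.10.8.10"},
    "2def.B": {"3.40.50.300", "3.30.70.100"},
    "3ghi.A": {"3.40.8.20"},
}
train_chains = ["1abc.A", "2def.B"]
test_chains = ["3ghi.A"]

train_classes = topology_classes(train_chains, chain_to_cath)
test_classes = topology_classes(test_chains, chain_to_cath)
overlap = train_classes & test_classes

print(len(test_classes), len(train_classes), len(overlap))
```

On the real splits, the three printed numbers would correspond to the test, train, and overlapping class counts reported above.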

I may be missing something. I would be very grateful for any clarification of the data-splitting protocol.

Thank you very much in advance! Best,

Petr Kouba

KoubaPetr commented 9 months ago

I think I have figured out the splitting: it is not done on the topology number alone but also includes the class and architecture, so in terms of the full C.A.T codes the splits are non-overlapping. I am closing the issue.
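
The difference between the two ways of keying a topology can be illustrated with a small sketch. The helper names `t_only` and `cat_code` and the CATH IDs below are made up for illustration; they are not taken from the released splits.

```python
def t_only(cath_id: str) -> str:
    """Key a domain by its T field alone (ambiguous across classes/architectures)."""
    return cath_id.split(".")[2]


def cat_code(cath_id: str) -> str:
    """Key a domain by its full C.A.T prefix (class, architecture, topology)."""
    return ".".join(cath_id.split(".")[:3])


# Hypothetical CATH IDs: both splits contain a topology numbered 8,
# but under different classes and architectures.
train_ids = {"1.10.8.10", "3.40.50.300"}
test_ids = {"3.40.8.20", "2.60.40.10"}

overlap_t = {t_only(i) for i in train_ids} & {t_only(i) for i in test_ids}
overlap_cat = {cat_code(i) for i in train_ids} & {cat_code(i) for i in test_ids}

print("T-only overlap:", sorted(overlap_t))
print("C.A.T overlap:", sorted(overlap_cat))
```

In this toy case the T-only key reports a shared class even though no full C.A.T code occurs in both splits, which is the kind of apparent leak that disappears once class and architecture are part of the key.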