CATH4.3 split - leaking topology?

Hi! In the paper, you mention that you treat CATH4.3 the same way ESM-IF does. ESM-IF does a topology split plus some extra filtering, but in the main paper they use only the topology split and report results on more stringent test sets in the appendix. I therefore assume that your CATH4.3 splits are topology splits.

In the splits you released online, I may have found some leaks; please correct me if I misunderstand topology splitting. I looked up the mapping between topology classes and PDB chain codes officially released by CATH (ftp://orengoftp.biochem.ucl.ac.uk:21/cath/releases/all-releases/v4_3_0/cath-classification-data/cath-names-v4_3_0.txt), so that for each PDB chain code I have a set of CATH IDs. For the CATH IDs that include the topology level, i.e. those of the form C.A.T.XXX, I extracted T as the topology class.

For your test and train splits I then constructed sets of topology classes. I found 55 distinct topology classes in the test set and 329 in the train set, with 34 classes appearing in both.
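
A minimal Python sketch of this kind of overlap check, for reference. It is not the exact script used: the helper names `topology_of` and `topology_set`, the chain codes, and the toy chain-to-CATH-ID mapping are all made up for illustration, and the mapping is assumed to be already parsed into a dictionary. The sketch keys each class on the full C.A.T prefix so that the T number stays tied to its class and architecture; that is a choice of the sketch.

```python
def topology_of(cath_id):
    """Return the C.A.T prefix of a CATH ID, or None if it has fewer than three levels."""
    parts = cath_id.split(".")
    if len(parts) < 3:
        return None
    return ".".join(parts[:3])


def topology_set(chains, chain_to_cath_ids):
    """Collect every topology class seen among the given chains."""
    topologies = set()
    for chain in chains:
        # Chains missing from the mapping contribute nothing.
        for cath_id in chain_to_cath_ids.get(chain, ()):
            topo = topology_of(cath_id)
            if topo is not None:
                topologies.add(topo)
    return topologies


# Toy, made-up mapping from chain code to CATH IDs (illustration only).
chain_to_cath_ids = {
    "1abc.A": {"3.40.50.300"},
    "2xyz.B": {"3.40.50.720", "1.10.10.10"},
    "3def.A": {"2.60.40.10"},
    "4ghi.C": {"1.10.10.60"},
}
train_chains = ["1abc.A", "2xyz.B"]
test_chains = ["3def.A", "4ghi.C"]

train_topos = topology_set(train_chains, chain_to_cath_ids)
test_topos = topology_set(test_chains, chain_to_cath_ids)
overlap = train_topos & test_topos

print(len(test_topos), len(train_topos), len(overlap))
print(sorted(overlap))
```

With the real chain lists and the real mapping substituted in, a non-empty `overlap` would list the topology classes shared between test and train.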

I may be missing something. I would be very grateful for any clarification of the data-splitting protocol.

Thank you very much in advance! Best,

Petr Kouba
