Why do models perform worse on the 60% identity split of the LBA dataset than on the 30% split?

drorlab / atom3d

ATOM3D: tasks on molecules in three dimensions

MIT License

300 stars, 35 forks

Hi, thank you for the amazing work! I am curious about the results in Table 8. Why do most models (other than the GNN) perform dramatically worse on the 60% identity split than on the 30% identity split? Intuitively, the 60% split should be the easier task and yield better performance, since it allows more sequence similarity between proteins.
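For readers unfamiliar with identity splits, the intuition in the question can be sketched as follows. This is only a toy illustration, not the ATOM3D splitting code: the `identity` and `filter_train` helpers and the example sequences are hypothetical, and `difflib.SequenceMatcher` is a rough stand-in for a real alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Rough similarity score in [0, 1]; a stand-in for real sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def filter_train(train, test, threshold):
    """Keep training sequences whose similarity to every test sequence is below the threshold."""
    return [s for s in train if all(identity(s, t) < threshold for t in test)]


# Hypothetical toy sequences, chosen only to show the effect of the threshold.
test = ["MKTAYIAKQR"]
train = ["MKTAYGGGGG", "WWWWWWWWWW", "MKTAYIAKQW"]

strict = filter_train(train, test, 0.3)  # 30% threshold: only dissimilar sequences remain
loose = filter_train(train, test, 0.6)   # 60% threshold: more similar sequences are allowed

print(len(strict), len(loose))
```

Under this toy setup the looser threshold retains training sequences that resemble the test sequence more closely, which is why one would expect the 60% split to be easier.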

I agree with your intuition. However, ATOM3D is not the only work to report this; several follow-up studies show a similar phenomenon. For instance, the table below is copied from Multi-Scale Representation Learning on Proteins.

Why do models perform worse on the 60% identity split of the LBA dataset than on the 30% split? #57