Bitbol-Lab / ProtMamba-ssm

ProtMamba: a homology-aware but alignment-free protein state space model
https://www.biorxiv.org/content/10.1101/2024.05.24.595730v1
Apache License 2.0

Test Set Creation #4

Closed: smdrnks closed this issue 2 months ago

smdrnks commented 2 months ago

Hi, thank you for your work and the nice codebase. I have a question regarding the creation of the held-out test set, referenced by "encoded_MSAs_test.pkl". In core.py, the training and validation sets are split on the fly with seed=0 and eval_size=193. Am I correct that the test set is sampled similarly, i.e. from the full training set (without replacement), with the same seed, and that it consists of 500 clusters as described in section 2.3 of the paper?

Thank you!
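
For readers following the question, the kind of split being asked about can be sketched as below. This is only an illustration of a seeded split without replacement, not the code in core.py: the helper name `split_clusters`, the placeholder cluster ids, and the use of Python's `random` module are all assumptions. Only `seed=0` and `eval_size=193` are taken from the question.

```python
import random

def split_clusters(cluster_ids, eval_size, seed):
    """Hypothetical helper: shuffle a copy of the ids with a fixed seed,
    then cut off the first `eval_size` ids as the held-out part."""
    ids = sorted(cluster_ids)   # fixed starting order, so the seed alone decides the result
    rng = random.Random(seed)   # local RNG; leaves the global random state untouched
    rng.shuffle(ids)
    return ids[eval_size:], ids[:eval_size]   # (train, held-out)

# Placeholder ids for illustration; real cluster ids would come from the dataset.
all_ids = [f"cluster_{i:05d}" for i in range(2000)]
train_ids, eval_ids = split_clusters(all_ids, eval_size=193, seed=0)
```

Because every id ends up in exactly one of the two lists, the held-out part is sampled without replacement, and rerunning with the same seed reproduces the same split.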

damiano-sg commented 2 months ago

Hi, the test set is made of 500 clusters held out from the training set, but we use them in different ways: to compute the test perplexity we use all of them, whereas to generate novel sequences we use only a few clusters (the ones listed in the appendix). We realized that we haven't yet shared the full test set in this repo, and we plan to do so as soon as possible!

smdrnks commented 2 months ago

Providing the test set would be great! Thank you for your answer.

CyrilMa commented 2 months ago

Hi @smdrnks,

Thank you for your patience. We have added two new txt files with the cluster ids used in the train/test split: test ids and train ids here.

Please let us know if you need anything else!
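
A quick way to sanity-check such id lists is to load both and confirm that no cluster appears on both sides of the split. The sketch below assumes each txt file holds one cluster id per line; that layout, the helper names `read_ids` and `overlapping_ids`, and the file names in the usage comment are assumptions rather than something stated in this thread.

```python
from pathlib import Path

def read_ids(path):
    """Read one cluster id per line (assumed layout), skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}

def overlapping_ids(train_ids, test_ids):
    """Return the ids present in both sets; an empty set means the split is clean."""
    return train_ids & test_ids

# Usage with hypothetical file names:
# train = read_ids("train_ids.txt")
# test = read_ids("test_ids.txt")
# print(len(train), len(test), overlapping_ids(train, test))
```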