MingyuJ666 / ProLLM

[COLM'24] We propose Protein Chain of Thought (ProCoT), which replicates the biological mechanism of signaling pathways as language prompts. It considers a signaling pathway as a protein reasoning process, which starts from upstream proteins and passes through several intermediate proteins to transmit biological signals to downstream proteins.

How to evaluate the performance? #2

Open HHW-zhou opened 2 months ago

HHW-zhou commented 2 months ago

Thank you for your work. However, as far as I know, STRING-related tasks are typically multi-label, yet the training data in your code appears to be single-label. Additionally, after splitting with the provided script, some data in the test set also appears in the training set. Could you explain how you evaluate the model, and possibly provide the evaluation script?
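For concreteness, an overlap check along these lines can surface the issue; the file names and the `prompt` field below are placeholders for whatever the split script actually writes, not the repository's real layout:

```python
import json

# Hypothetical paths and field name; substitute the actual split outputs.
with open("train.json") as f:
    train = json.load(f)
with open("test.json") as f:
    test = json.load(f)

# Collect the QA text of each record (assuming a "prompt" key).
train_prompts = {rec["prompt"] for rec in train}
test_prompts = {rec["prompt"] for rec in test}

overlap = train_prompts & test_prompts
print(f"{len(overlap)} of {len(test_prompts)} test examples also appear in the training set")
```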

tiuxuxsh76075 commented 2 months ago

Thank you for your feedback! We apologize for not clearly explaining some details on GitHub.

  1. Regarding the dataset labels: We confirm that the dataset is multi-label; however, the code we have uploaded so far handles the single-label task. We chose to release the single-label version first because, in certain biological contexts, predicting a single interaction partner is meaningful on its own: single-label outputs are more straightforward to interpret and to validate experimentally. Single-label prediction can also reduce noise and improve accuracy, since multi-label training may introduce correlations or conflicts between labels. We therefore believe the single-label setup has biological relevance in specific experimental settings.

  2. Regarding the dataset splitting issue: We split the PPI dataset before generating the QA data, and then generated the ProCoT QA pairs on the pre-split dataset to avoid data leakage. Moving forward, we plan to add an automated check for potential leakage (along the lines of the overlap check above) to ensure the integrity of our results.

  3. Future updates: We plan to upload a multi-label version of the code and a complete evaluation script shortly, and we will update the README with detailed instructions on how to use these scripts to evaluate model performance. These updates should help users apply the model to multi-label tasks; in the meantime, a rough sketch of the intended evaluation follows below.
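Until the official script is uploaded, here is a minimal scikit-learn sketch of the two evaluation modes discussed above (single-label accuracy versus multi-label micro-F1); the gene names and label encoding are illustrative placeholders, not the repository's actual data format:

```python
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# Single-label mode: exactly one predicted interaction partner per query
# (placeholder gene names, not real predictions).
y_true = ["TP53", "EGFR", "BRCA1"]
y_pred = ["TP53", "EGFR", "MYC"]
print("single-label accuracy:", accuracy_score(y_true, y_pred))

# Multi-label mode: each query may have several true downstream proteins.
true_sets = [{"TP53", "MDM2"}, {"EGFR"}, {"BRCA1", "RAD51"}]
pred_sets = [{"TP53"}, {"EGFR", "KRAS"}, {"BRCA1", "RAD51"}]

# Fit the binarizer on the union of labels so unseen predictions do not warn.
mlb = MultiLabelBinarizer()
mlb.fit(true_sets + pred_sets)
Y_true = mlb.transform(true_sets)
Y_pred = mlb.transform(pred_sets)
print("multi-label micro-F1:", f1_score(Y_true, Y_pred, average="micro"))
```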

Thank you again for your attention and feedback!