Confusion about splitting the dataset by similarity

KennthShang / PhaVIP

Phage virion protein classifier

GNU General Public License v3.0


Confusion about splitting the dataset by similarity #8

Closed moonhuwaa closed 1 week ago

moonhuwaa commented 1 month ago

Hi Jiayu,

Thank you for providing PhaVIP, and for releasing the PVP datasets at different similarity thresholds. I am confused about one part of the similarity-based dataset split: how should the non-PVP data be selected when performing binary classification? Could you explain how the non-PVP data are chosen?

Best regards, moonhuwaa

KennthShang commented 1 month ago

Hi there,

We followed the definition used in the cited papers, such as DeepVP. BTW, most of these non-PVP data are enzymes.

Hope this information helps.

Best, Jiayu

moonhuwaa commented 1 month ago

Thank you for your answer. I would also like to know how non-PVP samples are assigned to the training and test sets when performing binary classification (PVP recognition) on the datasets split by similarity. For example, at 40% similarity you provide the PVP training and test sets, but not the corresponding non-PVP training and test sets. Are they randomly selected from the original non-PVP dataset? If so, how many non-PVP samples are used?

KennthShang commented 1 month ago

They are randomly selected, and the number of non-PVP samples should equal the number of PVP samples to ensure a balanced dataset.
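For readers who want to reproduce this, here is a minimal sketch of balanced random selection. It is not the PhaVIP authors' script. The function name `sample_balanced_negatives`, the fixed seed, sampling without replacement, and keeping the non-PVP training and test sets disjoint are all assumptions made for illustration; the thread only states that negatives are drawn at random and match the PVP count.

```python
import random


def sample_balanced_negatives(n_pvp_train, n_pvp_test, non_pvp_ids, seed=0):
    """Randomly pick non-PVP IDs so each split has as many negatives as PVPs.

    Hypothetical helper: sampling without replacement and disjoint
    train/test negatives are assumptions, not confirmed by the thread.
    """
    needed = n_pvp_train + n_pvp_test
    pool = list(non_pvp_ids)
    if needed > len(pool):
        raise ValueError(
            f"need {needed} non-PVP samples but only {len(pool)} are available"
        )
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    picked = rng.sample(pool, needed)  # draw without replacement
    # The first n_pvp_train IDs go to training, the rest to testing.
    return picked[:n_pvp_train], picked[n_pvp_train:]


# Toy example: 6 PVP training and 2 PVP test sequences at some threshold,
# with a pool of 20 placeholder non-PVP IDs.
non_pvp_pool = [f"nonpvp_{i}" for i in range(20)]
neg_train, neg_test = sample_balanced_negatives(6, 2, non_pvp_pool, seed=42)
print(len(neg_train), len(neg_test))
```

In practice the two counts would come from the PVP training and test files provided for a given similarity threshold, and the pool from the original non-PVP dataset.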

moonhuwaa commented 1 month ago

Thank you for your reply.