IBM / AutoPeptideML

AutoML system for building trustworthy peptide bioactivity predictors
https://ibm.github.io/AutoPeptideML/
MIT License

Duplicate sequences in benchmark data #24

Closed iaposto closed 1 day ago

iaposto commented 5 days ago

Hello. Congrats on the paper and very interesting tool! I am working with the data provided in the documentation, specifically the New AutoPeptideML Benchmarks set that you used for model development. I noticed there are duplicate sequences in the training and test sets for most bioactivity datasets, as well as overlap of peptides in the two sets. Was this intended?

RaulFD-creator commented 3 days ago

Hi @iaposto, thanks for your kind words! The short answer is that this is intended, but it is not the ideal scenario.

The duplicated sequences appear in datasets for which not enough negative samples could be drawn from the AutoPeptideML - Peptipedia subset. This database is fairly big, but for datasets with many samples, like Antibacterial or Antimicrobial (which themselves make up a significant share of Peptipedia), excluding overlapping bioactivities left the positive and negative classes imbalanced. By default, AutoPeptideML oversamples (randomly duplicates) the underrepresented class, which in this case is the negative peptides; hence the duplicated entries. I have double-checked in case something had slipped through the cracks, but I couldn't find any instance of a duplicated positive entry. If you have found any, please let me know, as it may be due to a bug in the code that needs to be addressed.
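To illustrate why random oversampling produces repeated rows, here is a minimal sketch of the general technique. It is not AutoPeptideML's actual code: the function name `oversample_minority`, the `sequence` and `Y` keys, and the toy peptides are all assumptions made for the example.

```python
import random

def oversample_minority(records, label_key="Y", seed=0):
    """Randomly duplicate minority-class records until both classes have equal size.

    Hypothetical helper for illustration only; assumes binary labels 0/1.
    """
    rng = random.Random(seed)
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    # Order the two classes by size: smaller first, larger second.
    minority, majority = sorted((positives, negatives), key=len)
    if not minority:
        return list(records)  # nothing to sample from
    # Draw with replacement from the minority class to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# Toy data (made up): three positives, one negative.
records = [
    {"sequence": "AAK", "Y": 1},
    {"sequence": "GLF", "Y": 1},
    {"sequence": "KKL", "Y": 1},
    {"sequence": "PPG", "Y": 0},
]
balanced = oversample_minority(records)
n_negatives = sum(r["Y"] == 0 for r in balanced)
```

In this toy case the single negative peptide is repeated until there are as many negatives as positives, so every duplicate in the output belongs to the underrepresented class.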

In the end, the datasets keep the duplicated entries for the sake of reproducibility, but some users may have better strategies for handling the imbalance, in which case it makes sense to drop the duplicated entries.
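For anyone who prefers to remove the repeats before applying their own balancing strategy, a sketch along these lines should work. The helper `drop_duplicates` and the `sequence` / `Y` keys are assumptions; adapt them to the actual column names in the benchmark files.

```python
def drop_duplicates(records, seq_key="sequence", label_key="Y"):
    """Keep the first occurrence of each (sequence, label) pair, preserving order.

    Hypothetical helper for illustration only.
    """
    seen = set()
    unique = []
    for record in records:
        key = (record[seq_key], record[label_key])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Toy data (made up): one negative peptide repeated by oversampling.
records = [
    {"sequence": "PPG", "Y": 0},
    {"sequence": "AAK", "Y": 1},
    {"sequence": "PPG", "Y": 0},
    {"sequence": "PPG", "Y": 0},
]
unique = drop_duplicates(records)
```

Note that dropping the duplicates restores the original class imbalance, so it only makes sense together with an alternative strategy such as class weighting.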

Regarding the overlap between peptides in the two sets, could you clarify what you mean?
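If it helps to make the question concrete, one way to report such an overlap is to list the sequences that occur in both splits. This is only a sketch: `shared_sequences`, the `sequence` key, and the toy splits are assumptions, not part of AutoPeptideML.

```python
def shared_sequences(train, test, seq_key="sequence"):
    """Return the set of sequences present in both the training and test records.

    Hypothetical helper for illustration only.
    """
    train_seqs = {record[seq_key] for record in train}
    test_seqs = {record[seq_key] for record in test}
    return train_seqs & test_seqs

# Toy splits (made up) that share one peptide.
train = [{"sequence": "AAK", "Y": 1}, {"sequence": "PPG", "Y": 0}]
test = [{"sequence": "PPG", "Y": 0}, {"sequence": "GLF", "Y": 1}]
overlap = shared_sequences(train, test)
```

An empty result would mean no exact sequence is shared between the two sets; exact matching does not detect near-identical peptides.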

I hope that I have been able to address your question. Please do not hesitate to follow up if you have any further questions.

iaposto commented 1 day ago

@RaulFD-creator thank you for the prompt and detailed answer!