Carrion-lab / bacLIFE


Minimal number of genomes needed #5

Closed: sentausa closed this issue 4 months ago

sentausa commented 4 months ago

Hi,

This seems to be a very interesting tool. I wonder how many genomes (with annotated metadata) are needed at minimum to get a good prediction of the bacterial lifestyle?

Thank you.

Carrion-lab commented 4 months ago

Hi Sentausa,

Thanks for your kind words. Please see this text extracted from the Methods section:

Although bacLIFE is designed for big datasets, it can be used for small datasets with few genomes if there is a very strong association between genes and lifestyles; of course, as in all such statistical analyses, the more genomic observations are available in the input data, the more power the analysis will have. Crucial for the success rate is to have an approximate minimum of 10 genomes per lifestyle in the comparative analysis to have reliable statistics.

Best wishes, Victor.

sentausa commented 4 months ago

Ah, I didn't see that. But only 10? That's fantastic! Are there any further details in the paper or in the supplementary materials that support/show this? And thanks for the quick reply!

gguerr001 commented 4 months ago

Having 10-20 genomes should suffice for identifying lifestyle-associated genes (for instance, 10 plant pathogens versus 10 animal pathogens). However, when it comes to predicting bacterial lifestyles, the situation is different. I would advise having a minimum of 20 genomes per class to ensure reliable predictions. Note that even with this amount of data, the prediction model may still be prone to overfitting because the dataset is small.
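
The two rules of thumb in this thread (roughly 10 genomes per lifestyle for the gene-association analysis, and at least 20 per class for lifestyle prediction) can be checked against a metadata table before running anything. The snippet below is only a minimal sketch in plain Python: it is not part of bacLIFE, and the function name, constant names, and example labels are hypothetical. It assumes you have one lifestyle label per genome.

```python
from collections import Counter

# Thresholds taken from the replies in this thread; the constant names are hypothetical.
MIN_FOR_ASSOCIATION = 10  # approximate minimum genomes per lifestyle for association analysis
MIN_FOR_PREDICTION = 20   # advised minimum genomes per class for lifestyle prediction


def check_lifestyle_counts(lifestyles):
    """Map each lifestyle to (genome count, enough for association, enough for prediction)."""
    counts = Counter(lifestyles)
    return {
        name: (n, n >= MIN_FOR_ASSOCIATION, n >= MIN_FOR_PREDICTION)
        for name, n in counts.items()
    }


# Hypothetical metadata: one lifestyle label per genome in the input set.
labels = (
    ["plant_pathogen"] * 12
    + ["animal_pathogen"] * 25
    + ["environmental"] * 4
)

report = check_lifestyle_counts(labels)
for name, (n, assoc_ok, pred_ok) in sorted(report.items()):
    print(f"{name}: {n} genomes, association={assoc_ok}, prediction={pred_ok}")
```

Passing both thresholds does not rule out overfitting, as noted above. The check only flags classes that are clearly too small.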