Merck / deepbgc

BGC Detection and Classification Using Deep Learning
https://doi.org/10.1093/nar/gkz654
MIT License

leave-class out validation #64

Closed by pmobio 2 years ago

pmobio commented 2 years ago

Hi, prihoda:

I ran into some problems when using DeepBGC and need your help. I want to use leave-class-out validation to demonstrate the performance of the model, so I use the following command:

$ deepbgc train --model deepbgc.json --config PFAM2VEC pfam2vec.csv --validation vali_pos_Alkaloid.tsv --validation vali_neg.tsv train_pos_Alkaloid.tsv train_neg.tsv

where train_pos_Alkaloid.tsv contains 1077 samples from five classes (Polyketide, NRP, RiPP, Saccharide, and Terpene); train_neg.tsv contains 6752 samples, a random two thirds of the negative samples; vali_pos_Alkaloid.tsv contains 500 samples drawn by sampling with replacement according to the Alkaloid pfam file; and vali_neg.tsv contains the remaining third of the negative samples.

My questions are: 1) Is this command correct for leave-class-out validation? 2) This process seems to require a huge amount of memory; I encountered a "Segmentation fault" while it was running.

I look forward to hearing from you soon.

prihoda commented 2 years ago

Hi @pmobio,

Your approach should work. When you apply it to the original MIBiG data used in the paper, can you reproduce our ROC results?

Our leave-class-out analysis used an older version of deepbgc which didn't have the train command yet, so it was a bit more complex. You can look into the dvc files here, although I'm not sure how useful they are for the current deepbgc version: https://github.com/Merck/bgc-pipeline/tree/main/data/evaluation/lco-neg-10k

If you want to combine results across all BGC classes into one ROC curve, you will need to apply each model to its own validation set, concatenate the prediction tsv files, and then generate the ROC from the concatenated predictions, possibly reusing some of the deepbgc code.
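As a rough illustration of the pooling step, here is a minimal standard-library sketch, not the code we used for the paper. It assumes each prediction tsv has one row per scored item with a numeric `score` column and a 0/1 `label` column; both column names are hypothetical, so adjust them to whatever your prediction files actually contain. The two in-memory "files" stand in for the per-class validation predictions.

```python
import csv
import io


def read_predictions(handle, score_col="score", label_col="label"):
    """Read (score, label) pairs from one prediction TSV.

    The column names are assumptions for this sketch, not deepbgc's format.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [(float(row[score_col]), int(row[label_col])) for row in reader]


def roc_curve(pairs):
    """Return ROC points (fpr, tpr), one per distinct score threshold."""
    pos = sum(1 for _, label in pairs if label == 1)
    neg = len(pairs) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC points using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Stand-ins for two held-out classes, each scored by its own model.
fold_a = "score\tlabel\n0.9\t1\n0.2\t0\n"
fold_b = "score\tlabel\n0.6\t1\n0.7\t0\n0.1\t0\n"

# Concatenate the predictions first, then build a single ROC from the pool.
pooled = []
for text in (fold_a, fold_b):
    pooled.extend(read_predictions(io.StringIO(text)))

points = roc_curve(pooled)
print(points)
print(round(auc(points), 4))
```

With real files, replace the `io.StringIO` objects with opened tsv files. Note that pooling only makes sense if the scores from the different models are on a comparable scale.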

I believe we also averaged the results across multiple models (multiple random seeds), though I'm not sure if that can be passed in the config json. You would also need to concatenate the predictions and create the ROC from that.

As for the memory, I can imagine it using up a few GB, but definitely not more than ~8 GB; we were able to perform this on a laptop.