Problem for training DeepBGC from scratch

Merck / deepbgc

BGC Detection and Classification Using Deep Learning

https://doi.org/10.1093/nar/gkz654

MIT License


Problem for training DeepBGC from scratch #46

Open yangziyi1990 opened 3 years ago

yangziyi1990 commented 3 years ago

Hi Prihoda, I have a question. I am trying to train DeepBGC from scratch. Following the "Readme.md" file, I downloaded the MIBiG.pfam.tsv and GeneSwap_Negatives.pfam.tsv files as the positive and negative samples. Then I used the following command to train DeepBGC:

deepbgc train --model deepbgc.json --output MyDeepBGCDetector.pkl MIBiG.pfam.tsv GeneSwap_Negatives.pfam.tsv

But I got an error:

Why do we need to load "pfam2vec.csv" during training? I think this should be obtained from the trained model.

prihoda commented 3 years ago

Hi @yangziyi1990, there are two separate models - one for pfam2vec that turns pfam IDs into meaningful vectors (based on word2vec), and one for BGC detection using a sequence of pfam2vec vectors (based on LSTM). The deepbgc train command trains only the LSTM BGC detection model using BGC and non-BGC samples, so the pfam2vec vectors need to be provided.
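To make the split between the two models concrete, here is a minimal sketch of the lookup step: a pfam2vec table is read from CSV and a sequence of Pfam IDs is turned into the sequence of vectors that a detector such as the LSTM would consume. The file layout (Pfam ID in the first column, one embedding dimension per remaining column), the sample values, the helper names `load_pfam2vec` and `embed_sequence`, and the zero-vector fallback for unknown IDs are all assumptions for illustration, not DeepBGC's actual code.

```python
import csv
import io

# Hypothetical miniature of a pfam2vec.csv file. Assumed layout: first column
# is the Pfam ID, remaining columns are embedding dimensions. The real file
# may be laid out differently and has many more dimensions.
SAMPLE_CSV = """pfam_id,0,1,2
PF00001,0.1,0.2,0.3
PF00002,-0.5,0.0,0.5
"""


def load_pfam2vec(handle):
    """Read a Pfam-ID -> vector table from an open CSV handle."""
    reader = csv.reader(handle)
    header = next(reader)
    dim = len(header) - 1  # every column except the ID column
    table = {row[0]: [float(x) for x in row[1:]] for row in reader if row}
    return table, dim


def embed_sequence(pfam_ids, table, dim):
    """Map Pfam IDs to vectors; unknown IDs get a zero vector (an assumption)."""
    return [table.get(pid, [0.0] * dim) for pid in pfam_ids]


table, dim = load_pfam2vec(io.StringIO(SAMPLE_CSV))
vectors = embed_sequence(["PF00001", "PF99999", "PF00002"], table, dim)
```

The point of the sketch is only that the embedding table is an input to detector training rather than something the detector learns, which is why it has to be supplied separately.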

yangziyi1990 commented 3 years ago

Thanks, but how can I run word2vec to turn Pfam IDs into meaningful vectors? Is that code in this repository?

prihoda commented 3 years ago

I guess you're mostly interested in training the BGC detection model on new BGC samples, right? So you can reuse the previous pfam2vec.csv file, it can be downloaded from here: https://github.com/Merck/deepbgc/releases/tag/v0.1.0

yangziyi1990 commented 3 years ago

I would also like to know how to train a pfam2vec model and generate the pfam2vec.csv file.

prihoda commented 3 years ago

@yangziyi1990 You can read our methods section to find out how that was done. You'll need to run Pfam hmmscan on thousands of genomes to generate a training data corpus. This takes thousands of CPU hours; we did that on a high-performance cluster.

Then the data needs to be formatted one contig per line, with Pfam IDs from that contig separated by spaces, e.g.:

Pfam0123 Pfam0456 Pfam0789 ... <- genome 1
Pfam0987 Pfam0654 ... <- genome 2

Then you can use this script to generate the vector embedding using word2vec: https://github.com/Merck/bgc-pipeline/blob/main/bgc_detection/features/pfam2vec.py
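As a rough illustration of the formatting step described above (one contig per line, Pfam IDs separated by spaces), the following sketch groups rows of a Pfam annotation table by contig. The column names `sequence_id` and `pfam_id`, the sample rows, and the helper `pfam_corpus_lines` are assumptions made for this example; it does not reproduce the linked pfam2vec.py script, and real hmmscan output would need to be parsed into such a table first.

```python
import csv
import io

# Hypothetical tab-separated annotation table: one row per Pfam domain hit,
# in genomic order. Column names are assumed for illustration.
SAMPLE_TSV = (
    "sequence_id\tpfam_id\n"
    "contig_1\tPF00001\n"
    "contig_1\tPF00002\n"
    "contig_2\tPF00003\n"
)


def pfam_corpus_lines(handle, contig_col="sequence_id", pfam_col="pfam_id"):
    """Return one space-separated line of Pfam IDs per contig."""
    contigs = {}  # dicts keep insertion order, so contig order is preserved
    for row in csv.DictReader(handle, delimiter="\t"):
        contigs.setdefault(row[contig_col], []).append(row[pfam_col])
    return [" ".join(ids) for ids in contigs.values()]


corpus = pfam_corpus_lines(io.StringIO(SAMPLE_TSV))
# Each element of `corpus` is one "sentence" for a word2vec-style trainer.
```

Writing `"\n".join(corpus)` to a text file would give a corpus in the line-per-contig layout shown in the example above.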

yangziyi1990 commented 3 years ago

That sounds like a huge amount of work. Maybe I can just use the existing pfam2vec.csv directly when training DeepBGC from scratch.