Open yangziyi1990 opened 3 years ago
Hi @yangziyi1990, there are two separate models - one for pfam2vec that turns pfam IDs into meaningful vectors (based on word2vec), and one for BGC detection using a sequence of pfam2vec vectors (based on LSTM). The deepbgc train command trains only the LSTM BGC detection model using BGC and non-BGC samples, so the pfam2vec vectors need to be provided.
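To illustrate why the vectors have to be supplied separately, here is a minimal sketch of the lookup step that sits between the two models: each pfam ID in a sequence is replaced by its pre-trained vector before the sequence reaches the LSTM. The function name `embed_sequence`, the toy 3-dimensional vectors, and the choice to map unknown IDs to a zero vector are all assumptions made for illustration; they are not taken from the DeepBGC code or from the real pfam2vec.csv.

```python
def embed_sequence(pfam_ids, pfam2vec, dim):
    """Replace each pfam ID with its pre-trained vector.

    Unknown IDs fall back to a zero vector (an assumption for this sketch).
    """
    zero = [0.0] * dim
    return [pfam2vec.get(pfam_id, zero) for pfam_id in pfam_ids]


# Toy lookup table with made-up values; real vectors come from pfam2vec.csv.
toy_pfam2vec = {
    "Pfam0123": [0.1, 0.2, 0.3],
    "Pfam0456": [0.4, 0.5, 0.6],
}

sequence = ["Pfam0123", "Pfam9999", "Pfam0456"]
vectors = embed_sequence(sequence, toy_pfam2vec, dim=3)
for pfam_id, vector in zip(sequence, vectors):
    print(pfam_id, vector)
```

The LSTM only ever sees the vectors, so training it cannot produce the lookup table; the table has to exist beforehand.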
Thanks, but how can I run word2vec to turn pfam IDs into meaningful vectors? Is that code in this repository?
I guess you're mostly interested in training the BGC detection model on new BGC samples, right? So you can reuse the previous pfam2vec.csv file, it can be downloaded from here: https://github.com/Merck/deepbgc/releases/tag/v0.1.0
Actually, I would like to know how to train a pfam2vec model and generate the pfam2vec.csv file.
@yangziyi1990 You can read our methods section to see how that was done. You'll need to run Pfam hmmscan on thousands of genomes to generate a training data corpus. This takes thousands of CPU hours; we did that on a high-performance cluster.
Then the data needs to be formatted one contig per line, with Pfam IDs from that contig separated by spaces, e.g.:
Pfam0123 Pfam0456 Pfam0789 ... <- genome 1
Pfam0987 Pfam0654 ... <- genome 2
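The formatting step above could be sketched roughly as follows. This assumes the hmmscan results have already been parsed into `(contig_id, start, pfam_id)` tuples; that tuple layout, the function name `build_corpus_lines`, and the sorting by start position are assumptions for illustration, not the pipeline's actual code.

```python
from collections import defaultdict


def build_corpus_lines(hits):
    """Group hits by contig and emit one space-separated line per contig.

    `hits` is an iterable of (contig_id, start, pfam_id) tuples
    (a hypothetical layout for already-parsed hmmscan output).
    """
    by_contig = defaultdict(list)
    for contig_id, start, pfam_id in hits:
        by_contig[contig_id].append((start, pfam_id))

    lines = []
    for contig_id in sorted(by_contig):
        # Order domains by their position on the contig.
        ordered = sorted(by_contig[contig_id])
        lines.append(" ".join(pfam_id for _, pfam_id in ordered))
    return lines


# Placeholder IDs reused from the example above, in shuffled order.
hits = [
    ("contig_1", 900, "Pfam0789"),
    ("contig_1", 100, "Pfam0123"),
    ("contig_2", 50, "Pfam0987"),
    ("contig_1", 400, "Pfam0456"),
    ("contig_2", 700, "Pfam0654"),
]

corpus_lines = build_corpus_lines(hits)
print("\n".join(corpus_lines))
```

Each resulting line plays the role of a "sentence" and each pfam ID the role of a "word" when the file is passed to word2vec.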
Then you can use this script to generate the vector embedding using word2vec: https://github.com/Merck/bgc-pipeline/blob/main/bgc_detection/features/pfam2vec.py
That sounds like a huge workload. Maybe I can use pfam2vec.csv directly when training DeepBGC from scratch.
Hi Prihoda, I have a problem I would like to ask you about. I am trying to train DeepBGC from scratch. Following the "Readme.md" file, I downloaded the MIBiG.pfam.tsv and GeneSwap_Negatives.pfam.tsv files as the positive and negative samples. Then I used the following command to train DeepBGC:
But I got this error:
Why do we need to load "pfam2vec.csv" during training? I think this should be obtained from the trained model.