Does DeepBGC work with plants (Brassica spp.)?

Merck / deepbgc

BGC Detection and Classification Using Deep Learning

https://doi.org/10.1093/nar/gkz654

MIT License


Does DeepBGC work with plants (Brassica spp.)? #31

Closed anani-a-missinou closed 4 years ago

anani-a-missinou commented 4 years ago

I am a Ph.D. student working on metabolomics and specialized plant metabolic responses to pathogens.

Your deep learning approach to BGC detection is very innovative. I'd like to apply it to my plant data to predict biosynthetic gene clusters.

Does DeepBGC work with plant (Brassica spp.) data?

IMPORTANT: These plants contain approximately 6 copies of each gene as a result of duplication, which raises a large question of neofunctionalization and subfunctionalization.

Thank you for your reply.

prihoda commented 4 years ago

Hi @2AMissinou, we designed and evaluated DeepBGC only on bacteria and fungi, though there are a few BGCs coming from plants in the MIBiG training set.

If you prepare a GenBank file with annotated CDS regions, you can try running DeepBGC. I would be curious to know if you get any signal.

anani-a-missinou commented 4 years ago

Thank you, prihoda, for your reply. Okay, I want to try it on my data.

wget http://eddylab.org/software/hmmer/hmmer-3.1b2.tar.gz
tar xzf hmmer-3.1b2.tar.gz
cd hmmer-3.1b2
./configure CC=gcc LDFLAGS="-static" --prefix=/path/to/install/hmmer3
make
make check
sudo make install

You say "Install HMMER and put the hmmscan and hmmpress binaries on your PATH:". How can I do this? Please help me.

And after this, how can I run it with my annotated .gbk file (with CDS regions)? Or should I just use these commands:

deepbgc pipeline mySequence.fa
deepbgc pipeline --detector path/to/myDetector.pkl mySequence.fa

How is the .gbk file used during this run? Should I put it in a predefined folder in your program? If yes, which one?

Thank you for all your help.

prihoda commented 4 years ago

Can you use Python through BioConda instead? https://bioconda.github.io/user/install.html#install-conda. That would be an easier way to install the dependencies. After setting up bioconda, you can just run conda install deepbgc to download DeepBGC, hmmer and Prodigal automatically.

Then you can just run deepbgc download and deepbgc pipeline yourFile.gbk --output yourOutputFolder
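If conda is not an option, the manual route from the earlier comment comes down to adding HMMER's bin directory to PATH. The following is a minimal sketch, assuming HMMER was installed with the --prefix from the ./configure line quoted earlier, so that the binaries sit in <prefix>/bin. The name use_hmmer is a hypothetical helper, not part of HMMER or DeepBGC.

```shell
# Hypothetical helper (not part of HMMER or DeepBGC): prepend <prefix>/bin to
# PATH for the current shell session and report whether both binaries are found.
use_hmmer() {
  prefix="$1"
  case ":$PATH:" in
    *":$prefix/bin:"*) ;;                        # already on PATH, nothing to do
    *) PATH="$prefix/bin:$PATH"; export PATH ;;  # otherwise prepend it
  esac
  for tool in hmmscan hmmpress; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: $(command -v "$tool")"
    else
      echo "$tool: not found on PATH"
    fi
  done
}

# Usage (the path is the placeholder from the ./configure line, not a real one):
#   use_hmmer /path/to/install/hmmer3
```

The change only lasts for the current shell session. To keep it, the same PATH assignment would typically go into a shell startup file such as ~/.bashrc.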

anani-a-missinou commented 4 years ago

Dear Prihoda,

First of all, thank you for your support.

I managed to install DeepBGC on our cluster using Bioconda, as you recommended.

I tried to run it on my organism, which has 19 chromosomes. But it only processes the first chromosome, A01, reports 30 clusters for it, and then stops when it reaches chromosome A02.

I have attached a zipped copy of the output. Could you please help me avoid this error so that the run completes and, if possible, help me interpret the output?

Have a nice day, Anani M

output.zip

prihoda commented 4 years ago

Hi @2AMissinou, you'll have to add an ACCESSION field to each record in your input GenBank file so that DeepBGC can distinguish each chromosome:

LOCUS       chrA01              23267856 bp    dna     linear   UNK 01-JAN-1980
DEFINITION  .
ACCESSION   chrA01
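For a file with many records, the header change above could be scripted. The following is a minimal sketch, assuming each record's LOCUS name is unique and that it is acceptable to overwrite any existing ACCESSION value with that name. add_accession is a hypothetical helper, not a DeepBGC command.

```shell
# Hypothetical helper (not part of DeepBGC): set each record's ACCESSION to its
# LOCUS name. Assumption: LOCUS names are unique, and overwriting an existing
# ACCESSION value is acceptable.
add_accession() {
  awk '
    # New record: remember the LOCUS name, no ACCESSION written yet.
    /^LOCUS/     { name = $2; done = 0; print; next }
    # Existing ACCESSION line: replace it with the LOCUS name.
    /^ACCESSION/ { if (!done) { printf "ACCESSION   %s\n", name; done = 1 }; next }
    # No ACCESSION seen before the next header keyword: insert one first.
    /^(VERSION|KEYWORDS|SOURCE|FEATURES|ORIGIN)/ && !done {
                   printf "ACCESSION   %s\n", name; done = 1 }
    { print }
  ' "$1"
}

# Usage (file names are placeholders):
#   add_accession yourFile.gbk > yourFile.fixed.gbk
#   deepbgc pipeline yourFile.fixed.gbk --output yourOutputFolder
```

Writing to a new file rather than editing in place keeps the original annotation intact in case the LOCUS names turn out not to be unique.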

anani-a-missinou commented 4 years ago

Dear Prihoda,

After modifying the accessions as you recommended, DeepBGC worked well and generated a BGC compendium. Thank you for your support.

In total, it reported 901 clusters across the genome with a BGC score > 0.5.
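A per-chromosome count like this could be tallied from a tab-separated cluster table. The following is a minimal sketch, assuming the table has a header row with columns named sequence_id and deepbgc_score. Both column names and the helper name count_clusters are assumptions to check against the actual output files.

```shell
# Hypothetical helper: count clusters above a score threshold, per sequence.
# Assumption: the TSV has a header row with "sequence_id" and "deepbgc_score"
# columns; check these names against your own output before relying on this.
count_clusters() {
  # $1 = TSV file, $2 = minimum score (defaults to 0.5)
  awk -F '\t' -v min="${2:-0.5}" '
    NR == 1 {
      # Locate the two columns by header name rather than by position.
      for (i = 1; i <= NF; i++) {
        if ($i == "sequence_id")   seq = i
        if ($i == "deepbgc_score") score = i
      }
      if (!seq || !score) { bad = 1; exit 1 }
      next
    }
    ($score + 0) > (min + 0) { n[$seq]++; total++ }
    END {
      if (bad) exit 1
      for (s in n) printf "%s\t%d\n", s, n[s]
      printf "total\t%d\n", total
    }
  ' "$1"
}

# Usage (the file name is a placeholder):
#   count_clusters yourClusters.tsv 0.5
```

Looking up columns by name means the script exits with a non-zero status instead of silently counting the wrong field if the table layout differs from the assumption.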

The MIBiG training set contains only a few plant BGCs. Can this affect the calculated BGC score? Does your pipeline have any other interesting features?

THANK YOU. Anani :)

prihoda commented 4 years ago

Hi @2AMissinou, glad to hear that.

We haven't done any evaluation specifically on plants, so this is outside my expertise :)

A simple way to think about it is that the DeepBGC score of a given protein corresponds to the similarity of that protein to the BGC proteins found in the training set. Our notion of similarity comes from the pfam2vec method, which is a mathematical way to tell the model whether the protein contains a certain Pfam domain (a structural and functional unit) as well as whether it contains similar Pfam domains that are often found in different homologs. The LSTM then combines this information, taking the order and combination of these pfam2vec vectors into account.

anani-a-missinou commented 4 years ago

Dear Prihoda,

Thank you for your help and your explanations. I will explore these results by examining the clusters that respond in our different transcriptomes. I hope this will reveal an interesting aspect of metabolic responses in plant-biotic interactions.

Anani :)

prihoda commented 4 years ago

Great to hear that. Closing this issue now, let me know if you have any more questions.