Merck / deepbgc

BGC Detection and Classification Using Deep Learning
https://doi.org/10.1093/nar/gkz654
MIT License

using deepBGC with metagenomes #43

Open drelo opened 3 years ago

drelo commented 3 years ago

Dear users, can I use DeepBGC with metagenomic samples? The paper describing the software mentions it as a useful tool for this kind of data, but I don't know whether this is implemented in the current version. I ran a test with a sample (CPB-18), which is the scaffold file obtained from SPAdes, and it quickly returned 0 matches. I don't understand whether this is a matter of the format I used or something else. The same file returned several BGC matches with antiSMASH.

I noticed these lines while running it:

/mnt/ubi/andres/miniconda3/envs/deepbio/lib/python3.7/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.tree.tree module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.tree. Anything that cannot be imported from sklearn.tree is now part of the private API.
  warnings.warn(message, FutureWarning)
/mnt/ubi/andres/miniconda3/envs/deepbio/lib/python3.7/site-packages/sklearn/base.py:334: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.18.2 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
/mnt/ubi/andres/miniconda3/envs/deepbio/lib/python3.7/site-packages/sklearn/base.py:334: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.18.2 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.

Before that, I ran the BGC sample included in the test folder and obtained 2 hits, as seen in the attached log (BGC15 file).

Maybe I have a broken install of the program; I followed the conda instructions. Please find attached the log from deepbgc info too.
pipeinfo.txt

BGC15.txt sample.txt

Thanks for your help.

prihoda commented 3 years ago

Hi @drelo, yes, we are also using DeepBGC on metagenomic samples. Generally, the longer your sequences, the better. You can use --prodigal-meta-mode to run Prodigal in '-p meta' mode, which enables detecting more genes in short contigs.

The warnings should not be related. Can you check what you get in the output *.pfam.tsv file? Do you get any protein domains? There's also a deepbgc_score column that gives you a BGC probability for each protein domain.

If there are some protein domain hits, you can also run deepbgc with a lower --score threshold to change the BGC cutoff - you should be able to check evaluation/*.score.png to see which regions would become BGCs if you chose a lower threshold.
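The check suggested above can be sketched in Python. This is only a minimal sketch, assuming the *.pfam.tsv file is tab-separated with a header row and contains the deepbgc_score column mentioned in the comment. The function name `summarize_pfam_scores` and the candidate thresholds are illustrative choices, not part of DeepBGC.

```python
import csv


def summarize_pfam_scores(path, score_column="deepbgc_score", thresholds=(0.5, 0.3, 0.1)):
    """Count protein domain rows in a *.pfam.tsv file and how many of them
    reach each candidate score threshold (hypothetical helper)."""
    n_rows = 0
    scores = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            n_rows += 1
            value = row.get(score_column)
            if value in (None, ""):
                continue  # row has no score; count the domain but skip scoring
            scores.append(float(value))

    summary = {"domains": n_rows}
    for threshold in thresholds:
        # Number of domains whose BGC probability reaches this threshold
        summary[f">={threshold}"] = sum(score >= threshold for score in scores)
    return summary
```

If `domains` is 0, no protein domains were detected at all and lowering the score threshold will not help. If there are domains but few reach the default cutoff, the counts at lower thresholds give a rough idea of whether a lower --score is worth trying.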

drelo commented 3 years ago

Thanks for your reply. I got no .tsv file and no files in the evaluation folder. Do I need to process the multi-FASTA from SPAdes before providing it to DeepBGC? What seems odd is that the FASTA file is more than 600 Mb, yet it is processed really quickly. Thanks for your help!

prihoda commented 3 years ago

That sounds suspicious indeed. Can you try running deepbgc with the SPAdes contigs.fa file instead of the scaffolds file and adding the deepbgc --prodigal-meta-mode flag? If that still fails, it would be great if I could see one of the sequences in that FASTA file.
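For scripted runs over many samples, the suggested invocation could be wrapped as follows. This is a sketch, assuming the `deepbgc` executable is on the PATH and that detection is started with the `pipeline` subcommand. The --prodigal-meta-mode and --score flags are the ones named in this thread, while the helper names and the file path in the usage note are placeholders.

```python
import subprocess


def build_deepbgc_command(fasta_path, score=None, meta_mode=True):
    """Assemble the argument list for a `deepbgc pipeline` run (hypothetical helper)."""
    command = ["deepbgc", "pipeline"]
    if meta_mode:
        # Run Prodigal in '-p meta' mode, as suggested for short contigs
        command.append("--prodigal-meta-mode")
    if score is not None:
        # Optional lower BGC score cutoff
        command += ["--score", str(score)]
    command.append(str(fasta_path))
    return command


def run_deepbgc(fasta_path, **kwargs):
    """Run DeepBGC and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_deepbgc_command(fasta_path, **kwargs), check=True)
```

For example, `run_deepbgc("contigs.fasta", score=0.3)` would execute `deepbgc pipeline --prodigal-meta-mode --score 0.3 contigs.fasta`.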

drelo commented 3 years ago

I found the error: I am working on two clusters and by mistake I copied a file that had badly parsed FASTA headers. Now it is running smoothly.
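A quick sanity check of the FASTA file before running DeepBGC can catch this kind of problem early. The thread does not say what exactly was wrong with the headers, so the checks below are only generic ones (empty headers, duplicate IDs, records without sequence, text before the first header). The function name `check_fasta_headers` is hypothetical.

```python
from collections import Counter


def check_fasta_headers(lines):
    """Return a report of generic header problems in FASTA text.

    `lines` is any iterable of strings, e.g. an open file handle.
    """
    ids = Counter()
    empty_headers = 0
    records_without_sequence = []
    text_before_first_header = False
    current_id = None
    current_has_sequence = False
    seen_header = False

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close the previous record before starting a new one
            if seen_header and not current_has_sequence:
                records_without_sequence.append(current_id)
            seen_header = True
            current_has_sequence = False
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            if current_id:
                ids[current_id] += 1
            else:
                empty_headers += 1
        elif seen_header:
            current_has_sequence = True
        else:
            text_before_first_header = True

    # Close the last record
    if seen_header and not current_has_sequence:
        records_without_sequence.append(current_id)

    return {
        "records": sum(ids.values()) + empty_headers,
        "empty_headers": empty_headers,
        "duplicate_ids": sorted(i for i, n in ids.items() if n > 1),
        "records_without_sequence": records_without_sequence,
        "text_before_first_header": text_before_first_header,
    }
```

Calling it as `check_fasta_headers(open("scaffolds.fasta"))` (placeholder file name) returns a dictionary. A record count of 0 for a file of several hundred megabytes would be a strong hint that the headers were mangled.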

A follow-up question (let me know if I should open a new issue instead): is there a way to combine results from more than one sample (from a similar environment, nearby area, etc.)?

Thanks for your help

prihoda commented 3 years ago

Great. What exactly do you mean by combining results?

drelo commented 3 years ago

For example, I have 2 samples from the same place, collected at different times, that I would like to combine to get a glimpse of the diversity at that site (regardless of the temporal dimension). I would also like to combine several samples from similar environments ('pooling' urban or rural ones). The results are still accumulating, but I noticed that there is a TSV output, so I think I could just parse and merge those files.

Best,

Andres

prihoda commented 3 years ago

Exactly, you can merge the TSV files (there's a BGC-level TSV and a protein domain-level TSV) or the GenBank files.
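Merging the per-sample TSV files could look roughly like this. It is a sketch using only the Python standard library, assuming each file is tab-separated with a header row. The added `sample` column, the function name `merge_sample_tsvs`, and the file names in the usage note are illustrative assumptions, not DeepBGC conventions.

```python
import csv


def merge_sample_tsvs(sample_files, output_path, sample_column="sample"):
    """Concatenate per-sample TSV files, tagging each row with its sample name.

    `sample_files` maps a sample name to the path of that sample's TSV file.
    Columns missing from a file are left empty in the merged table.
    Returns the number of merged rows.
    """
    rows = []
    fieldnames = [sample_column]
    for sample, path in sample_files.items():
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            # Collect the union of all columns, preserving first-seen order
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                row[sample_column] = sample
                rows.append(row)

    with open(output_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=fieldnames,
            delimiter="\t",
            restval="",
            extrasaction="ignore",
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

For instance, `merge_sample_tsvs({"site1_2019": "a.tsv", "site1_2020": "b.tsv"}, "merged.tsv")` writes one table in which the `sample` column records where each row came from, so pooled results can still be split by sample later. The BGC-level and domain-level tables should be merged separately, since their columns differ.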

There's also a recent paper that introduces a method for visualizing BGCs called BGCViz, so you could give that a shot: https://github.com/pavlohrab/BGCViz or the web interface https://biopavlohrab.shinyapps.io/BGCViz/

prihoda commented 3 years ago

BGCViz is relevant if you are also analysing your samples with other tools like antiSMASH.

drelo commented 3 years ago

Thanks for all your help, I didn't know about BGCViz so that will be my next step in the exploration.

Cheers
