NBChub / bgcflow

Snakemake workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes)
https://github.com/NBChub/bgcflow/wiki
MIT License

MiBIG and BIGSCAPE error #355

Closed TanyaC505 closed 1 week ago

TanyaC505 commented 1 week ago

Any recommendations on possible solutions to the errors below, which occurred whilst running bgcflow, would be greatly appreciated. Thanks

Error in rule get_mibig_table:
    jobid: 173
    output: resources/mibig/json, resources/mibig/df_mibig_bgcs.csv
    log: logs/bigscape/get_mibig_table.log (check log file(s) for error details)
    conda-env: /home/mk-mica-minion/Desktop/Tanya/Rhodococcus/Thirdtry/bgcflow/.snakemake/conda/61b5332396c2d7bb1ce5092174e049db
    shell:

    (cd resources && wget  https://dl.secondarymetabolites.org/mibig/mibig_json_3.1.tar.gz) &>> logs/bigscape/get_mibig_table.log
    (cd resources && tar -xvf mibig_json_3.1.tar.gz && mkdir -p mibig && mv mibig_json_3.1/ mibig/json && rm mibig_json_3.1.tar.gz) &>> logs/bigscape/get_mibig_table.log
    python workflow/bgcflow/bgcflow/data/get_mibig_data.py resources/mibig/json resources/mibig/df_mibig_bgcs.csv 2>> logs/bigscape/get_mibig_table.log

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Error in rule bigscape:
    jobid: 170
    input: resources/BiG-SCAPE, data/interim/bgcs/Rhodococcus/Rhodococcus_antismash_7.1.0.csv, data/interim/bgcs/Rhodococcus/7.1.0
    output: data/interim/bigscape/Rhodococcus_antismash_7.1.0/index.html
    log: logs/bigscape/Rhodococcus_antismash_7.1.0/bigscape.log (check log file(s) for error details)
    conda-env: /home/mk-mica-minion/Desktop/Tanya/Rhodococcus/Thirdtry/bgcflow/.snakemake/conda/1cb315120f65f8ad51e3c6450bedf9ee
    shell:

    python resources/BiG-SCAPE/bigscape.py -i data/interim/bgcs/Rhodococcus/7.1.0 -o data/interim/bigscape/Rhodococcus_antismash_7.1.0/ -c 8 --cutoff 0.3 0.4 0.5 --include_singletons --label Rhodococcus_antismash_7.1.0 --hybrids-off --mibig --verbose &>> logs/bigscape/Rhodococcus_antismash_7.1.0/bigscape.log

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

matinnuhamunada commented 1 week ago

Hi @TanyaC505, thank you for using BGCFlow :)

Can you also share the log files here so we can debug the issue together? The log files can be found in these locations:

  • logs/bigscape/get_mibig_table.log
  • logs/bigscape/Rhodococcus_antismash_7.1.0/bigscape.log

TanyaC505 commented 1 week ago

Hi Matin,

Thanks so much for the help :)

Please see the attached logs.

Kind regards

Tanya

TanyaC505 commented 1 week ago

get_mibig_table.log bigscape.log

matinnuhamunada commented 1 week ago

I see. For your first issue, the MIBiG download, it seems that the previous download was interrupted halfway and the file is corrupted. I will improve the script to detect this automatically, but for now, try removing the MIBiG files in your resources folder and then running it again:

rm -rf /home/mk-mica-minion/Desktop/Tanya/Rhodococcus/Thirdtry/bgcflow/resources/mibig*
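
Before deleting and re-downloading, one way to confirm that an archive such as mibig_json_3.1.tar.gz really is truncated is to read it to the end without extracting anything. The sketch below is only an illustration: `tarball_is_intact` is a hypothetical helper, not part of BGCFlow, and it assumes the archive is an ordinary (optionally gzip-compressed) tar file.

```python
import tarfile
import zlib


def tarball_is_intact(path):
    """Return True if every member of the tar archive can be read to the end.

    Hypothetical helper (not part of BGCFlow). A download that was cut off
    halfway normally fails somewhere during this read-through.
    """
    try:
        with tarfile.open(path, "r:*") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                handle = archive.extractfile(member)
                # Read the member in 1 MiB chunks and discard the data.
                while handle.read(1 << 20):
                    pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        # Covers truncated gzip streams, broken tar headers, missing files.
        return False
```

For example, `tarball_is_intact("resources/mibig_json_3.1.tar.gz")` returning False would suggest that the archive should be removed and downloaded again.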

For the second issue, seems like there is an error with the libgfortran library:

  File "/home/mk-mica-minion/Desktop/Tanya/Rhodococcus/Thirdtry/bgcflow/.snakemake/conda/1cb315120f65f8ad51e3c6450bedf9ee_/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 30, in <module>
    from .pairwise_fast import _chi2_kernel_fast, _sparse_manhattan
ImportError: libgfortran.so.3: cannot open shared object file: No such file or directory

Can you tell me which Linux/Ubuntu version you are using? I cannot reproduce the issue you're facing, but I suggest checking your gcc installation and installing the latest version: https://github.com/NBChub/bgcflow/wiki/00-Installation-Guide#gcc-compiler

matinnuhamunada commented 1 week ago

Ah, actually you're right, there is an issue with the installation script for BiG-SCAPE. I found the same problem with the missing dependencies. I will check what can be done.

matinnuhamunada commented 1 week ago

I fixed the script by pinning libgfortran to version 3.0.0. I will run tests before incorporating the changes into the main branch. For now, you can try running the updated BGCFlow by following these steps:

git fetch # this will check if there are updates in the online repository
git pull # this will pull the update to the local directory
git checkout dev-1.1.1 # switch bgcflow from the main branch to the development branch

Then you should be able to run bgcflow normally. Let me know if this fixes your issue with bigscape.

PS: I will let you know when I have finished the tests. Then you can go back to the main branch with:

git fetch
git pull
git checkout main

TanyaC505 commented 1 week ago

Thank you so much. I incorporated the above as suggested, and there is a new error when running bgcflow:

LockException:
Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/home/mk-mica-minion/Desktop/Tanya/Rhodococcus/Thirdtry/bgcflow
If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.

matinnuhamunada commented 1 week ago

Ah, that is a common error when the previous run is terminated forcefully (see https://snakemake.readthedocs.io/en/stable/project_info/faq.html#id30).

As instructed in the message, you can unlock the snakemake directory with:

bgcflow run --unlock

Then you should be able to run bgcflow normally again.

You can see the full command by typing bgcflow run --help

TanyaC505 commented 1 week ago

Thank you so much for all your help - bgcflow & report building worked successfully!

TanyaC505 commented 1 week ago

Hi Matin, although bgcflow and the report ran successfully, for some reason not all the antismash results were incorporated into the big-scape network. For example, 218 BGCs from 14 genomes were detected using antismash, but BIG-SCAPE only incorporated 21 BGCs into the network and data processing. Do you have any recommendation on how I could fix this? Thank you.

matinnuhamunada commented 1 week ago

Can you send me the log files for the BIGSCAPE run?

TanyaC505 commented 1 week ago

bigscape_to_cytoscape-Rhodococcus-7.1.0.log copy_bigscape-Rhodococcus-7.1.0.log bigscape.log

matinnuhamunada commented 1 week ago

I see. Are you running your own genome sequences? What input file type are you using? Genbank or Fasta? You might want to make sure that your sequence accessions are unique.

I can see in the run log that for each genome, the sequence accessions are named chromosome00001 etc. What happened is that BiG-SCAPE detected 218 BGCs, but because some of them share the same names, like chromosome00001.region012, chromosome00001.region016, etc., BiG-SCAPE treated them as duplicates.

....
File data/interim/bigscape/Rhodococcus_antismash_7.1.0/cache/fasta/chromosome00001.region012.fasta already processed
  Adding chromosome00001.region012.gbk (50590 bps)
 File data/interim/bigscape/Rhodococcus_antismash_7.1.0/cache/fasta/chromosome00001.region016.fasta already processed
  Adding chromosome00001.region016.gbk (49828 bps)

 Starting with 218 files
 Files that had its sequence extracted: 22
...
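
To confirm this kind of collision before re-running anything, the accessions can be compared across the input FASTA files. The following is a minimal sketch: `find_shared_accessions` is a hypothetical helper, not part of BGCFlow or BiG-SCAPE, and it assumes plain-text FASTA files whose accession is the first whitespace-delimited token after `>`.

```python
from collections import defaultdict
from pathlib import Path


def find_shared_accessions(fasta_paths):
    """Map each accession found in more than one file to those file names.

    Hypothetical helper for illustration; it only looks at header lines.
    """
    seen = defaultdict(set)
    for path in map(Path, fasta_paths):
        with path.open() as handle:
            for line in handle:
                if line.startswith(">"):
                    fields = line[1:].split()
                    # The accession is the first token of the header line.
                    accession = fields[0] if fields else ""
                    seen[accession].add(path.name)
    return {acc: sorted(files) for acc, files in seen.items() if len(files) > 1}
```

Calling it on all genome files, for instance `find_shared_accessions(Path("my_genomes").glob("*.fasta"))` with `my_genomes` standing in for wherever the inputs live, should return an empty dictionary once every accession is unique.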

What I would suggest is to rename the sequence accessions in the FASTA (or GenBank) files so that they are unique. For example, if this is your original FASTA file:

rhodococcus_strain01.fasta

>chromosome00001
CGATGGTACA....
>chromosome00002

You can replace them by adding the genome ID to each sequence accession:

rhodococcus_strain01.fasta

>rhodococcus_strain01__chromosome00001
CGATGGTACA....
>rhodococcus_strain01__chromosome00002

Ideally, you will get this unique identifier when submitting your sequences to a repository such as NCBI. You can of course come up with any unique identifier.

This of course means that you need to re-run the whole workflow.
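
The renaming shown above could be scripted rather than done by hand. The sketch below is one possible approach under stated assumptions: the function names are hypothetical, the genome ID is taken from the file name without its extension, and `__` is used as the separator to match the example. Renamed copies are written to a separate directory so the original files stay untouched.

```python
from pathlib import Path


def prefix_fasta_headers(text, genome_id, sep="__"):
    """Prefix every FASTA header in `text` with the genome ID.

    Headers that already carry the prefix are left alone, so running the
    function twice does not add the prefix twice.
    """
    prefix = f"{genome_id}{sep}"
    lines = []
    for line in text.splitlines():
        if line.startswith(">") and not line.startswith(">" + prefix):
            line = ">" + prefix + line[1:]
        lines.append(line)
    return "".join(line + "\n" for line in lines)


def rename_fasta_file(path, out_dir):
    """Write a renamed copy of `path` into `out_dir` and return its path.

    Assumption: the file stem (e.g. rhodococcus_strain01) is the genome ID.
    """
    path = Path(path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / path.name
    target.write_text(prefix_fasta_headers(path.read_text(), path.stem))
    return target
```

For example, `rename_fasta_file("rhodococcus_strain01.fasta", "renamed")` would write `renamed/rhodococcus_strain01.fasta` with headers such as `>rhodococcus_strain01__chromosome00001`.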

I will add this to the FAQ list.

TanyaC505 commented 1 week ago

I have my own genome sequences that are in fasta format. I have renamed the sequence accessions within each file and rerun bgcflow. I am confident this will solve the issue. Thanks so much for your help!

matinnuhamunada commented 1 week ago

Sounds great! 👍

I suggest removing the previous data/interim folder (or even the whole data folder) before re-running, to make sure everything is correct.

matinnuhamunada commented 1 week ago

@TanyaC505 I've merged the update to the main branch (74666bb2730951b772b92733830c8643dc621963) so you can now switch back using git checkout main.

Thanks again for the feedback and for helping to improve BGCFlow!