blobtoolkit / pipeline

[Archived] SnakeMake pipeline to run BlobTools on public assemblies
https://blobtoolkit.genomehubs.org
MIT License

running pipeline locally with singularity (or docker) #4

Closed alxsimon closed 4 years ago

alxsimon commented 4 years ago

Hi, from the documentation, it seems possible to run the pipeline locally on a draft assembly using the docker container (either with docker itself or with singularity).

However, I have a hard time navigating all the configurations and requirements. Would it be possible to add an example of such a use case to the repository or documentation?

Thanks, Alexis

rjchallis commented 4 years ago

Hi Alexis

Thanks for the suggestion - I'll try to add a clear example to the docs next week.

rjchallis commented 4 years ago

I've added a new page to the Pipeline docs at blobtoolkit.genomehubs.org/pipeline/pipeline-tutorials/running-the-pipeline-in-a-container/ that hopefully makes things a bit clearer for the specific case of running a local assembly in Docker. Do let me know how you get on following this.

alxsimon commented 4 years ago

Thank you very much, it is now very clear how the different layers of tools interact.

I encountered a first problem that was easy to solve: the snakemake command needs the option -s /blobtoolkit/insdc-pipeline/Snakefile. Otherwise Snakemake complains that it cannot find the Snakefile.

rjchallis commented 4 years ago

Thanks for spotting that - I've added the option to the docs

alxsimon commented 4 years ago

I don't think this is related to the containerized execution (unless the pipeline version in the container is too old), but I get an error when fetching the NCBI BLAST db: No matches on pattern 'nt_v5.??.tar.gz'

Looking in ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/, I indeed don't find any nt_v5 file. Maybe the pattern should be changed to nt.??.tar.gz?

rjchallis commented 4 years ago

It should definitely be nt.??.tar.gz. The pipeline used to have to look for nt_v5 before blast made it the default and decided not to keep an alias. It should only do this if nt_v5 is used as the db name in the config. I thought I'd removed the instances of nt_v5 in the docs but just found a few that had slipped through.
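
To illustrate why the old pattern finds nothing, here is a minimal sketch using Python's standard `fnmatch` module. The file names in `listing` are hypothetical examples in the style of the split nt archives, not a real directory listing, and the pipeline's own download code may work differently.

```python
from fnmatch import fnmatch

# Hypothetical file names, for illustration only (not a real FTP listing).
listing = ["nt.00.tar.gz", "nt.01.tar.gz", "nr.00.tar.gz", "nt.00.tar.gz.md5"]


def matching(pattern, names):
    """Return the names that match a shell-style glob pattern."""
    return [name for name in names if fnmatch(name, pattern)]


# Each '?' matches exactly one character, so '??' covers the two-digit volume number.
old_hits = matching("nt_v5.??.tar.gz", listing)
new_hits = matching("nt.??.tar.gz", listing)

print(old_hits)
print(new_hits)
```

With no file carrying the nt_v5 prefix, the old pattern matches nothing, while nt.??.tar.gz picks up the nt volumes and skips the nr and .md5 files.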

alxsimon commented 4 years ago

Hi, the pipeline was running smoothly until I got an error in the blobtoolkit_create rule. Snakemake reports a missing output file (even when increasing --latency-wait).

I am wondering if the shell command should be blobtools create instead of blobtools replace?

alxsimon commented 4 years ago

In fact the error is as follows in the log of the blobtools_create rule:

Traceback (most recent call last):
  File "/blobtoolkit/blobtools2/lib/add.py", line 65, in <module>
    import blob_db
  File "/blobtoolkit/blobtools2/lib/blob_db.py", line 10, in <module>
    import cov
  File "/blobtoolkit/blobtools2/lib/cov.py", line 15, in <module>
    import pysam
  File "/home/blobtoolkit/miniconda3/envs/btk_env/lib/python3.7/site-packages/pysam/__init__.py", line 5, in <module>
    from pysam.libchtslib import *
ModuleNotFoundError: No module named 'pysam.libchtslib'

If I create the conda env from the blobtools2.yaml, I can import the faulty module correctly, so I don't know where this comes from.

alxsimon commented 4 years ago

The error above occurred when using singularity as the container manager. However, after switching back to docker I get a completely different error:

Loading sequences from Gallo_Med_v1.fasta
Traceback (most recent call last):
  File "/blobtoolkit/blobtools2/lib/add.py", line 153, in <module>
    main()
  File "/blobtoolkit/blobtools2/lib/add.py", line 120, in main
    meta=meta)
  File "/blobtoolkit/blobtools2/lib/fasta.py", line 56, in parse
    _gc_portions[seq_id], _n_counts[seq_id] = base_composition(seq_str)
  File "/blobtoolkit/blobtools2/lib/fasta.py", line 29, in base_composition
    gc_portion = float("%.4f" % (gc_count / acgt_count))
ZeroDivisionError: division by zero

EDIT: sorry did not see issue #7 of blobtools2

rjchallis commented 4 years ago

Hi - sorry not to have been active on this over the last week. Thanks for pasting in the errors from the log files.

The Docker error looks like it is due to a sequence with no ACGT bases - I'll need to add some code to catch this and print a warning. Could you check your assembly for contigs with only Ns to confirm this? EDIT: sorry, did not see your edit above.
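
As a rough sketch of what such a guard could look like: the function below is not the actual blobtools2 code. Only the name `base_composition`, the two return values and the GC formula are taken from the traceback above. How bases are counted, the wording of the warning, and returning 0.0 for a sequence with no ACGT bases are all assumptions made for illustration.

```python
import sys


def base_composition(seq_str):
    """Return (gc_portion, n_count) for a sequence string.

    Hypothetical reimplementation for illustration only.
    """
    seq = seq_str.upper()
    gc_count = seq.count("G") + seq.count("C")
    acgt_count = gc_count + seq.count("A") + seq.count("T")
    n_count = seq.count("N")  # assumption: only N is reported as ambiguous
    if acgt_count == 0:
        # Guard against the ZeroDivisionError raised for N-only sequences.
        print("WARNING: sequence has no ACGT bases, setting GC to 0", file=sys.stderr)
        return 0.0, n_count  # assumption: 0.0 is an acceptable placeholder
    gc_portion = float("%.4f" % (gc_count / acgt_count))
    return gc_portion, n_count


print(base_composition("ACGTNN"))
print(base_composition("NNNN"))
```

Whether a placeholder value or skipping the sequence entirely is the better behaviour depends on how the downstream fields are used, so this choice would need checking against the real code.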

As for singularity - I'm not sure why it is not finding the module. Sometimes Docker images don't behave as expected with singularity so I expect I will have to make a specific singularity image rather than relying on the Docker one.

alxsimon commented 4 years ago

Thanks and no worries, I went back to it only today myself.

Indeed there were some N-only sequences, which I have now removed from the reference. I relaunched the pipeline with docker and will see if it finishes.
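
For anyone hitting the same error, one way to find and drop such sequences is sketched below in plain Python. This is not necessarily how the sequences were removed here; the helper names `read_fasta`, `has_acgt` and `split_records` are made up for this example, and the records in `example` are toy data.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def has_acgt(seq):
    """True if the sequence contains at least one A, C, G or T (any case)."""
    return any(base in "ACGT" for base in seq.upper())


def split_records(lines):
    """Separate records with ACGT bases from those without any."""
    kept, dropped = [], []
    for header, seq in read_fasta(lines):
        (kept if has_acgt(seq) else dropped).append((header, seq))
    return kept, dropped


# Toy input; for a real assembly, pass an open file handle instead of a list.
example = [">contig1\n", "ACGTNN\n", "NNAC\n", ">contig2\n", "NNNNNN\n",
           ">contig3\n", "nnnn\n", ">contig4\n", "acgt\n"]
kept, dropped = split_records(example)
print([header for header, _ in kept])
print([header for header, _ in dropped])
```

On a real file the same function could be called as `split_records(open("Gallo_Med_v1.fasta"))`, writing the kept records back out to a new FASTA file before rerunning the pipeline.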

I tried to use singularity because I know it way better than docker, but I guess having a docker only image is fine. (Well in fact there is another reason I wanted to use singularity, which is I plan to include the insdc pipeline into a bigger snakemake pipeline, I don't know if this will work in the end but I thought using singularity would simplify the compatibility.)

alxsimon commented 4 years ago

It finished OK when using docker.