Hello - I am trying to run this on ~110 samples on a cluster computing system where I do not have root access. I wrote a Snakemake pipeline to run this on all samples iteratively, but I cannot get Singularity to run within Snakemake to load the container; however, when I run this interactively on single samples, I can load the container. Is there a way to get Singularity to work within Snakemake?
Also - is there a way to run this on all samples at once and merge the TE IDs into a single output?
Hello Cory!
Unfortunately, I don't have experience with Snakemake... Can you tell me what scheduler (if any) you are using on your computing platform?
In my case (SLURM), I do the following:
For each run, I generate and submit an `sbatch` file that looks like this:
```bash
module load singularity
singularity exec --bind ~/Project:/mnt ~/dnaPipeTE/dnapipete.img /mnt/dnaPipeTE_cmd.sh
```
`~/Project` is my local directory that contains the reads and will host the outputs.
The `dnaPipeTE_cmd.sh` file, which contains the actual dnaPipeTE commands, looks like:
```bash
#!/bin/bash
cd /opt/dnaPipeTE
python3 dnaPipeTE.py -input /mnt/reads_input.fastq -output /mnt/output -RM_lib ../RepeatMasker/Libraries/RepeatMasker.lib -genome_size 170000000 -genome_coverage 0.1 -sample_number 2 -RM_t 0.2 -cpu 2
```
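To scale this to many runs, here is a minimal sketch of a submission loop (assuming one fastq per sample under `~/Project/reads/` and one matching command script per sample; the paths and names are illustrative, not part of dnaPipeTE itself):

```bash
# write and submit one sbatch file per sample (illustrative layout)
for fq in ~/Project/reads/*.fastq; do
  s=$(basename "$fq" .fastq)
  cat > "run_${s}.sbatch" <<EOF
#!/bin/bash
module load singularity
singularity exec --bind ~/Project:/mnt ~/dnaPipeTE/dnapipete.img /mnt/dnaPipeTE_cmd_${s}.sh
EOF
  sbatch "run_${s}.sbatch"
done
```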
Currently there isn't a designated way to run samples in batch with dnaPipeTE. This is a good suggestion though, and I will consider including it in the ongoing development of dnaPipeTE2.
Regarding how to merge the TE IDs: the best way is to cluster the sequences assembled in `Trinity.fasta` across samples, for example with `cd-hit-est`. However, with 100+ samples this can quickly become computationally intense, as dnaPipeTE outputs a LOT of these contigs per run. You could first keep only the contigs with enough read support (based on the quantifications in `reads_per_components_and_annotation`) and then cluster.
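As a minimal sketch of that clustering step (the `sample_*` folder layout, identity threshold, and resource settings are assumptions to adapt to your data):

```bash
# concatenate per-sample assemblies, prefixing each contig with its sample ID;
# you may first filter each Trinity.fasta by read support using the
# quantifications in reads_per_components_and_annotation
for s in sample_*; do
  sed "s/^>/>${s}_/" "${s}/Trinity.fasta" >> all_samples_Trinity.fasta
done
# cluster at 80% identity (-n 5 is the word size recommended for -c 0.80-0.85)
cd-hit-est -i all_samples_Trinity.fasta -o clustered_TEs.fasta -c 0.8 -n 5 -T 8 -M 16000
```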
Alternatively (and if you are dealing with samples from the same species), you could merge all your input fastq files and run dnaPipeTE on the result. This way it will assemble the TEs overall, and you can then use that assembly as a custom library in individual runs to obtain per-sample quantifications.
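For example (a sketch only; the file names, and pointing `-RM_lib` at the merged run's `Trinity.fasta`, are assumptions, with the other parameters copied from the example above):

```bash
# 1. pool the reads and run dnaPipeTE once on the merged input
cat /mnt/sample_*.fastq > /mnt/merged_input.fastq
python3 dnaPipeTE.py -input /mnt/merged_input.fastq -output /mnt/merged_run \
    -genome_size 170000000 -genome_coverage 0.1 -sample_number 2 -RM_t 0.2 -cpu 2
# 2. quantify each sample against the merged assembly used as a custom library
python3 dnaPipeTE.py -input /mnt/sample_01.fastq -output /mnt/sample_01_run \
    -RM_lib /mnt/merged_run/Trinity.fasta \
    -genome_size 170000000 -genome_coverage 0.1 -sample_number 2 -RM_t 0.2 -cpu 2
```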
Please let me know if you'd like to discuss these different possibilities in more detail!
Cheers,
Clément
Hi Clément,
I think your idea of combining the files and running dnaPipeTE on the larger input is the approach I need at this step of my analysis, instead of trying to run each individual input with Snakemake. Thank you for the information!
Cory
Great! Please let me know how it goes!
Cheers,
Clément
Hi,
I tried running this on the cluster with the commands you described above, with the combined input files compressed using gzip, and received the following error message, although it seems to run OK when I do it interactively. Any thoughts?
```
Traceback (most recent call last):
  File "dnaPipeTE.py", line 695, in <module>
```
Hello Cory,
Can you share with me the commands you used as well as the whole log? Just to be sure, did the run complete successfully in interactive mode? (can you also share the commands for the interactive run?)
Thanks!
Clément
I created two scripts: the first submits the job to the cluster, and the second is called by the first to actually run dnaPipeTE (you can find them in the next two comments).
```bash
module load singularity
singularity exec --bind /scratch/ch943/downloadSRA /scratch/ch943/downloadSRA/dnapipete.img /scratch/ch943/downloadSRA/runDnaPipeTE-2.sh
```
```bash
cd /scratch/ch943/downloadSRA/
cd /opt/dnaPipeTE
python3 dnaPipeTE.py -input /scratch/ch943/downloadSRA/input.fastq.gz -output /scratch/ch943/downloadSRA/dnaPipeTE-output/test -RM_lib ../RepeatMasker/Libraries/RepeatMasker.lib -genome_size 250000000 -genome_coverage 5 -sample_number 2 -RM_t 0.2
```
The only other output I got was from dnaPipeTE saying that the program detected .gz compression. When I ran it interactively I did not have enough time to let it run to completion, but it at least made it through counting the reads, which is further than this SLURM attempt seemed to achieve. The interactive run mirrors what is in the second script I provided above.
Hello Cory!
I made a few small changes to your script:

1. The `--bind` option was incomplete, as it needs to know which folder in the container will be mounted with the directory on the server (`--bind /scratch/ch943/downloadSRA:/mnt`). By default, I use `/mnt` in the container. From now on, `/mnt` will be synonymous with `/scratch/ch943/downloadSRA` for any command executed in the container.

```bash
#!/bin/bash
#SBATCH --partition=main # Partition (job queue)
#SBATCH --requeue # Return job to the queue if preempted
#SBATCH --job-name=runTE # Assign a short name to your job
#SBATCH --nodes=1 # Number of nodes you require
#SBATCH --ntasks=1 # Total # of tasks across all nodes
#SBATCH --cpus-per-task=8 # Cores per task (>1 if multithread tasks)
#SBATCH --mem=24000 # Real memory (RAM) required (MB)
#SBATCH --time=24:00:00 # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm-runTE.%N.%j.out # STDOUT output file
#SBATCH --error=slurm-runTE.%N.%j.err # STDERR output file (optional)
module load singularity
singularity exec --bind /scratch/ch943/downloadSRA:/mnt /scratch/ch943/downloadSRA/dnapipete.img /mnt/runDnaPipeTE-2.sh
```

2. I removed the first `cd`, as it was pointing to your server's directory, and there is no need to `cd` into this folder, as it will be accessible directly at `/mnt`. I further replaced the paths to the server's directory with `/mnt` for the input/output. Note that `/scratch/ch943/downloadSRA/dnaPipeTE-output/` must exist, so that `test` can be created within it.

```bash
#!/bin/bash
cd /opt/dnaPipeTE
python3 dnaPipeTE.py -input /mnt/input.fastq.gz -output /mnt/dnaPipeTE-output/test -RM_lib ../RepeatMasker/Libraries/RepeatMasker.lib -genome_size 250000000 -genome_coverage 5 -sample_number 2 -RM_t 0.2
```
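To create the parent output directory ahead of the run (path as in your script):

```bash
mkdir -p /scratch/ch943/downloadSRA/dnaPipeTE-output
```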
Let me know if it helps!
Cheers,
Clément
Hi, I made the changes you suggested, but it still failed at the same place, with the following output.
```
Traceback (most recent call last):
  File "dnaPipeTE.py", line 695, in <module>
```
Ouch... Sorry... Could you try with a non-gz-compressed input? I am worried `bgzip` may not be configured properly in the container! 😨
If that's the case, I'll then fix it ASAP!
Cheers,
Clém
Let me give it a shot! I used .gz because of space issues on our cluster, but I think I can make it work...
Thanks so much!
Please re-open if the issue persists with v.1.4.c -- Best, Clément