clemgoub / dnaPipeTE

dnaPipeTE (de-novo assembly & annotation Pipeline for Transposable Elements) is a pipeline designed to find, annotate, and quantify Transposable Elements in small samples of NGS datasets. It is very useful for quantifying the proportion of TEs in newly sequenced genomes, since it does not require genome assembly and works on small datasets (< 1X).

Singularity issue #65

Closed cahende closed 2 years ago

cahende commented 2 years ago

Hello - I am trying to run this on ~110 samples on a cluster computing system where I do not have root access. I wrote a snakemake pipeline to run this on all samples iteratively, but I cannot get singularity to load the container from within snakemake; when I run it interactively on single samples, the container loads fine. Is there a way to get singularity to work within snakemake?

Also - is there a way to run this on all samples at once and merge the TE IDs into a single output?

clemgoub commented 2 years ago

Hello Cory!

Unfortunately, I don't have experience with Snakemake... Can you tell me what scheduler (if any) you are using on your computing platform?

In my case (slurm) I do the following:

For each run, I generate and submit an sbatch file that looks like this:

module load singularity
singularity exec --bind ~/Project:/mnt ~/dnaPipeTE/dnapipete.img /mnt/dnaPipeTE_cmd.sh

~/Project is my local directory; it contains the reads and will host the outputs.

The dnaPipeTE_cmd.sh file, which contains the actual dnaPipeTE commands, looks like this:

#! /bin/bash 
cd /opt/dnaPipeTE 
python3 dnaPipeTE.py -input /mnt/reads_input.fastq -output /mnt/output -RM_lib ../RepeatMasker/Libraries/RepeatMasker.lib -genome_size 170000000 -genome_coverage 0.1 -sample_number 2 -RM_t 0.2 -cpu 2 
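
If you end up scripting this, a loop along these lines could generate and submit one such pair of files per sample. This is only a rough sketch: the reads directory, sample naming, output paths, and resource values are placeholders to adapt, not part of dnaPipeTE itself.

#!/bin/bash
# Rough sketch: one dnaPipeTE job per sample (paths and resources are placeholders)
for fq in ~/Project/reads/*.fastq; do
    sample=$(basename "$fq" .fastq)

    # Per-sample command file, executed inside the container (paths are /mnt-relative)
    cat > ~/Project/cmd_${sample}.sh <<EOF
#!/bin/bash
cd /opt/dnaPipeTE
python3 dnaPipeTE.py -input /mnt/reads/${sample}.fastq -output /mnt/output_${sample} \
  -RM_lib ../RepeatMasker/Libraries/RepeatMasker.lib -genome_size 170000000 \
  -genome_coverage 0.1 -sample_number 2 -RM_t 0.2 -cpu 2
EOF
    chmod +x ~/Project/cmd_${sample}.sh

    # Per-sample sbatch file, submitted from the host
    cat > run_${sample}.sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=dnaPipeTE_${sample}
#SBATCH --cpus-per-task=2
#SBATCH --mem=24000
#SBATCH --time=24:00:00
module load singularity
singularity exec --bind ~/Project:/mnt ~/dnaPipeTE/dnapipete.img /mnt/cmd_${sample}.sh
EOF
    sbatch run_${sample}.sbatch
done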

Currently there isn't a designated way to run samples in batch with dnaPipeTE. This is a good suggestion, though, and I will consider including it in the ongoing development of dnaPipeTE2.

Regarding how to merge the TE IDs: the best way is to cluster the sequences assembled in Trinity.fasta across samples, for example with cd-hit-est. However, with 100+ samples this can quickly become computationally intense, as dnaPipeTE outputs a LOT of these contigs per run. You could first keep only the contigs with enough read support (based on the quantifications in reads_per_components_and_annotation) and then cluster, as in the sketch below.
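
A rough sketch of that filter-then-cluster idea follows. The column positions in reads_per_components_and_annotation, the file locations within each run's output, and the use of seqtk are assumptions to check against your actual outputs.

#!/bin/bash
# Rough sketch: keep well-supported contigs per run, then cluster across runs.
MIN_READS=50   # arbitrary read-support threshold, to tune
for run in output_sample_*; do
    # ASSUMPTION: read count in column 1, contig name in column 3 --
    # inspect reads_per_components_and_annotation and adjust paths/columns.
    awk -v min="$MIN_READS" '$1 >= min {print $3}' \
        "$run/reads_per_components_and_annotation" > "$run/keep.ids"
    # seqtk is assumed to be available; any FASTA ID extractor works here
    seqtk subseq "$run/Trinity.fasta" "$run/keep.ids" > "$run/Trinity.filtered.fasta"
done
cat output_sample_*/Trinity.filtered.fasta > all_samples.fasta
# Cluster at 80% identity; cd-hit-est word size -n 5 matches -c 0.80-0.85
cd-hit-est -i all_samples.fasta -o all_samples.clustered.fasta -c 0.8 -n 5 -T 8 -M 16000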

Alternatively (and if your samples come from the same species), you could merge all your input fastq files and run dnaPipeTE on the result. This way it will assemble the TEs across all samples, and you can then use that assembly as a custom library in individual runs to obtain per-sample quantifications.
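
A sketch of that two-step approach, run from inside the container: paths, genome size, and coverage values are placeholders, and it assumes the pooled run's Trinity.fasta sits at the top of its output folder.

#!/bin/bash
# Rough sketch: one assembly from pooled reads, then per-sample quantification.
cat /mnt/reads/sample_*.fastq > /mnt/all_reads.fastq
cd /opt/dnaPipeTE

# 1) Assemble TEs from the pooled reads
python3 dnaPipeTE.py -input /mnt/all_reads.fastq -output /mnt/output_pooled \
  -genome_size 170000000 -genome_coverage 0.25 -sample_number 2 -cpu 2

# 2) Re-run each sample with the pooled assembly as a custom library (-RM_lib)
for fq in /mnt/reads/sample_*.fastq; do
    sample=$(basename "$fq" .fastq)
    python3 dnaPipeTE.py -input "$fq" -output /mnt/output_${sample} \
      -RM_lib /mnt/output_pooled/Trinity.fasta \
      -genome_size 170000000 -genome_coverage 0.1 -sample_number 2 -cpu 2
done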

Please let me know if you'd like to discuss these different possibilities in more detail!

Cheers,

Clément

cahende commented 2 years ago

Hi Clément,

I think your idea of combining the files and running dnaPipeTE on the larger input is the approach I need at this step of my analysis, instead of trying to run each individual input with snakemake. Thank you for the information!

Cory

clemgoub commented 2 years ago

Great! Please let me know how it goes!

Cheers,

Clément

cahende commented 2 years ago

Hi,

I tried running this on the cluster with the commands you described above, with the combined input files compressed using gzip, and received the following error message, although it seems to run OK when I do it interactively. Any thoughts?

Traceback (most recent call last):
  File "dnaPipeTE.py", line 695, in <module>
    Sampler = FastqSamplerToFasta(args.input_file, args.sample_size, args.genome_size, args.genome_coverage, args.sample_number, args.output_folder, False)
  File "dnaPipeTE.py", line 141, in __init__
    self.sampling_files()
  File "dnaPipeTE.py", line 294, in sampling_files
    self.get_sampled_id(self.fastq_R1)
  File "dnaPipeTE.py", line 161, in get_sampled_id
    for line in file1:
  File "/usr/lib/python3.5/gzip.py", line 287, in read1
    return self._buffer.read1(size)
  File "/usr/lib/python3.5/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.5/gzip.py", line 452, in read
    self._read_eof()
  File "/usr/lib/python3.5/gzip.py", line 499, in _read_eof
    hex(self._crc)))
OSError: CRC check failed 0x2664c8e5 != 0xfd5b61bd

clemgoub commented 2 years ago

Hello Cory,

Can you share with me the commands you used as well as the whole log? Just to be sure, did the run complete successfully in interactive mode? (can you also share the commands for the interactive run?)
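
In the meantime, a quick way to check whether the gzip archive itself is intact (a generic gzip check, independent of dnaPipeTE; adjust the path to your input file):

# gzip -t decompresses in memory and verifies the CRC without writing output
gzip -t input.fastq.gz && echo "archive OK"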

Thanks!

Clément

cahende commented 2 years ago

I created two scripts: the first is submitted to the cluster, and the second is called by the first to actually run dnaPipeTE (you can find them in the following two comments).

cahende commented 2 years ago

#!/bin/bash
#SBATCH --partition=main # Partition (job queue)
#SBATCH --requeue # Return job to the queue if preempted
#SBATCH --job-name=runTE # Assign a short name to your job
#SBATCH --nodes=1 # Number of nodes you require
#SBATCH --ntasks=1 # Total # of tasks across all nodes
#SBATCH --cpus-per-task=8 # Cores per task (>1 if multithread tasks)
#SBATCH --mem=24000 # Real memory (RAM) required (MB)
#SBATCH --time=24:00:00 # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm-runTE.%N.%j.out # STDOUT output file
#SBATCH --error=slurm-runTE.%N.%j.err # STDERR output file (optional)

module load singularity
singularity exec --bind /scratch/ch943/downloadSRA /scratch/ch943/downloadSRA/dnapipete.img /scratch/ch943/downloadSRA/runDnaPipeTE-2.sh

cahende commented 2 years ago

#!/bin/bash

cd /scratch/ch943/downloadSRA/
cd /opt/dnaPipeTE
python3 dnaPipeTE.py -input /scratch/ch943/downloadSRA/input.fastq.gz -output /scratch/ch943/downloadSRA/dnaPipeTE-output/test -RM_lib ../RepeatMasker/Libraries/RepeatMasker.lib -genome_size 250000000 -genome_coverage 5 -sample_number 2 -RM_t 0.2

cahende commented 2 years ago

The only other output I got was from dnaPipeTE saying that the program detected .gz compression. When I ran it interactively I did not have enough time to let it run to completion, but it at least made it through counting the reads, which is further than this slurm attempt got. The interactive run mirrors what is in the second script I provided above.

clemgoub commented 2 years ago

Hello Cory!

I made a few small changes to your scripts: mainly, the local directory is now bound to /mnt inside the container (--bind /scratch/ch943/downloadSRA:/mnt), and the input/output paths point to /mnt accordingly:

#!/bin/bash
#SBATCH --partition=main # Partition (job queue)
#SBATCH --requeue # Return job to the queue if preempted
#SBATCH --job-name=runTE # Assign a short name to your job
#SBATCH --nodes=1 # Number of nodes you require
#SBATCH --ntasks=1 # Total # of tasks across all nodes
#SBATCH --cpus-per-task=8 # Cores per task (>1 if multithread tasks)
#SBATCH --mem=24000 # Real memory (RAM) required (MB)
#SBATCH --time=24:00:00 # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm-runTE.%N.%j.out # STDOUT output file
#SBATCH --error=slurm-runTE.%N.%j.err # STDERR output file (optional)

module load singularity
singularity exec --bind /scratch/ch943/downloadSRA:/mnt /scratch/ch943/downloadSRA/dnapipete.img /mnt/runDnaPipeTE-2.sh

And the runDnaPipeTE-2.sh script becomes:

#!/bin/bash
cd /opt/dnaPipeTE
python3 dnaPipeTE.py -input /mnt/input.fastq.gz -output /mnt/dnaPipeTE-output/test -RM_lib ../RepeatMasker/Libraries/RepeatMasker.lib -genome_size 250000000 -genome_coverage 5 -sample_number 2 -RM_t 0.2

Let me know if it helps!

Cheers,

Clément

cahende commented 2 years ago

Hi, I made the changes you suggested, but it still failed at the same place, with the following output.

Traceback (most recent call last):
  File "dnaPipeTE.py", line 695, in <module>
    Sampler = FastqSamplerToFasta(args.input_file, args.sample_size, args.genome_size, args.genome_coverage, args.sample_number, args.output_folder, False)
  File "dnaPipeTE.py", line 141, in __init__
    self.sampling_files()
  File "dnaPipeTE.py", line 294, in sampling_files
    self.get_sampled_id(self.fastq_R1)
  File "dnaPipeTE.py", line 161, in get_sampled_id
    for line in file1:
  File "/usr/lib/python3.5/gzip.py", line 287, in read1
    return self._buffer.read1(size)
  File "/usr/lib/python3.5/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.5/gzip.py", line 452, in read
    self._read_eof()
  File "/usr/lib/python3.5/gzip.py", line 499, in _read_eof
    hex(self._crc)))
OSError: CRC check failed 0x2664c8e5 != 0xfd5b61bd

clemgoub commented 2 years ago

Ouch... sorry! Could you try with a non-gz-compressed input? I am worried bgzip may not be configured properly in the container! 😨 If that's the case, I'll fix it ASAP!
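
If disk space allows, something along these lines would produce a plain-fastq copy to test with (paths follow your scripts above). Note that if the .gz itself is corrupt, zcat will fail with the same CRC error, which would point to the file rather than the container:

# Write an uncompressed copy next to the original and point dnaPipeTE at it
zcat /scratch/ch943/downloadSRA/input.fastq.gz > /scratch/ch943/downloadSRA/input.fastq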

Cheers,

Clém

cahende commented 2 years ago

Let me give it a shot! I used .gz because of space issues on our cluster, but I think I can make it work...

Thanks so much!


clemgoub commented 2 years ago

Please re-open if the issue persists with v.1.4.c -- Best, Clément