MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License
88 stars 29 forks source link

Counts.txt file not created when using .bam files as input #150

Closed mikrzol closed 3 months ago

mikrzol commented 4 months ago

I am trying to use ShortStack for analyzing a large sRNA-seq experiment (132 samples). The alignment part takes a very long time (almost a full week) on a supercomputer server I am working on, despite specifying 12 threads, 20 GB RAM per thread.

The organism is barley (Hordeum vulgare). I use the toplevel genome without any masking and added mitochondrion, nonchromosomal and chloroplast sequences, which makes the reference rather large, but I got poor results when using a masked genome.

Since I can only submit a job for max a week on the server, I only managed to get the .bam files. I thought I could use them for further analysis by using the --bamfiles rather than --readfiles option to get the results and it works fine, but I don't get the Counts.txt file I would need for DE Seq analysis.

Is this an expected limitation when using .bam files as input?

MikeAxtell commented 4 months ago

Thanks Michał for the feedback and questions.

About slow alignments: The large number of libraries and the highly repetitive nature of the barley genome are the sources of the slowness. ShortStack's alignments take care to make best guesses at where to place multi-mapped reads (see Johnson et al. G3). But that care comes at the cost of extra time, especially when there are a lot of repeats. You can make it easier in at least two ways:

  1. Run multiple --align_only jobs with just small numbers of libraries at a time. Then combine all of the resulting BAM files.

  2. Run in --mmap -r mode ... just guess at where to drop multi-mapped reads. Will be much faster, at the cost of accuracy.

  3. Align with another tool (like sensible bowtie defaults) and then merge the BAMs before running ShortStack.

About the slow performance with a single BAM file: With 128 libraries and a giant, repetitive genome, I'm not surprised that it is slow. You may want to call cluster using ShortStack with just a subset of the libraries, instead of all 128.

If your input BAM file has read groups specified, yes, you should receive a Counts.txt file.

Hope this helps.

From: Michał Stanoch @.> Date: Tuesday, April 23, 2024 at 5:47 AM To: MikeAxtell/ShortStack @.> Cc: Subscribed @.***> Subject: [MikeAxtell/ShortStack] Counts.txt file not created when using .bam files as input (Issue #150)

I am trying to use ShortStack for analyzing a large sRNA-seq experiment (132 samples). The alignment part takes a very long time (almost a full week) on a supercomputer server I am working on, despite specifying 12 threads, 20 GB RAM per thread.

The organism is barley (Hordeum vulgare). I use the toplevel genome without any masking and added mitochondrion, nonchromosomal and chloroplast sequences, which makes the reference rather large, but I got poor results when using a masked genome.

Since I can only submit a job for max a week on the server, I only managed to get the .bam files. I thought I could use them for further analysis by using the --bamfiles rather than --readfiles option to get the results and it works fine, but I don't get the Counts.txt file I would need for DE Seq analysis.

Is this an expected limitation when using .bam files as input?

— Reply to this email directly, view it on GitHubhttps://github.com/MikeAxtell/ShortStack/issues/150, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABUJPCK4Z6JNL5HFD3KVIGLY6YU2BAVCNFSM6AAAAABGURAHHKVHI2DSMVQWIX3LMV43ASLTON2WKOZSGI2TQNBSHA3TGNQ. You are receiving this because you are subscribed to this thread.Message ID: @.***>

MikeAxtell commented 3 months ago

I took another look at this.

When more than one BAM is given to --bamfile no Counts.txt is created. This doesn't seem like the behavior that I intended. But there may have been a technical reason ... I will look into it and report back on this thread.

The workaround is simple: First merge your multiple bamfiles making sure to mark readgroups per file names using the -r switch on samtools merge :

samtools merge --threads [n_threads] -r -o merged.bam in1.bam in2.bam ...

You can then input the merged.bam file as a single argument to ShortStack's --bamfile option, and Counts.txt file will be created.

MikeAxtell commented 3 months ago

As of commit 5335d2ee9f1168a667e866d8746a250a0ce900ee this enhancement has been added. Now, when user inputs more than one BAM file to --bamfile, a Counts.txt will be created.

This will be included in the next release, 4.0.4, which should be out soon.

mikrzol commented 3 months ago

Hi Prof. Axtell,

thanks for the update. Glad I could point out a bug and it'll get fixed soon!

Thanks also for the suggestion to split the alignment and merge the alignments later. I submitted the jobs to perform in parallel on the server I'm using for the analysis and I ran into some issues, but it was probably because of wrong versions of tools being used for some reason (e.g., the default server samtools rather than the one installed in the conda environment with ShortStack).

I'll keep you updated.

MikeAxtell commented 3 months ago

Resolved in release 4.0.4

mikrzol commented 3 months ago

Hi Prof. Axtell,

thanks for resolving the issue. I tried using ShortStack (4.0.3) to first align files in batches, then merging them into one file with samtools merge -r, and finally using the resulting file as the input, but I still didn't get the Counts.txt file (despite using only one merged.bam file).

I hope updating ShortStack will solve the issue. Would you kindly add the newest version to bioconda for ease of access?

MikeAxtell commented 3 months ago

Yes, version 4.0.4 should solve this. New releases are automatically bumped by bionconda. Except that right now the bioconda system for updating recipes is broken ... see https://github.com/bioconda/bioconda-recipes/issues/41025 ShortStack v4.0.4 will show up on bioconda as soon as bioconda's team solves their current technical issues. That is out of my hands.

In the meantime you can just download the ShortStack v4.0.4 script directly from this repo and execute it in a suitable environment.

dirkjanvw commented 3 months ago

Hi! Just a quick note that ShortStack v4.0.4 is now available via bioconda :) The issues with the bioconda system seem solved.