GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0

Estimate of processing time #117

Open Caffeinated-Code opened 9 months ago

Caffeinated-Code commented 9 months ago

Hi, I am hoping to get an idea of typical processing times for TAMA collapse. My input of ~3.5 M Nanopore reads has been processing for close to 24 hrs now. Does it usually take this long, and are there options to speed it up?

ETA: A downsampled set of ~400k reads took ~93 hrs to process. Are these processing times expected?

Best, Swathi

Caffeinated-Code commented 9 months ago

Hi, I have been splitting the input SAM into batches, running collapse on each batch, and merging the results as suggested. I have a couple more questions about this workflow and hope you can help me understand it better.

1) My input SAM has 36.3 million reads and comes from a target capture enrichment of select regions on a particular chromosome.
From the code, it looks like the split is done per chromosome, which explains the 17 uneven split files. The bulk of my reads fall on one chromosome, and TAMA collapse hits a memory error when processing that split (*R1_5.sam). Subsetting the BAM to this one chromosome wouldn't help, since I would still get only one split file for that chromosome. Is there a way to process the reads in batches of a million and reliably put them back together with TAMA merge?

Sizes of split files:

[image: sizes of split files]

ETA: I defined split regions based on my data, using non-overlapping coordinates within my region of interest, and ran TAMA collapse on each split. Each split has on the order of 200-500k reads, but collapse is still time-consuming to run.
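For reference, here is a rough sketch of the kind of batching I have in mind. It is not part of TAMA; `split_sorted_sam`, `ref_end`, and `max_reads` are names I made up for illustration. It assumes a coordinate-sorted SAM with all header lines at the top, and it only closes a batch at a chromosome change or at a gap where no read overlaps, so a batch can end up larger than `max_reads` in a deeply covered region. Whether batches cut this way merge back cleanly with TAMA merge is exactly what I am unsure about.

```python
import re
from typing import Iterable, Iterator, List

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def ref_end(pos: int, cigar: str) -> int:
    """Return the 1-based inclusive reference end of an alignment."""
    span = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return pos + span - 1


def split_sorted_sam(lines: Iterable[str], max_reads: int) -> Iterator[List[str]]:
    """Yield batches (header lines + records) from a coordinate-sorted SAM.

    A batch is closed only once it holds at least max_reads records AND the
    next record starts on a new chromosome or past the end of every record
    already seen on the current chromosome, so overlapping reads stay together.
    """
    header: List[str] = []
    batch: List[str] = []
    cur_chrom, cur_end = None, 0
    for line in lines:
        if line.startswith("@"):
            header.append(line)
            continue
        fields = line.split("\t")
        chrom, pos, cigar = fields[2], int(fields[3]), fields[5]
        if chrom == "*" or cigar == "*":
            continue  # skip unmapped records
        at_gap = chrom != cur_chrom or pos > cur_end
        if batch and at_gap and len(batch) >= max_reads:
            yield header + batch
            batch = []
        if chrom != cur_chrom:
            cur_chrom, cur_end = chrom, 0
        cur_end = max(cur_end, ref_end(pos, cigar))
        batch.append(line)
    if batch:
        yield header + batch


if __name__ == "__main__":
    # Tiny made-up example: three overlapping reads, then one isolated read.
    demo = [
        "@HD\tVN:1.6\tSO:coordinate\n",
        "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*\n",
        "r2\t0\tchr1\t120\t60\t10M100N10M\t*\t0\t0\t*\t*\n",
        "r3\t0\tchr1\t200\t60\t50M\t*\t0\t0\t*\t*\n",
        "r4\t0\tchr1\t500\t60\t50M\t*\t0\t0\t*\t*\n",
    ]
    for i, b in enumerate(split_sorted_sam(demo, max_reads=2)):
        print(i, len(b) - 1, "reads")
```

Each yielded batch could be written to its own SAM file and passed to collapse separately. Cutting only at coverage gaps is meant to avoid splitting one locus across two batches, at the cost of uneven batch sizes.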

2) TAMA merge outputs an empty *_trans_read.bed, and the TAMA merge documentation does not list this file among its outputs. Is there a way to obtain this file, which usually maps read IDs to transcript model IDs? Example from column 4 of trans_read.bed: G2.3;59:1307|fcf8906e-166e-471c-9d5c-e3758e2e80c0
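To show what I am trying to recover, this is a small sketch of how I read that mapping out of a non-empty trans_read.bed from the collapse step. It assumes column 4 always has the form `<transcript ID>;<read ID>`, as in the example above; `reads_by_transcript` is just my own helper name, and every field in the demo line other than column 4 is made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def reads_by_transcript(bed_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group read IDs by transcript model ID using column 4 of a trans_read.bed."""
    mapping: Dict[str, List[str]] = defaultdict(list)
    for line in bed_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not a usable BED record
        # Split on the first ';' only, since the read ID may contain other symbols.
        trans_id, sep, read_id = fields[3].partition(";")
        if sep:
            mapping[trans_id].append(read_id)
    return dict(mapping)


if __name__ == "__main__":
    # Column 4 is taken from my file; the other fields are placeholders.
    demo = [
        "chr1\t1000\t2000\tG2.3;59:1307|fcf8906e-166e-471c-9d5c-e3758e2e80c0"
        "\t40\t+\t1000\t2000\t255,0,0\t1\t1000\t0\n"
    ]
    print(reads_by_transcript(demo))
```

This only covers the collapse-stage file; I still need a way to carry the read IDs through to the merged transcript IDs.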