CraigIDent / SpliSER

Bioinformatic tool for Splice site Strength Estimation using RNA-seq

'SpliSER process' taking a very long time to run #1

Open chrisAta opened 2 years ago

chrisAta commented 2 years ago

Hi!

I'm trying to run SpliSER to detect differentially used splice-sites based on two conditions for 16 different mouse samples. I aligned the RNASeq reads using HISAT2, producing BAM files that are around 4-5 GB in size each. I also produced the BED files using RegTools as recommended.

Now that I have everything to actually start the pipeline, I've run into a problem with the first step: 'process'. Essentially, it takes a very long time to run. I tried running it on one of the samples, and it was stuck on the first mouse chromosome for over an hour before I investigated further. I then limited SpliSER to just the mitochondrial chromosome - which only has 37 genes - and that alone took almost 8 minutes.

Given that the mouse genome has over 50 thousand genes, it sounds impossible to actually run SpliSER on my data in a reasonable amount of time. Is there anything I might be doing wrong, or is SpliSER not meant to be used for such organisms?

This is the line of code I used, in case it's helpful:

python ../SpliSER/SpliSER_v0.1.7.py process -B Sample1.bam -b Sample1.bed -A GCF_000001635.27_GRCm39_genomic.gtf --isStranded -s rf -o spliser_output/Sample1

I appreciate any help!

Best regards, Chris

CraigIDent commented 1 year ago

Hi Chris, I'm sorry that I've only just seen this! It slipped through on the email somehow.

I agree that it's a problem that it's taking so long. If you ended up letting one run through, I'd be interested to hear how long it took.

I'll check with some colleagues who have been running SpliSER on human data - their file sizes and runtime, and will get back to you.

Best, Craig

CraigIDent commented 1 year ago

Hi Chris, following up on this:

My colleagues have reported that for 4-5GB BAM files, some runs are taking as long as 4 days. While that might be possible to do in parallel on an HPC, it's not practical for many applications. It seems I overlooked this in development because we were working with smaller BAMs for our own human/mouse analyses.
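As a stopgap, the per-sample runs can be launched side by side. The following is only a rough sketch, not part of SpliSER: it reuses the flags and paths from the command posted earlier in this thread, while the helper names `build_command` and `run_all`, the `max_workers` value, and any sample names beyond `Sample1` are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Script path and annotation file copied from the command posted in this thread.
SPLISER = "../SpliSER/SpliSER_v0.1.7.py"
ANNOTATION = "GCF_000001635.27_GRCm39_genomic.gtf"


def build_command(sample, out_dir="spliser_output"):
    """Return the 'process' command for one sample as an argument list."""
    return [
        "python", SPLISER, "process",
        "-B", f"{sample}.bam",
        "-b", f"{sample}.bed",
        "-A", ANNOTATION,
        "--isStranded", "-s", "rf",
        "-o", f"{out_dir}/{sample}",
    ]


def run_all(samples, out_dir="spliser_output", max_workers=4):
    """Run one 'process' job per sample, at most max_workers at a time.

    Returns a dict mapping each sample name to its exit code.
    Threads are enough here because each job is a separate OS process.
    """
    samples = list(samples)
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    def run_one(sample):
        return subprocess.run(build_command(sample, out_dir)).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(samples, pool.map(run_one, samples)))
```

Calling `run_all(["Sample1", "Sample2"])` would start both jobs and report their exit codes. This shortens the wall-clock time for a batch of samples but does nothing for the runtime of a single sample.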

I'll leave this issue open while we work on speeding up the process step.

Best, Craig

chrisAta commented 1 year ago

Hi Craig!

I appreciate the followup, thank you! Good to know that this is normal behaviour for now - I'll check back again in the future when things are sped up :)

Best regards, Chris

chenkenbio commented 1 year ago

Hi Craig,

I tried to use the pysam package instead of calling samtools through subprocess and found it runs much faster: https://github.com/chenkenbio/SpliSER/blob/master/SpliSER_v0.1.7.pysam.py .
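To illustrate the general idea (this is a minimal sketch, not the code from either script): spawning a `samtools` process for every query pays the process start-up and file-open cost each time, whereas pysam can keep one BAM handle open and reuse it. The function names below are hypothetical, counting reads per interval is just an assumed example query, and the pysam version needs an indexed BAM (`.bai`).

```python
import subprocess


def samtools_region(contig, start, stop):
    """Convert a 0-based, half-open interval to a samtools region string.

    samtools regions are 1-based and inclusive, so only the start shifts.
    """
    return f"{contig}:{start + 1}-{stop}"


def count_with_subprocess(bam_path, contig, start, stop):
    """Count alignments overlapping the interval by spawning samtools once."""
    result = subprocess.run(
        ["samtools", "view", "-c", bam_path, samtools_region(contig, start, stop)],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def count_many_with_pysam(bam_path, intervals):
    """Count alignments for many intervals while opening the BAM only once.

    intervals is an iterable of (contig, start, stop) tuples using pysam's
    0-based, half-open coordinates.
    """
    import pysam  # imported lazily so the rest of the module loads without pysam

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return [bam.count(contig, start, stop) for contig, start, stop in intervals]
```

With thousands of splice sites per chromosome, the first function would start thousands of processes, while the second reuses a single open file.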

Best regards, Ken