Xinglab / rmats-turbo


run post step on a subset of bam files from the prep step #59

Closed magmir71 closed 3 years ago

magmir71 commented 3 years ago

I'm trying to use rMATS for studying differential splicing in a panel of various RNA-binding protein (RBP) knockdown (KD) experiments. More precisely, I have 20 RBPs, and for every RBP I have 2 control replicates and 2 KD replicates (so, 80 bam files in total). I ran rMATS in --task prep mode, putting all the bam files into the b1.txt file, and using 80 threads. The prep step completed successfully.

Then I would like to run a post step for each RBP individually (so, 20 post steps in total), but unfortunately rMATS requires all the same bam files to be present in the post step as well.

Is it possible to run the post step on a subset of bam files from the prep step?
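For concreteness, the intended workflow can be sketched as below (the bam paths, RBP labels, gtf name, and read length are all made-up placeholders, not the poster's real files): one comma-separated group file listing every bam for --task prep, then small per-RBP --b1/--b2 files for the individual --task post runs.

```shell
# Build the group files rMATS expects (comma-separated bam paths on one line).
mkdir -p rmats_cfg
all=""
for rbp in RBP01 RBP02; do                  # extend to all 20 RBPs
  ctrl="bams/${rbp}_ctrl_1.bam,bams/${rbp}_ctrl_2.bam"
  kd="bams/${rbp}_kd_1.bam,bams/${rbp}_kd_2.bam"
  printf '%s\n' "$ctrl" > "rmats_cfg/${rbp}_b1.txt"   # controls for this RBP
  printf '%s\n' "$kd"   > "rmats_cfg/${rbp}_b2.txt"   # knockdowns for this RBP
  all="${all:+$all,}${ctrl},${kd}"                    # accumulate for prep
done
printf '%s\n' "$all" > rmats_cfg/all_bams.txt         # all 80 bams, one line
# Intended rMATS commands (not executed here; placeholder gtf and read length):
#   python rmats.py --b1 rmats_cfg/all_bams.txt --gtf anno.gtf \
#     --readLength 100 --task prep --tmp tmp_prep --od out_prep
#   python rmats.py --b1 rmats_cfg/RBP01_b1.txt --b2 rmats_cfg/RBP01_b2.txt \
#     --gtf anno.gtf --readLength 100 --task post --tmp tmp_prep --od out_RBP01
cat rmats_cfg/RBP01_b1.txt
```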

EricKutschera commented 3 years ago

I would expect rMATS to be able to run the post step using any subset of the bam files from the prep step. Are you seeing an error message? Which version of rMATS are you using?

If you are able to build rMATS, then you could try building from the latest commit on the master branch. The issue may be resolved in the latest version of the code.

You could also try running the prep step separately for each bam file. Then for each post step, you could use the individual .rmats files corresponding to the bams for that comparison.
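The per-bam suggestion could look roughly like this sketch (bam names, anno.gtf, and the read length are placeholders; the rmats.py commands are echoed rather than run). Each bam gets its own prep run and --tmp dir, and cp_with_prefix.py, a helper script shipped with rMATS, copies the resulting .rmats files under unique prefixes into one shared directory for the later post runs.

```shell
mkdir -p bams shared_tmp
touch bams/RBP01_ctrl_1.bam bams/RBP01_kd_1.bam   # stand-in bam files
for bam in bams/*.bam; do
  name=$(basename "$bam" .bam)
  printf '%s\n' "$bam" > "prep_${name}_b1.txt"    # one-bam group file
  # Real commands would be (echoed here instead of executed):
  echo "python rmats.py --b1 prep_${name}_b1.txt --gtf anno.gtf --readLength 100 --task prep --tmp tmp_${name} --od out_${name}"
  echo "python cp_with_prefix.py ${name}_ shared_tmp/ tmp_${name}/*.rmats"
done
```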

magmir71 commented 3 years ago

Thanks for the quick response!

I'm getting errors like this (error screenshot omitted): rMATS reports all the files that are present in the *.rmats files but missing from the --b1 and --b2 files for the post task.

Actually, I have a dozen *.rmats files in the tmp_prep directory, which I use as input for the post step (I don't have enough memory to load all the bam files simultaneously). I used the cp_with_prefix.py helper script. The real data set I need to process has 800+ bam files.

I installed v4.1.0 from source using ./build_rmats --conda. test_rmats returned ok for all tests except the paired model, but I don't use --paired-stats for my jobs.

I tried running rMATS with --task both and it worked well. I'm satisfied with the results. However, it takes too long to process my data set in this mode, so I thought I could use the prep feature for parallelization.

I'll try to build from the master branch.

Best, Aleksei

magmir71 commented 3 years ago

I installed the master version. I ran the post step on a subset of samples, and it reported the following warning (screenshot omitted).

My questions are:

1. Is it ok that the JC and JCEC event counts do not match?
2. I see that the gtf file is loaded during the post step, but why? I thought that the *.rmats files already contain all the necessary information for the post step.
3. Does the post step need multiple threads? Which would be more efficient: running the post steps one by one with a large number of threads each, or running multiple post steps at once with & in bash?

EricKutschera commented 3 years ago
  1. Yes, it is ok that the JC and JCEC event counts do not match. That warning message was added to the code based on a misunderstanding (I plan to remove the warning). While the set of events that rMATS detects (in the fromGTF.*.txt files) does not depend on the exon counts (EC), it is actually expected that there can be more events in the final *.MATS.JCEC.txt than in the JC file. This is because the additional exon counts may allow an event to pass a filtering step that removes events with zero counts.
  2. The .rmats file records the information from the bam files efficiently based on the gtf. The reads are grouped according to known/novel splice junctions, and the end coordinates of reads are converted to bounding boxes based on exon boundaries. The grouping allows the reads to be counted together instead of recording the exact position information of each read. The post step then needs to load the gtf in order to interpret the .rmats file and also to use the annotated transcripts as part of the splicing event detection.
  3. The post step can use up to 1 thread per .rmats file, and each thread will load that .rmats file into memory. Using multiple threads in each post step and also running multiple different post steps at once could be a good strategy depending on the total RAM and CPUs available.
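The parallel strategy from point 3 can be sketched with bash background jobs; in this toy version, run_post is only a stand-in for the real "python rmats.py --task post ..." invocation for one RBP comparison.

```shell
# Launch several (placeholder) post steps concurrently, then wait for all.
run_post() { echo "finished post for $1"; }   # stand-in, not rmats.py itself
: > post_log.txt
for rbp in RBP01 RBP02 RBP03; do
  run_post "$rbp" >> post_log.txt &           # each post step in the background
done
wait                                          # block until every job finishes
sort post_log.txt
```

With the real tool, the per-step thread count (--nthread) times the number of concurrent jobs should stay within the machine's CPU and RAM budget, since each thread loads an .rmats file into memory.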
magmir71 commented 3 years ago

Thanks!

magmir71 commented 3 years ago

When I run rMATS in prep mode with the master version, the output is different from the 4.1.0 version. With v4.1.0 I get just one .rmats file for multi-bam input; with the master version I get many .rmats files (I suppose one per bam file in the input).

When I use these .rmats files in the post step (also with the master version), I get the following error (screenshot omitted).

Of note, I don't get this error if I use .rmats files generated by v4.1.0 but run the post step with the master version.

magmir71 commented 3 years ago

In --task both mode everything works fine.

EricKutschera commented 3 years ago

It looks like this is the line of code from the error message: https://github.com/Xinglab/rmats-turbo/blob/cbb165343f7683bb808a9c4b03374b943cb1fd19/rMATS_pipeline/rmatspipeline/rmatspipeline.pyx#L3218

It's trying to read all the .rmats files in the --tmp dir to make sure there is one bam per .rmats file (which is a change since v4.1.0 to make things more efficient). It's expecting the second line of each file to be the read length. The error message looks like the second line was blank: ValueError: invalid literal for int() with base 10: ''

I'm not sure how that line could end up blank. Could you take a look at the first few lines of your .rmats files to see if you see the blank line? Could you post the commands you ran and the commit you built rmats from?
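A quick way to spot such files, based only on the format implied by the error message (the assumption being that line 1 of a .rmats file is a bam path and line 2 is the read length as an integer), is to scan the second line of every .rmats file; the file contents below are made-up examples:

```shell
mkdir -p tmp_demo
printf 'sample1.bam\n100\n' > tmp_demo/good.rmats   # well-formed example
: > tmp_demo/empty.rmats                            # zero-byte leftover
for f in tmp_demo/*.rmats; do
  len=$(sed -n '2p' "$f")                           # second line of the file
  case "$len" in
    ''|*[!0-9]*) echo "BAD $f (second line is not a read length)" ;;
    *)           echo "ok  $f" ;;
  esac
done > tmp_demo/report.txt
cat tmp_demo/report.txt
```

An empty or non-numeric second line is exactly what would make int() raise the ValueError quoted above.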

magmir71 commented 3 years ago

I found that the problem arises when there are empty .rmats files in the --tmp directory. In my case, these empty files were remnants of earlier runs that had ended with errors.

I would suggest modifying the rmatspipeline.pyx code so that it first checks whether the first line of each .rmats file contains the path to a bam file and, if not, simply ignores the file. However, my case may not be very common, and the problem is mostly down to incorrect usage.
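The same guard can also be applied outside rMATS before launching the post step (directory names and file contents below are made-up examples): set aside any .rmats file whose first line is not a bam path, such as the zero-byte leftovers from failed prep runs.

```shell
mkdir -p tmp_post quarantine
printf '/data/s1.bam\n100\n' > tmp_post/1.rmats   # valid: starts with a bam path
: > tmp_post/stale.rmats                          # empty remnant of a failed run
for f in tmp_post/*.rmats; do
  # Keep only files whose first line ends in .bam; quarantine the rest.
  head -n 1 "$f" | grep -q '\.bam$' || mv "$f" quarantine/
done
ls quarantine
```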

Best, Aleksei