TheJacksonLaboratory / splicing-pipelines-nf

Repository for the Anczukow-Lab splicing pipeline

Unexpected rMATS error #204

Closed sk-sahu closed 3 years ago

sk-sahu commented 3 years ago

As reported in this Slack thread: seemingly at random (other runs work as expected), the rmats process fails with the following error.

ERROR: while performing statistical analysis, user should provide two groups of samples. Please check b1,b2 or s1,s2.

Failed jobs -

Success jobs -

The problem seems to be with defining b1 and b2. It looks like b1 is made correctly but for some reason b2 is blank.

This has something to do with the logic referenced here, which produces the two files b1.txt and b2.txt.

I checked the failed job on the back end (work dir): both files are generated, but b2.txt is empty.
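A check like the one below could be run in the work dir before rMATS starts, so that an empty b2.txt is reported directly instead of surfacing as the rMATS error above. This is a minimal sketch: `read_group` and `check_groups` are hypothetical helper names, not part of the pipeline, and it assumes each group file holds a single comma-separated list of BAM paths.

```python
from pathlib import Path


def read_group(path):
    """Return the BAM paths listed in a group file (assumed comma-separated)."""
    text = Path(path).read_text().strip()
    return [p.strip() for p in text.split(",") if p.strip()]


def check_groups(b1_path, b2_path):
    """Raise ValueError naming every group file that is missing or lists no BAMs."""
    problems = []
    for label, path in (("b1", b1_path), ("b2", b2_path)):
        p = Path(path)
        if not p.is_file():
            problems.append(f"{label}: {path} does not exist")
        elif not read_group(p):
            problems.append(f"{label}: {path} is empty")
    if problems:
        raise ValueError("; ".join(problems))
```

Calling `check_groups("b1.txt", "b2.txt")` in a failed work dir would then raise with a message naming b2.txt, which makes the failing runs easier to tell apart from the successful ones.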

Attempt to reproduce

To reproduce this, I extracted the following script from main.nf:

test_group.nf

```nextflow
bams = "old_job/MYC_high_vs_low_bams_forCloudOS_updated2.csv"
rmats_pairs = "old_job/BRCA_MYC_low_v_high_rmatsPairs_revised3.txt"

Channel
  .fromPath(bams)
  .ifEmpty { exit 1, "Cannot find BAMs csv file : ${bams}" }
  .splitCsv(skip:1)
  .map { name, bam, bai -> [ name, file(bam), file(bai) ] }
  .into { indexed_bam; indexed_bam_rmats }

indexed_bam_rmats
  .map { name, bam, bai -> [name, bam] }
  .set { bam }

Channel
  .fromPath(rmats_pairs)
  .ifEmpty { exit 1, "Cannot find rMATS pairs file : ${rmats_pairs}" }
  .splitCsv(sep:' ')
  .map { row ->
    def rmats_id = row[0]
    def b1 = row[1].toString().split(',')
    def b2 = row[2].toString().split(',')
    [ rmats_id, b1, b2 ]
  }
  .set { samples}

samples
  .map { row ->
    def samples_rmats_id = []
    def rmats_id = row[0]
    def b1_samples = row[1]
    def b2_samples = row[2]
    b1_samples.each { sample -> samples_rmats_id.add([sample, 'b1', rmats_id]) }
    b2_samples.each { sample -> samples_rmats_id.add([sample, 'b2', rmats_id]) }
    samples_rmats_id
  }
  .flatMap()
  .combine(bam, by:0)
  .map { sample_id, b, rmats_id, bam -> [ rmats_id + b, rmats_id, bam] }
  .groupTuple()
  .map { b, rmats_id, bams -> [rmats_id[0], [b, bams]] }
  .groupTuple()
  .map { rmats_id, bams ->
    def b1_bams = bams[0][0].toString().endsWith('b1') ? bams[0] : bams[1]
    def b2_bams = bams[0][0].toString().endsWith('b2') ? bams[0] : bams[1]
    rmats_id_bams = b2_bams == null ? [ rmats_id, b1_bams[1], "no b2", true ] : [ rmats_id, b1_bams[1] , b2_bams[1], false ]
    rmats_id_bams
  }
  .set { bams }

//bams.view()

process rmats {
  echo true

  input:
  set val(rmats_id), file(bams), file(b2_bams), val(b1_only) from bams

  script:
  if (b1_only) {
    b1_bams = bams.join(",")
    b2_cmd = ''
    b2_flag = ''
    b2_config_cmd = ''
  } else {
    b1_bams = bams.join(",")
    b2_bams = b2_bams.join(",")
    b2_cmd = "echo b2.txt $b2_bams"
    b2_flag = "--b2 b2.txt"
    b2_config_cmd = "echo b2 b2.txt >> \$rmats_config"
  }
  """
  echo b1.txt $b1_bams
  $b2_cmd
  """
}
```

Run

nextflow run test_group.nf

However, it runs fine and behaves as expected (it creates both b1.txt and b2.txt).
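For readers less familiar with Nextflow channels, the grouping in test_group.nf can be sketched in plain Python. This is only an illustration under assumptions: `group_bams` is a hypothetical name, the sample names and BAM paths are made up, and it models `.combine(bam, by:0)` as an inner join that drops samples with no matching BAM, which is my understanding of that operator rather than something confirmed in this thread. It is not a diagnosis of this failure.

```python
from collections import defaultdict


def group_bams(pairs_row, bam_by_sample):
    """Mimic the b1/b2 grouping for one row of the rMATS pairs file.

    pairs_row: (rmats_id, [b1 sample names], [b2 sample names])
    bam_by_sample: dict mapping sample name -> BAM path (from the bams CSV)
    Returns (rmats_id, b1_bams, b2_bams, b1_only), like the final .map step.
    """
    rmats_id, b1_samples, b2_samples = pairs_row
    groups = defaultdict(list)
    for label, samples in (("b1", b1_samples), ("b2", b2_samples)):
        for sample in samples:
            # Inner join: a sample with no matching BAM is dropped silently.
            if sample in bam_by_sample:
                groups[label].append(bam_by_sample[sample])
    if not groups["b2"]:
        # No b2 BAMs survived the join, so the row is treated as b1-only.
        return (rmats_id, groups["b1"], "no b2", True)
    return (rmats_id, groups["b1"], groups["b2"], False)


# Made-up sample names and paths, for illustration only.
bam_table = {"s1": "s1.bam", "s2": "s2.bam", "s3": "s3.bam"}
print(group_bams(("cmp", ["s1", "s2"], ["s3"]), bam_table))
print(group_bams(("cmp", ["s1", "s2"], ["s3_typo"]), bam_table))
```

The second call shows that, in this model, a b2 sample name that matches nothing in the BAM table leaves the comparison with no b2 group at all, without raising any error.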

sk-sahu commented 3 years ago

As @cgpu suggested

The error seems to be coming from this check:

https://github.com/Xinglab/rmats-turbo/blob/8520f7df122b1690efbf836ec3ce63512a0cbd27/rmats.py#L176-L179

    if (args.task != 'prep' and args.stat
        and (len(args.b1) * len(args.b2) == 0)
        and (len(args.s1) * len(args.s2) == 0)):
        sys.exit('ERROR: while performing statistical analysis, user should provide two groups of samples. Please check b1,b2 or s1,s2.')

We need to check which of these conditions is being triggered, and also find the same check in the rMATS version the pipeline actually uses (the snippet above is from the latest version, for quickness).
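To make the condition easier to reason about, here is a standalone restatement of the quoted check. It is a sketch, not rMATS code: `two_groups_missing` is a hypothetical name, and the string arguments stand in for whatever rMATS holds in `args.b1`, `args.b2`, `args.s1` and `args.s2` at that point, which may differ between versions.

```python
def two_groups_missing(task, stat, b1, b2, s1, s2):
    """Return True when the quoted rMATS check would exit with the error.

    The error fires only if stats are requested outside the 'prep' task and
    neither the (b1, b2) pair nor the (s1, s2) pair is fully provided.
    """
    return bool(
        task != "prep"
        and stat
        and len(b1) * len(b2) == 0
        and len(s1) * len(s2) == 0
    )


# Illustrative values: b1 is set but b2 is empty, and no s1/s2 are given.
print(two_groups_missing("both", True, "a.bam,b.bam", "", "", ""))
# Both groups provided, so the check passes.
print(two_groups_missing("both", True, "a.bam,b.bam", "c.bam", "", ""))
```

Because the two lengths are multiplied, a single empty group is enough to make the product zero, which is consistent with an empty b2 alone triggering the message.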

cgpu commented 3 years ago

@lmurba @angarb pinging you in case you run into a similar error. We tracked down the check in the source code that prints this message; see here.

The only exact Google match I could find is from a user forum, here:

image

angarb commented 3 years ago

@cgpu thanks! This error makes sense though, because the pipeline is not properly making b2.txt - therefore rMATS cannot find it. It is unclear why the b2 generation is glitching on the cloud.

cgpu commented 3 years ago

Thanks @angarb, is it working as expected on Sumner? If so, we might get some info on what's different and debug.

angarb commented 3 years ago

@cgpu With other datasets, this step works fine on Sumner. With this particular TCGA input 'bams.csv', we have not yet tested it on Sumner (since we were uncertain how to access the TCGA bam files in the cloud from Sumner). @sk-sahu informed us that if we just give the paths to the Google buckets in the bams.csv, we should be able to access them on Sumner. However, @sk-sahu did test the portion of the script that generates b1 and b2 locally, and it did not error.

angarb commented 3 years ago

@cgpu The other piece of evidence suggesting it is a cloud issue is that it works sporadically. There is definitely a randomness to whether or not b2 is generated from the same bams.csv and rmats_pairs files. @lmurba has done a lot of testing with this.

cgpu commented 3 years ago

@angarb @lmurba thanks both, I will work with @sk-sahu on this and keep this thread up to date.

sk-sahu commented 3 years ago

@angarb @lmurba @cgpu This unexpected issue of b1.txt and b2.txt not being generated is fixed. (The fix is in the Lifebit copy; I will bring the changes to this JAX copy as well. Going forward this will be the only repo anyway, as it can now be imported.)

Fix description - There was a channel typo, so the process was not able to fetch the proper files.

After the fix, I tested twice, and in both cases it was able to generate b1.txt and b2.txt.
