MonashBioinformaticsPlatform / RNAsik-pipe

RNAsik - more than just a pipeline
https://monashbioinformaticsplatform.github.io/RNAsik-pipe/
Apache License 2.0

Pair detection fails when pairId appears twice in filename #53

Open pansapiens opened 4 years ago

pansapiens commented 4 years ago

For example, the input files sampleA_1-WT-V-A_1.fq.gz and sampleA_1-WT-V-A_2.fq.gz, run with the flags -paired -pairIds _1,_2 -extn .fq.gz, fail with an error:

Fatal error: /scratch/pl41/laxy/jobs/miniconda3/envs/rnasik-1.5.3/bin/../opt/rnasik-1.5.3/src/sikFqFiles.bds, line 247, pos 17. -paired set to true, but can't find _2 read. Is it single-end data? Also check your -pairIds _1,_2
Stack trace:
error "-paired set to $paired, but can't find $pai ...  # /scratch/pl41/laxy/jobs/miniconda3/envs/rnasik-1.5.3/bin/../opt/rnasik-1.5.3/src/sikFqFiles.bds:247
  samplesSheet = makeSamplesSheet( fqFiles,fqRgxs, ...  # /scratch/pl41/laxy/jobs/miniconda3/envs/rnasik-1.5.3/bin/../opt/rnasik-1.5.3/src/RNAsik.bds:93

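This happens because string.replace is used to convert the _1 filename into the _2 filename when verifying that both paired files exist, but the substring _1 occurs twice in the first read's filename. Presumably both occurrences get replaced, giving sampleA_2-WT-V-A_2.fq.gz, which does not exist.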
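A minimal Python analogue of the failure, for illustration only. The pipeline itself is written in BDS, and this is not its code. It assumes the replacement substitutes every occurrence of the pairId, as Python's str.replace does:

```python
# Python analogue of the failure; the actual pipeline code is BDS, not Python.
r1 = "sampleA_1-WT-V-A_1.fq.gz"
r1_id, r2_id = "_1", "_2"

# Naive approach: replace every occurrence of the R1 pairId.
naive_r2 = r1.replace(r1_id, r2_id)

# The file that actually exists on disk for the second read.
expected_r2 = "sampleA_1-WT-V-A_2.fq.gz"

print(naive_r2)                 # sampleA_2-WT-V-A_2.fq.gz
print(naive_r2 == expected_r2)  # False, so the _2 read is reported missing
```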

One solution is to enforce that the pairId must be immediately before the extension, like this: https://github.com/pansapiens/RNAsik-pipe/commit/914f0297b578c5a6a20c37820d6a2688833f7117

(A side effect of this patch: for typical Illumina instrument output, e.g. somereads_R1_001.fastq.gz and somereads_R2_001.fastq.gz, you'd probably need to specify -paired -pairIds _R1_001,_R2_001 -extn .fastq.gz, or maybe -paired -pairIds _R1,_R2 -extn _001.fastq.gz. Untested.)
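A rough Python sketch of the "pairId must sit immediately before the extension" idea. This is not the code from the linked commit; the function name mate_filename and its signature are hypothetical, and whether RNAsik itself accepts an -extn value such as _001.fastq.gz remains untested:

```python
from typing import Optional


def mate_filename(r1_name: str, r1_id: str, r2_id: str, extn: str) -> Optional[str]:
    """Hypothetical helper: derive the R2 filename from the R1 filename.

    Only a pairId located immediately before the extension is swapped.
    Returns None when r1_name does not end in r1_id + extn.
    """
    suffix = r1_id + extn
    if not r1_name.endswith(suffix):
        return None
    # Keep everything before the trailing pairId + extension untouched.
    stem = r1_name[: len(r1_name) - len(suffix)]
    return stem + r2_id + extn


# The filename from this issue: the earlier "_1" is left alone.
print(mate_filename("sampleA_1-WT-V-A_1.fq.gz", "_1", "_2", ".fq.gz"))
# sampleA_1-WT-V-A_2.fq.gz

# Illumina-style names: "_R1" is not directly before ".fastq.gz", so no match.
print(mate_filename("somereads_R1_001.fastq.gz", "_R1", "_R2", ".fastq.gz"))
# None

# Folding "_001" into the extension makes the pairId adjacent to it again.
print(mate_filename("somereads_R1_001.fastq.gz", "_R1", "_R2", "_001.fastq.gz"))
# somereads_R2_001.fastq.gz
```

The trade-off is the one noted above: anchoring to the extension removes the ambiguity, but filenames with extra tokens between the pairId and the extension need those tokens included in either the pairIds or the extension.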

pansapiens commented 4 years ago

https://github.com/MonashBioinformaticsPlatform/RNAsik-pipe/issues/45 should probably take priority here, since it would enable explicitly working around any of these types of pair detection issues.

pansapiens commented 3 years ago

Still occurs in RNAsik 1.5.4.