MonashBioinformaticsPlatform / RNAsik-pipe

RNAsik - more than just a pipeline
https://monashbioinformaticsplatform.github.io/RNAsik-pipe/
Apache License 2.0
13 stars 5 forks source link

Help improve `alignerReads` function #14

Open serine opened 6 years ago

serine commented 6 years ago

Hi there,

I'm putting this out to the community who is willing to help up with this sickest pipeline - RNAsik ! I recently discovered a bug in my code this commit, specifically look at the comment in sikSTARaligner.bds file, starting on line 59.

Scope for this function:

The function takes two arguments:

And returns a list of maximum two strings. First element is R1 and second element R2 (if paired-end data). If data single end then list will only have one element in it.

Here is were the bug was, if data was split across lanes (this is irrespective whether data was paired-end or not) i.e two fastq files for one sample e.g this is single-end data split across multiple lanes, sample_a

alignerReads will return [sample_a_R1_001.fastq.gz,sample_R1_002.fastq.gz] Note that there is only one element in that list, because this is single end data, but there are two files there.

This is wanted because most aligners use this syntax to separate multiple lanes, and a space between R1 and R2 reads. For more information refer to STAR (or Hisat2) manual and search for --readFilesIn flag. BUT BigDataScript's dependencies don't recognise such string, as it can't understand the path to the file (fair enough!) so my quick workaround was to take, at the receiving end, and loop over alignerReads list and split each element on a comma and append into new deps (dependencies) list. This might be enough, however I haven't done this for the other aligners and it feels that a better solution would be if alignerReads returned a list of two lists, R1 and R2, in that order and at the receiving one can append those lists into deps list and then join appropriate reads into a string for aligner use. However there is a limitation ? of BigDataScript - it can't have nested lists so either fix it upstream, at BigDataScript or provide more elegant solution here.

One possible approach is only return R1's with very first element being a flag for paired-end data i.e true or false then at receiving end if paired-end true simply string readTwoStr = readOneStr.replace("_R1", "_R2"), but I'm not sure if this is any better.

Another possible implementation is to return a map with two keys, but again can't have list as a values, so will still be returning a string of files. I just thought that a map might a more elegant to work with at the receiving end i.e inside the aligner and I guess, will just have to cope with processing a string there.

So I'm putting this out here for any suggestions / comments / potential PR's (pull requests)

Cheers,