LieberInstitute / SPEAQeasy

SPEAQeasy: portable LIBD RNA-seq pipeline using Nextflow. Check http://research.libd.org/SPEAQeasy-example/ for an example on how to use this pipeline and analyze the resulting output files.
http://lieberinstitute.github.io/SPEAQeasy
MIT License
6 stars 4 forks source link

Make code more robust #67

Closed lcolladotor closed 3 years ago

lcolladotor commented 3 years ago

Hi,

While writing #66 and in particular, while checking https://github.com/LieberInstitute/SPEAQeasy/blob/master/scripts/build_annotation_objects.R#L125 I noticed that some R commands are written in such a way that they assume that the columns will always be the same. That is not necessarily the case. For example geneMap[geneMap$type == 'gene', c(1:5, 10, 13)] assumes that we'll always want columns 1 through 5, 10 and 13. Instead, I recommend using a character vector here. Let's say c("seqnames", "start", "end", "width", "strand", "gene_id", "gene_name"). We went through a similar process at https://github.com/LieberInstitute/spatialDLPFC/issues/4.

You can use the Gencode v37 and v25 output I included at #66 and check if names have changed for human versions across time. Likely they haven't in if they did, then it's also likely that column numbers changed too. With column names, we'll get an error if things changed. With numbers, we won't necessarily get an error (the columns could be of the same type) and it'll be harder to identify the source of the error.

Best, Leo

Nick-Eagles commented 3 years ago

I've solved the original issue described in build_annotation_objects.R. However, there are a few other segments of code that are similarly fragile (hardcode indices to data that could change with time). I think this issue can be considered solved once similar changes are made here, here and here. In the last two links, the entire code (for parsing FastQC and HISAT2 logs) will likely have to be re-written.

Nick-Eagles commented 3 years ago

For the first link from above, there doesn't appear to be an obvious way to make the code more robust (there are no colnames or labels to use when splitting the target_id column by the | delimiter). For this reason, I won't consider that code segment as part of the current issue to solve. I'll soon commit fixes for the latter two links in the above comment, and at that point I'll consider the issue resolved.