Closed gpertea closed 2 years ago
Looking at the junctionCount() code there are a few strange things that should be checked:
strandSpecific
param should always be passed as TRUE, even for unstranded (polyA) data ?read.delim()
calls are made with skip=1
but I do not see a header to be skipped in the junctions_primaryOnly_regtools.count
files?Also, noticed that replacing read.delim() calls with data.table fread()
makes the file reading about 5 times faster (for a test with hundreds of files)
For your first point, if I remember correctly we were trying to avoid assignment of "*" for strand, in which case, yes, we should always use strandSpecific=TRUE
given the code here. Should we always use TRUE
in general, or does this apply only to the BSP1 data?
For the second point, I would indeed consider this a bug in the jaffelab
package: if the junctionCount
code was written with regtools .count
outputs in mind, skipping the first line is definitely not desired. At the very least we could maybe add a skip
argument to the junctionCount
constructor (and likely its default value should be 0, not 1!).
I also agree the junctionCount
function can be optimized, and at some point this will be absolutely necessary, as for larger RNA-seq datasets (100s of samples) we've had memory usage go even above 200GB.
About that strandSpecific
setting, I have a deja vu - https://jhu-genomics.slack.com/archives/C28PVLRQW/p1614708233012300 I think we asked ourselves that very question before. It looks like that should be always T, a.k.a get rid of it.. I thought that was the conclusion (and fix) applied back then
That's why I think this test should likely settle the issue, right? I can check the generated junctions to make sure that their strand is correct.
Thank you for confirming the skip=1
bug, I thought I was missing something there (the count files somehow being stripped of their header after that code was run.. just for kicks).
I tried a full data.table
-based rewrite for that junctionCounts() code, speed can be improved a lot but the memory exploded badly on my first naive attempt. Andrew's smart use of RLE vectors there did a good job at keeping memory usage under control somewhat; the only improvement I can imagine (besides replacing read.delim
with fread
, that's just a speed gain) would be switching to full blown incremental dgCMatrix building (and also use that in the resulting RSE).
Ah, I totally forgot about that conversation... so just to make sure, do you mean we should modify junctionCount
to only have the behavior seen when strandSpecific=TRUE
, or that SPEAQeasy should always pass TRUE
?
I'll make an issue on the jaffelab
repo for the skip
problem.
I guess the safest/most conservative take right now would be to always pass TRUE
from SPEAQeasy. Perhaps when we are ready to improve/rewrite junctionCounts()
more substantially we could reevaluate -- are there any use cases where that parameter being FALSE
(and setting junction strands to *
) makes sense? A full inventory of all possible calls to that function in other code might be needed..
Anyway it looks like fixing that skip=1
bug in read.delim()
does involve rebuilding the jaffelab
package, right? Unless you want to try an out of package function instead, (renamed and modified junctionCounts()
function directly defined in the create_count_objects.R
script maybe..).
Leo fixed the skip=1
bug, and I updated SPEAQeasy to use TRUE
for strandSpecific
. I now have SPEAQeasy running.
The SPEAQeasy processing is done, with outputs under /dcs04/lieber/lcolladotor/BrainSEQ_LIBD001/brainseq_phase1/SPEAQeasy_reprocess/outputs/
.
Something is still odd with the junction coordinates - again only 544 out of 2,339,867 reported junctions in the rse_jx object match the known Gencode 25 junction coordinates.
(the number should be above 250,000 as it is the case for all the other datasets)
This is actually reported in the Class column of rowData:
> table(rowRanges(rse_jx)$Class)
AltStartEnd ExonSkip InGen Novel
739107 226 544 1599990
I suspect an off-by-one error of sorts, somehow caused by recent versions of the pipeline? (or it could be regtools, I am not sure)
Weirdly enough, ~500,000 junctions have their end coordinates match GENCODE, but very few (a few thousand) match by start coordinate. Do you have an idea of the date of the last run of SPEAQeasy producing good junction coordinates (ideally with GENCODE v25, but either way)? I'll try to investigate some more.
Wow, this actually looks like an "off-by-two" bug for the start coordinate (!) That is, if I add 2 to the start coordinate in rowRanges() then the match stats become somewhat sane (checking the coordinates against the junction database) :
full match: 288214 junctions
match start: 742556
match end: 740417
I initially checked this for all the other 11 "official" RNAseq datasets, none of them have this issue. I guess the most recent was processed some time in 2020?
The first dataset where I encountered this was the BSP1 rse_jx that was rebuilt in March this year. Meanwhile I processed some other smaller datasets but did not think to check the junctions -- I just did now and they have the exact same problem (increasing the start coordinate by 2 also produces normal stats..). Even though I am using an older/custom SPEAQeasy branch (still with R 3.6.1), I am using the same JHPCE modules for regtools
and likely the same jaffelab
R package (?) that you have been using.
Anyway this is now a separate issue, unrelated to BSP1 reprocessing -- I guess this issue should stay closed and we should start a new one to clarify this.
Just to have it documented here, I fixed the start coordinate for rse_jx and added pheno data to all the newly generated RSE files, saved here:
/dcs04/lieber/lcolladotor/BrainSEQ_LIBD001/brainseq_phase1/SPEAQeasy_reprocess/RSE_fixed
There is a README in there explaining why there was an additional rse_jx file (with more stringent filtering of novel junctions, so only ~1 million junctions are in that file, as opposed to the original one which has > 2.3 million)
Reasons:
FASTQ files located in
/dcl01/ajaffe/data/Nina/BrainSeq_PhaseI/DLPFC_PolyA/