reprocess BrainSeq Phase 1 data

LieberInstitute / SPEAQeasy

SPEAQeasy: portable LIBD RNA-seq pipeline using Nextflow. Check http://research.libd.org/SPEAQeasy-example/ for an example on how to use this pipeline and analyze the resulting output files.

http://lieberinstitute.github.io/SPEAQeasy

MIT License

6 stars 4 forks source link

reprocess BrainSeq Phase 1 data #82

Closed gpertea closed 2 years ago

gpertea commented 2 years ago

Reasons:

reprocess with the latest HISAT2
fix rse_jx data properly
test the pipeline on unstranded (polyA) data

FASTQ files located in /dcl01/ajaffe/data/Nina/BrainSeq_PhaseI/DLPFC_PolyA/

gpertea commented 2 years ago

Looking at the junctionCount() code there are a few strange things that should be checked:

[x] according to this code does it mean that strandSpecific param should always be passed as TRUE, even for unstranded (polyA) data ?
[x] read.delim() calls are made with skip=1 but I do not see a header to be skipped in the junctions_primaryOnly_regtools.count files?

Also, noticed that replacing read.delim() calls with data.table fread() makes the file reading about 5 times faster (for a test with hundreds of files)

Nick-Eagles commented 2 years ago

For your first point, if I remember correctly we were trying to avoid assignment of "*" for strand, in which case, yes, we should always use strandSpecific=TRUE given the code here. Should we always use TRUE in general, or does this apply only to the BSP1 data?

For the second point, I would indeed consider this a bug in the jaffelab package: if the junctionCount code was written with regtools .count outputs in mind, skipping the first line is definitely not desired. At the very least we could maybe add a skip argument to the junctionCount constructor (and likely its default value should be 0, not 1!).

I also agree the junctionCount function can be optimized, and at some point this will be absolutely necessary, as for larger RNA-seq datasets (100s of samples) we've had memory usage go even above 200GB.

gpertea commented 2 years ago

About that strandSpecific setting, I have a deja vu - https://jhu-genomics.slack.com/archives/C28PVLRQW/p1614708233012300 I think we asked ourselves that very question before. It looks like that should be always T, a.k.a get rid of it.. I thought that was the conclusion (and fix) applied back then

That's why I think this test should likely settle the issue, right? I can check the generated junctions to make sure that their strand is correct.

Thank you for confirming the skip=1 bug, I thought I was missing something there (the count files somehow being stripped of their header after that code was run.. just for kicks).

I tried a full data.table-based rewrite for that junctionCounts() code, speed can be improved a lot but the memory exploded badly on my first naive attempt. Andrew's smart use of RLE vectors there did a good job at keeping memory usage under control somewhat; the only improvement I can imagine (besides replacing read.delim with fread, that's just a speed gain) would be switching to full blown incremental dgCMatrix building (and also use that in the resulting RSE).

Nick-Eagles commented 2 years ago

Ah, I totally forgot about that conversation... so just to make sure, do you mean we should modify junctionCount to only have the behavior seen when strandSpecific=TRUE, or that SPEAQeasy should always pass TRUE?

I'll make an issue on the jaffelab repo for the skip problem.

gpertea commented 2 years ago

I guess the safest/most conservative take right now would be to always pass TRUE from SPEAQeasy. Perhaps when we are ready to improve/rewrite junctionCounts() more substantially we could reevaluate -- are there any use cases where that parameter being FALSE (and setting junction strands to *) makes sense? A full inventory of all possible calls to that function in other code might be needed.. Anyway it looks like fixing that skip=1 bug in read.delim() does involve rebuilding the jaffelab package, right? Unless you want to try an out of package function instead, (renamed and modified junctionCounts() function directly defined in the create_count_objects.R script maybe..).

Nick-Eagles commented 2 years ago

Leo fixed the skip=1 bug, and I updated SPEAQeasy to use TRUE for strandSpecific. I now have SPEAQeasy running.

Nick-Eagles commented 2 years ago

The SPEAQeasy processing is done, with outputs under /dcs04/lieber/lcolladotor/BrainSEQ_LIBD001/brainseq_phase1/SPEAQeasy_reprocess/outputs/.

gpertea commented 2 years ago

Something is still odd with the junction coordinates - again only 544 out of 2,339,867 reported junctions in the rse_jx object match the known Gencode 25 junction coordinates.

(the number should be above 250,000 as it is the case for all the other datasets)

This is actually reported in the Class column of rowData:

> table(rowRanges(rse_jx)$Class)

AltStartEnd    ExonSkip       InGen       Novel 
     739107         226         544     1599990

I suspect an off-by-one error of sorts, somehow caused by recent versions of the pipeline? (or it could be regtools, I am not sure)

Nick-Eagles commented 2 years ago

Weirdly enough, ~500,000 junctions have their end coordinates match GENCODE, but very few (a few thousand) match by start coordinate. Do you have an idea of the date of the last run of SPEAQeasy producing good junction coordinates (ideally with GENCODE v25, but either way)? I'll try to investigate some more.

gpertea commented 2 years ago

Wow, this actually looks like an "off-by-two" bug for the start coordinate (!) That is, if I add 2 to the start coordinate in rowRanges() then the match stats become somewhat sane (checking the coordinates against the junction database) :

 full match:  288214 junctions
match start:  742556 
  match end:  740417

I initially checked this for all the other 11 "official" RNAseq datasets, none of them have this issue. I guess the most recent was processed some time in 2020?

The first dataset where I encountered this was the BSP1 rse_jx that was rebuilt in March this year. Meanwhile I processed some other smaller datasets but did not think to check the junctions -- I just did now and they have the exact same problem (increasing the start coordinate by 2 also produces normal stats..). Even though I am using an older/custom SPEAQeasy branch (still with R 3.6.1), I am using the same JHPCE modules for regtools and likely the same jaffelab R package (?) that you have been using.

Anyway this is now a separate issue, unrelated to BSP1 reprocessing -- I guess this issue should stay closed and we should start a new one to clarify this.

gpertea commented 2 years ago

Just to have it documented here, I fixed the start coordinate for rse_jx and added pheno data to all the newly generated RSE files, saved here:

/dcs04/lieber/lcolladotor/BrainSEQ_LIBD001/brainseq_phase1/SPEAQeasy_reprocess/RSE_fixed

There is a README in there explaining why there was an additional rse_jx file (with more stringent filtering of novel junctions, so only ~1 million junctions are in that file, as opposed to the original one which has > 2.3 million)