New preprocessing release

vladsavelyev commented 6 years ago

First, I wanted to thank you for making a 10x alignment tool that outperforms Lariat, and for publishing in a notebook all the computation scripts you used in the paper. We are at the University of Melbourne Center for Cancer Research, evaluating applications of 10x for WGS somatic variant calling, and your notebook serves as a fantastic reference for evaluation on our internal cancer samples.

Currently we rely on LongRanger for alignment in our runs. However, we are having quite a few technical issues running it on a cluster, and would be happy to evaluate EMA instead and consider integrating it into https://github.com/chapmanb/bcbio-nextgen. At the moment, though, we haven't been able to run EMA to completion except on small tests: the preprocessing step has been taking a substantial amount of time so far (20 hours for a NA12878 WGS sample on 30 threads, although as far as I can see it's using only one core anyway). I'm not sure whether things are running normally or have got stuck. In any case, I understand that you have a more efficient preprocessing method for internal use. I'm wondering if you have any estimate of when you could release it?

Vlad

inumanag commented 6 years ago

Hi Vlad,

Thanks for using EMA!

I hope to finalize the pre-processing part by the end of this week. It's currently working, but it uses quite a lot of RAM (we just wrote a quick prototype to speed up our own alignments), and it will take me a few days to properly integrate the code into EMA.

vladsavelyev commented 6 years ago

That sounds awesome, looking forward to it!

arshajii commented 6 years ago

Thanks for your patience. We've just incorporated the faster preprocessing code into EMA. More details, including the end-to-end workflow, are in the new README. (We'll add a new wrapper script to reflect the changes soon.) I'll close this issue for now, but please let us know if you run into any issues, or have any other questions/suggestions!

vladsavelyev commented 6 years ago

Wow, that's awesome! Thank you for your work. Will try it out first thing tomorrow.

inumanag commented 6 years ago

Hi Vlad,

Some numbers: on the NA12878 dataset (64GB gzipped), pre-processing takes 27 minutes on 40 cores for the gzipped sample (2 mins for counting and 25 mins for correcting). If you want H2 correction (as in the paper), add an extra 15 minutes to the correction step (its impact is negligible, but if you want 100% compatibility with Long Ranger, go with this). In total, it should take less than 1hr (even with 20 threads w/o H2, since most of the time is spent on I/O). Afterwards, we also trimmed down the times by plugging sorting and duplicate marking directly into EMA. All in all, it takes ~10hrs on 40 cores from the initial fastq.gz files to the final indexed, marked and sorted BAM.
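
As a quick sanity check on the pre-processing arithmetic above, here is a minimal Python sketch. The constant and function names are made up for illustration and are not part of EMA; the numbers are only the 40-core NA12878 figures quoted in this comment.

```python
# Pre-processing timings quoted above for NA12878 (64GB gzipped, 40 cores), in minutes.
# All names here are hypothetical; nothing below calls EMA itself.
COUNT_MIN = 2        # barcode counting
CORRECT_MIN = 25     # barcode correction
H2_EXTRA_MIN = 15    # optional H2 correction, added to the correction step


def preprocessing_minutes(with_h2: bool = False) -> int:
    """Return total pre-processing time in minutes, with or without H2 correction."""
    total = COUNT_MIN + CORRECT_MIN
    if with_h2:
        total += H2_EXTRA_MIN
    return total


print(preprocessing_minutes())              # 27
print(preprocessing_minutes(with_h2=True))  # 42
```

Both totals stay under the 1hr mark mentioned above; the ~10hrs end-to-end figure additionally covers alignment, sorting and duplicate marking, which this sketch does not break down.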

vladsavelyev commented 6 years ago

Thanks for the numbers! That's really fast. I've started wondering if it could be made even faster with minimap2 underneath instead of BWA: in our tests it produces very similar alignments, but 1.5 times faster.

arshajii / ema

New preprocessing release #8