boutroslab / CRISPRAnalyzeR

CRISPRAnalyzeR: interactive analysis, annotation and documentation of pooled CRISPR screens
GNU General Public License v2.0

Increased speed in FASTQ extraction and Mapping #20

Closed by jwinter6 7 years ago

jwinter6 commented 7 years ago

Dear all, we are currently working on increasing the speed of the FASTQ data extraction and mapping. We expect a speed increase of 4-9x compared to the current implementation.

I will close this once it is implemented and released.

Best, Jan

DarioS commented 7 years ago

Have you considered using cutadapt for extraction? I tend to use a command like:

cutadapt -g TCTTGTGGAAAGGACGAAACACCG...GTTTTAGAGCTAGAAATAGCAAGT -e 0.05 --discard-untrimmed -o trimmedGuides.fastq.gz aSample.fastq.gz

for GeCKO V2 samples. Specifying both flanking sequences in full also allows for a new kind of QC plot (or table): the gRNA length distribution. We notice a small percentage of inserts that are typically 30 to 40 nucleotides long, where the extracted sequence corresponds to 2 gRNAs fused together (so unlikely to work in the cell as intended). Currently, the default regular expression is good because it discards these inserts, but there's an opportunity to create a QC metric based on them.
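The length-distribution QC idea above can be sketched in a few lines of Python. This is only an illustration, not part of CRISPRAnalyzeR or cutadapt: the function names and the `min_length=30` cutoff are hypothetical (the cutoff is taken from the 30 to 40 nucleotide range mentioned above), and the reader assumes a standard four-line-per-record FASTQ file such as the trimmed output of the cutadapt command.

```python
import gzip
from collections import Counter


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def insert_length_distribution(path):
    """Tally the lengths of the trimmed inserts in a FASTQ file.

    Assumes four lines per record (header, sequence, '+', qualities),
    i.e. sequences are not wrapped over several lines.
    """
    lengths = Counter()
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                lengths[len(line.strip())] += 1
    return lengths


def fraction_at_least(lengths, min_length=30):
    """Fraction of inserts that are min_length or longer.

    min_length=30 is an assumed cutoff for possible fused gRNAs.
    """
    total = sum(lengths.values())
    if total == 0:
        return 0.0
    long_inserts = sum(n for length, n in lengths.items() if length >= min_length)
    return long_inserts / total
```

Calling `insert_length_distribution("trimmedGuides.fastq.gz")` would give a length-to-count table that can be plotted or tabulated per sample, and `fraction_at_least` turns it into a single number to compare across samples.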

jwinter6 commented 7 years ago

Hi Dario,

Thanks for the cutadapt tip. Since we do not see multiple sgRNA sequences (which seems to be a cloning issue), I have not included anything like that. We are currently testing the new implementation, which is written in Rust, and it seems to give an overall speed boost of 10-13x (even faster than C).

DarioS commented 7 years ago

Interesting; I didn't know about Rust.

jwinter6 commented 7 years ago

Implemented in 1.16BETA. It will be the new default, with a fallback to the old Perl scripts. The speed improvement is 14x, so the slowest operation is now the bowtie2 mapping :)

For more information on the FASTQ and SAM extraction, see:

https://github.com/OliPelz/fastq_extractor_proof_of_principle/tree/master/extractor_in_RUST