Closed: jwinter6 closed this issue 7 years ago
Have you considered using cutadapt for extraction? I tend to use a command like:
cutadapt -g TCTTGTGGAAAGGACGAAACACCG...GTTTTAGAGCTAGAAATAGCAAGT -e 0.05 --discard-untrimmed -o trimmedGuides.fastq.gz aSample.fastq.gz
for GeCKO V2 samples. Specifying both flanking sequences in full also allows for a new kind of QC plot (or table): the gRNA length distribution. We notice a small percentage of inserts that are typically 30 to 40 nucleotides long, where the extracted sequence corresponds to 2 gRNAs fused together (so unlikely to work in the cell as intended). Currently, the default regular expression is good because it discards these inserts, but there's an opportunity to create a QC metric based on them.
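To make the QC idea concrete, here is a minimal sketch of how the length distribution could be tabulated from the cutadapt output. It is not part of the tool discussed here; the function names are hypothetical, the 30 nt cutoff is only borrowed from the "30 to 40 nucleotides" observation above, and the code assumes a plain 4-line-per-record FASTQ (no wrapped sequences).

```python
import gzip
from collections import Counter
from itertools import islice


def length_distribution(fastq_lines):
    """Count trimmed-insert lengths from an iterable of FASTQ lines.

    Assumes exactly 4 lines per record, so the sequence is always
    the second line of each record.
    """
    sequences = islice(fastq_lines, 1, None, 4)
    return Counter(len(seq.strip()) for seq in sequences)


def fraction_at_least(counts, threshold=30):
    """Fraction of reads whose insert is at least `threshold` nt long.

    The default of 30 is an assumption taken from the observed
    30 to 40 nt fused inserts; adjust it for other libraries.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    long_reads = sum(n for length, n in counts.items() if length >= threshold)
    return long_reads / total


def length_distribution_from_gzip(path):
    """Convenience wrapper for a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return length_distribution(handle)
```

Calling `length_distribution_from_gzip("trimmedGuides.fastq.gz")` on the file written by the command above would give the per-length counts for a plot or table, and `fraction_at_least(counts)` would give a single number per sample that could be tracked as the fused-gRNA metric.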
Hi Dario,
thanks for the cutadapt hint. Since we do not see multiple fused sgRNA sequences in our data (they seem to be a cloning issue), I have not included anything like that. We are currently testing a new implementation written in Rust, which seems to give an overall speed-up of 10-13x (even faster than C).
Interesting; I didn't know about Rust.
Implemented in 1.16BETA; it will be the new default, with a fallback to the old Perl scripts. The speed improvement is 14x, so now the slowest operation is the bowtie2 mapping :)
For more information regarding the FASTQ extraction and SAM extraction:
https://github.com/OliPelz/fastq_extractor_proof_of_principle/tree/master/extractor_in_RUST
Dear all, we are currently working on increasing the speed of the FASTQ data extraction and mapping. We expect a speed-up of 4-9x compared to the current implementation.
I will close this once it is implemented and released.
Best Jan