boutroslab / caRpools

Exploratory data analysis for pooled CRISPR/Cas9 screens
http://www.crispr-analyzer.de
21 stars 13 forks source link

Recommendation of Bowtie 2 Mapping Should Be Changed #6

Closed DarioS closed 7 years ago

DarioS commented 7 years ago

The vignette recommends the usage of Bowtie 2 to map gRNAs to the library sequences. However, there are a number of issues on Bowtie 2's tracker highlighting that it maps the sequences to the wrong reference. Also, the Bowtie 2 homepage states "Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.". It seems that Bowtie 2 wasn't intended for mapping to short reference sequences, such as those present in CRISPR experiments, and produces incorrect matches if used for that purpose. It's worthwhile to look inside the SAM file and check a few matches, like I have done, to convince yourself the Bowtie 2 mappings are often wrong for CRISPR analysis.

jwinter6 commented 7 years ago

Hi Dario, CaRpools was developed to use bowtie2 and checks each individual mapping for the cigar string within the same file. By making sure both the reference and the to-map NGS read have the same length, I force bowtie2 to a fixed length alignment and did not observe any mistakes in mapping using bowtie2 for our datasets so far. However I will check again. According to the bowtie2 issue that you started, you map a read to a reference that has a different length, which can be problematic as far as I experienced it. caRpools will always make sure it is the same length. Though bowtie2 was developed for long reads, it shows a great performance for our screens with this specific setting.

DarioS commented 7 years ago

Even if I use only 20 nucleotide sequences in the input file, Bowtie 2 is making mistakes. In the SAM output, please check the CIGAR strings which are unusually long (e.g. 3M1I6M1I3M1I2M1I2M). Those matches are typically incorrect when examined manually. For example, the sequence ATGATATAGACTTCATCTCT matches HGLibA_05931 with 2 mismatches when done manually. However, the alignment found by Bowtie 2 is:

Extracted          1 ATGATATAG--ACTTCATCTCT     20
                     ||| |.|||  |||.|..||| 
Bowtie2            1 ATG-TCTAGATACTCCTCCTC-     20

Hence, the weird CIGAR string.

The best match I found with my text editor's Find tool is:

Extracted          1 ATGATATAGACTTCATCTCT     20
                     .||||||||||||||||||.
VIM                1 TTGATATAGACTTCATCTCC     20

This demonstrates that Bowtie 2 could often do better. For the reads with CIGAR 20M, the mapping by Bowtie 2 is always correct.

jwinter6 commented 7 years ago

Hi Dario, thanks for your comment and your hints. Unfortunately I cannot find anything like you describe in the SAM files generated with Bowtie2 and extracted FASTQ files with our datasets.

What I wonder: do you also use 20 nt long sequences in your FASTA file you use for mapping?

Because the extracted sequence ATGATATAGACTTCATCTCT is not the same as the one it was aligned to ATGTCTAGATACTCCTCCTC, which is totally weird. The same is for your best match. There is only 2 bases difference (one in the beginning, one in the end). This is not the case when you design sgRNAs, they should never be that similar. So I wonder if you really mapped the right files.

Usually, I extract the sgRNA target sequence (20nt in length) from FASTQ files and use these to map it with bowtie2 (very-sensitive-local setting) to a Fasta library that also contains only 20nt long sgRNA target sequences (as proposed for caRpools).

What I get is the following (in case an alignment was found):

M01100:33:000000000-A9U6C:1:1101:19690:4920 0 ENSG00000105221_GCCAGACGAGAGGTCAGTCT 1 44 20M * 0 0 GCCAGACGAGAGGTCAGTCT GGGGGHGGGGGGGGHHHHHH AS:i:40 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:20 YT:Z:UU

which indicates a full match of all 20 nts (CIGAR String is 20M). I just checked some screening datasets and if there is an alignment, it is always a full match alignment of 20M - nothing else. Since I use the --very-sensitive-local flag, I don't know if there is a huge difference to the default.

For all the screening data I have at hand here, mapping using bowtie2 always gives great results without any weird mappings.

Maybe you have some issues with your sequencing, which results in sequencing errors on a single-basepair level. In case there are sgRNA that just differ by 2 bases (like for your second example), a non-fullmatch mapping is not used anyway, so it is not important if the alignment is not perfect. You are right that bowtie2 could do better in your case, but caRpools ignores mappings like this so that it should not influence data analysis. As I said, this never happened for any of our datasets I checked.

If you like you can write me an email to jan.winter@dkfz-heidelberg.de and we can discuss this more closer using your data.

Since this is a bit off-topic (though very interesting!), I close the issue.

Best Jan