cnobles / iGUIDE

Bioinformatic pipeline for identifying dsDNA breaks by marker based incorporation, such as breaks induced by designer nucleases like Cas9.
https://iguide.readthedocs.io/en/latest/
GNU General Public License v3.0
20 stars 9 forks source link

Read enrichment at chromosome centromere #81

Closed ShanSabri closed 3 years ago

ShanSabri commented 3 years ago

I notice that the iGUIDE alignment has a very strong enriched of reads at each chromosome's centromere. Is there a reason for this? How are theses regions filtered out and excluded from off-target events? I notice many of these reads have low MAPQ so I was wondering if there is some filtering step prior to computing on-/off-targent events.

CHR2 (on-target to left of centromere, off-target to right of centromere, strong enrichment at centromere): image

CHR3 (no on-/off-targets but strong enrichment at centromere): image

EDIT: Alignment is preformed using BWA with these params: -k 30 -w 2500 -P -L 25 -a

Also:

# Post-alignment filtering
maxAlignStart : 5
minPercentIdentity : 95
minTempLength : 30
maxTempLength : 2500
cnobles commented 3 years ago

That's a great find, @ShanSabri . I would venture to bet that these are poor alignments, as you said they have low MAPQ scores. I believe there is a parameter for bwa that allows you to filter results on MAPQ, you may want to take advantage of that. You are using the '-a' flag for bwa, which will return all alignments. So, imagine the sequence around the centromeres being poly-N, then you'll have a good alignment for a read, and bad alternative alignments at the centromere, so you'll have to adjust the filtering params to exclude these. Are the above images from the output alignments (bam files) or after the post alignment filtering?

cnobles commented 3 years ago

My way of approaching these multiple alignments is to make sure the alignments are valid / believable, and then if there are more than one, it would be a multihit. Any sequence can be aligned to multiple places of the genome, aligned well is a different story. The Post-alignment filtering is what I've used in the past, but I can see how that would still allow for poly-N alignments. You've pointed out that the MAPQ scores are low, but any secondary or alternative alignment will have a low MAPQ by definition of the metric. I'm trying to think of a way to exclude alignments in regions with N. Maybe you know of a way to do this, but I'll look into it a little bit, first thoughts are around masking the ref genome.

cnobles commented 3 years ago

Quick searches on bwa and ambiguity codes suggest that these may not be random poly-N alignments. If you drop the '-a' flag, do these alignments go away? You could even raise the clipping penalty (-L) too higher, sometimes I've used 99. These types of alignments do not need clipping to be applied, so it's preferred if the software did not consider clipped alignments. This is because we are removing the adapters prior to alignments, and therefore all we should have left is genomic sequence. But it's your call.

ShanSabri commented 3 years ago

Hi @cnobles - the images above are taken using the BAMS from process_data/align. I've merged the binned BAMs together to get these tracks. I pileup at the centromeres are actually low MAPQ and I can definitely filter these out by masking the genome and excluding these repetitive regions. Thanks for the insight!

I'll have to follow up on excluding -a.

ShanSabri commented 3 years ago

So it seem there are in fact a few loci that overlap with significant peaks (where reads are MAPQ > 30) but are not identified as iGUIDE targets. I'm not totally sure why this is the case. Any ideas?

example regions: image

EDIT: To clarify, this locus is not near a centromere. I've overlapped these peaks with all iGUIDE identified targets (extended +/- 100bp) and these do not overlap.

cnobles commented 3 years ago

Hi @ShanSabri, yes this is expected, and is linked to biological signal. We need to remember that this assay focuses on measuring DSBs through NHEJ of our oligos. Which means that if DSBs were caused by something other than the tested nuclease, or are off-target activity of the nuclease that is not captured by simply looking for a sequence homologous to our target site upstream of the identified DSB. There are many DSBs that occur during cell cycle, independent of RNA-guided nuclease activity. These are commonly repaired by the DNA repair pathways, but are also locations to be captured by the iGUIDE or any other GUIDEseq-like assay.

The location you picked was one that I recognize, the gene to the far right of your image (DAD1) I recall is very close to another gene locus that is not annotated in your image, the TRA locus, part of the T-cell receptor, and is the area of the genome (chr14) where VDJ recombination occurs. Any chance you have some T-cells in your sample?

cnobles commented 3 years ago

As a note, in the final report there is a circular Manhattan plot that shows the coverage density of different types of alignments. The most inner circle is all alignments where oligos were identified, then pileups, then sites where flanking alignments were identified, and finally the sights where gRNA similar sights were identified. If you get comfortable looking at these plots, it's pretty easy to pick out or scan for regions where there may be some off-target or non-nuclease associated activities in your sample.

ShanSabri commented 3 years ago

Hi @cnobles -- Thanks for getting back to me. Yes, there are T-cells in these samples. I'm looking at the Manhattan plots and do notice some pile ups on Chr14. Can you clarify exactly what filtering criteria is needed for reads to be assigned to each ring? It it correct to assume the the outer rings are a subset to the inner rings? I see that there are some cases where there are no flanking pairs (blue ring) but there are matched alignments (red ring) at a given locus. The matched alignments are those that are quantified in the iGUIDE output, correct?

image image image

cnobles commented 3 years ago

Correct on your last point, that only the red outside ring are the sites considered to be on / off-target for iGUIDE.

Each ring (after the inner ring) is a subset of the inner ring. For the pileup ring (green), there needs to be distinct alignment pileups of at least 3 (default, can change in configuration) to be considered. For the flanking pairs ring (blue), there needs to be alignments found on opposite strands, upstream from one another, and within 2x the distance considered for gRNA site matching (default is 100 bp, so within 200bp of each other). For the gRNA matched ring (red), alignments need to have a similar gRNA sequence (within mismatch error, also configurable) within a certain distance upstream of the identified site (this distance is also configurable). Each of these factors can be modified in the config file, but also have explanations in the documentation.

All of the information can be reviewed in the output data as well, let me know if there is something specific you are looking for and I can try to direct you towards it.

These additional methods of looking for common DSB sites within the sample are a way to assess if the conventional gRNA matching methods are inclusive of all "true" off-targets. I would recommend looking into a few of the more prominent pileups and flanking paired regions to see if there could potentially be nuclease activity or if biology may explain these. Another good way to determine if these sites are true biological signal is to observe them over several different experimental samples.

ShanSabri commented 3 years ago

Thanks so much for all the info!