bmvdgeijn / WASP

WASP: allele-specific pipeline for unbiased read mapping and molecular QTL discovery
Apache License 2.0
103 stars 51 forks source link

Many reads discarded due to remapping with different CIGAR #101

Closed hrayjones closed 4 years ago

hrayjones commented 4 years ago

Hello,

I have a query about the filter_remapped_reads part of the mapping pipeline. Of my discarded reads after remapping, the vast majority were remapped with a different cigar. For example, "discard_reads: 2182 (of which 1798 remapped with a different cigar)".

  1. Is the cigar taken from the "MD" optional flag in the bam file? I can see that this varies after flipping the allele and remapping. Usually, I can see one base that differs from reference; presumably this is my SNP. My cigar string in the required bam fields seems to remain invariant (e.g. 150M).

  2. If so, I am trying to understand why the pipeline filters reads based on a change in the MD flag as well as read position? As surely, after flipping the allele there will usually be one base mismatch in the alignment. I can see that the position of the read doesn't change, and the flag still seems to add up to the correct number of bases (e.g. MD:Z:150 becomes MD:Z:94A55). I can only think that this has the aim of completely removing potential mapping bias at the cost of reads.

I'd really appreciate if you could help my understanding of this.

Many thanks, Helen

gmcvicker commented 4 years ago

The CIGAR strings are obtained from aligned reads using the cigarstring attribute in pysam: https://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment.cigarstring

As far as I know these are not taken from the MD flag in the BAM file.

You are correct that introducing other alleles will typically introduce an alignment mismatch, however CIGAR represents both mismatches and matches with an 'M'.

It is safer to remove reads that have different CIGAR strings, because they may contain an indel or clipping that causes one allele to be missed, when counting which alleles are present.

Hope this helps,

Graham

hrayjones commented 4 years ago

Hi Graham,

Thank you for your response; that makes perfect sense.

In this case, though, I am still confused as to why the majority of my re-mapped reads are thrown out based the cigar string. To try to figure it out, I generated a set of discarded reads by filtering the "remapped.bam" file based on the IDs not found in the "keep.bam" file (since as far as I can tell, WASP does not produce a "discarded.bam" file). I obtained the cigars of a random subset of the discarded reads and they all matched, both to the original bam file and the to.remap.bam file. The coordinates also matched. I did find, however, that the to.remap.bam file would not load into R using Rsamtools scanBam - I get the error "failed to open BamFile: 'filename' is not a BAM file". However, based on the fields in the file I cannot see why it would not be a real bam file.

My data may be an unusual case because it is Hi-C data and therefore the paired reads do not always map to the same location in the genome. However, I have removed the possibility of chimeric reads by using a program HiCUP, which truncates the reads at the HiC ligation junction. Therefore, my paired end reads should behave as any other (ATAC seq, ChIP seq). One thing I have noted is that, prior to filter_remapped_reads.py, the remapped bam file is sorted by coordinate. This means that my pairs no longer appear one after the other in the file. Can this possibly be causing an issue with filter_remapped_reads.py?

Thanks very much for your help so far; please let me know if you have any insights on this.

Best wishes, Helen

hrayjones commented 4 years ago

Hi Graham,

Just an update - I figured out the reason why my reads are discarded. It is because, for some reason, when my data are re-mapped the "read 1" and "read 2" designations can be swapped around in comparison with the first mapping. This is reflected in the bitwise FLAG that pysam uses to assign R1 and R2. Therefore, now the CIGAR of R1 matches that of the previous R2 and vice versa.

Many thanks, Helen

gmcvicker commented 4 years ago

Hi Helen,

Thank you for the update. I meant to reply to your message, but had still not figured out what the issue was. I'm glad you figured it out.

Graham

gbloeb commented 3 years ago

@hrayjones Did you figure out why the read 1 and read 2 designations were being swapped and a work-around? I'm having the same issue.
Thanks! Gabe

hrayjones commented 3 years ago

@gbloeb Apologies for not replying sooner - I think the script has been updated to allow for the possibility of reads 1 and 2 being swapped around (thank you @gmcvicker !)

Cheers, Helen