Add support for paired-end read alignment

Hi MSG team,

As mentioned in the previous issues (#46), I am using paired-end sequences. The alignment step treats these sequences as single-end sequences. Alignment accuracy is increased when considering them truly paired-end instead of single-end. As paired-end sequencing is now common, it would be relevant to take advantage of this and adapt the pipeline.

When playing with the pipeline, I identified scripts that required updates to achieve this. As I was using bwa mem, I added -p option to the bwa mem lines. https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/msg.pl#L312 https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/parse_and_map.py#L626 https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/parse_and_map.py#L632

This was just adapted to my needs. It will require more work if implemented: the other methods will need to be adapted too and single-end read alignment will need be retained.

The uses of paired-end alignment option has also consequences on the flags assigned to reads in SAM files. Instead of the 0, 4 and 16, a lot more flags must be considered. The only script that I think is affected is extract-ref-alleles.py. Below are modifications I had to make. These are compatible with single-end reads as 0 and 16 are retained.

https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/extract-ref-alleles.py#L285

should be replaced with:

fwd_flags = set([0,65,67,73,75,97,99,105,107,129,131,137,139,161,163,169])
rev_flags = set([16,81,83,89,91,113,115,121,123,145,147,153,155,177,179,185,187])
omit_flags = fwd_flags.union(rev_flags) #I believe these are actually "don't omit" flags

https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/extract-ref-alleles.py#L332 should be replaced with:

if (par2_flag in fwd_flags and par1_flag in rev_flags) or (par2_flag in rev_flags and par1_flag in fwd_flags):

https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/extract-ref-alleles.py#L578-L579 should be replaced with:

fwd_flags = set([0,65,67,73,75,97,99,105,107,129,131,137,139,161,163,169])
rev_flags = set([16,81,83,89,91,113,115,121,123,145,147,153,155,177,179,185,187])
revComp_flag = 0
if (sim_read.flag in fwd_flags and read.flag in rev_flags) or (sim_read.flag in rev_flags and read.flag in fwd_flags):

I understand that this would require some efforts to upgrade the pipeline and is of a lesser priority compared to the previous issue. Anyway, if work is done on this aspect, I would be happy to make tests if needed.

JaneliaSciComp / msg

Add support for paired-end read alignment #47