JaneliaSciComp / msg

Multiplexed Shotgun Genotyping
http://genomics.princeton.edu/AndolfattoLab/MSG.html
11 stars 12 forks source link

Add support for paired-end read alignment #47

Closed fdchevalier closed 2 years ago

fdchevalier commented 2 years ago

Hi MSG team,

As mentioned in the previous issues (#46), I am using paired-end sequences. The alignment step treats these sequences as single-end sequences. Alignment accuracy is increased when considering them truly paired-end instead of single-end. As paired-end sequencing is now common, it would be relevant to take advantage of this and adapt the pipeline.

When playing with the pipeline, I identified scripts that required updates to achieve this. As I was using bwa mem, I added -p option to the bwa mem lines. https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/msg.pl#L312 https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/parse_and_map.py#L626 https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/parse_and_map.py#L632

This was just adapted to my needs. It will require more work if implemented: the other methods will need to be adapted too and single-end read alignment will need be retained.

The uses of paired-end alignment option has also consequences on the flags assigned to reads in SAM files. Instead of the 0, 4 and 16, a lot more flags must be considered. The only script that I think is affected is extract-ref-alleles.py. Below are modifications I had to make. These are compatible with single-end reads as 0 and 16 are retained.

https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/extract-ref-alleles.py#L285

should be replaced with:

fwd_flags = set([0,65,67,73,75,97,99,105,107,129,131,137,139,161,163,169])
rev_flags = set([16,81,83,89,91,113,115,121,123,145,147,153,155,177,179,185,187])
omit_flags = fwd_flags.union(rev_flags) #I believe these are actually "don't omit" flags

https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/extract-ref-alleles.py#L332 should be replaced with:

if (par2_flag in fwd_flags and par1_flag in rev_flags) or (par2_flag in rev_flags and par1_flag in fwd_flags):

https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/extract-ref-alleles.py#L578-L579 should be replaced with:

fwd_flags = set([0,65,67,73,75,97,99,105,107,129,131,137,139,161,163,169])
rev_flags = set([16,81,83,89,91,113,115,121,123,145,147,153,155,177,179,185,187])
revComp_flag = 0
if (sim_read.flag in fwd_flags and read.flag in rev_flags) or (sim_read.flag in rev_flags and read.flag in fwd_flags):

I understand that this would require some efforts to upgrade the pipeline and is of a lesser priority compared to the previous issue. Anyway, if work is done on this aspect, I would be happy to make tests if needed.

dstern commented 2 years ago

Unfortunately, we are not currently updating this code and are not maintaining compatibility with newer versions of dependencies. You are welcome to branch the repo and make changes suitable with your needs. In practice, we find that single end reads are more than sufficient to infer ancestry in most cases.