Closed FrickTobias closed 4 years ago
(First has to rebase, working on that)
This sounds a bit like whatshap haplotag, is that similar?
Yes, but it's not too specific on how it handles any conflicts if the molecule phase is not the same as read phase, or what happens if one molecule has support for multiple haplotypes.
When using 10X Genomics BAM files, haplotag reads the BX tags and per default assigns reads that belong to the same read cloud to the same haplotype. This feature can be switched off using the --ignore-linked-read flag.
I also believe Pontus had looked at this a bit, but hadn't been able to run it on our data.
I also believe Pontus had looked at this a bit, but hadn't been able to run it on our data.
I have tried this again and now I can make it work. I will integrate this into the pipe and then add the custom phasebam
script. The custom script has more flexibility for us and also works for BAM phased using a reference VCF (unlike haplotag
).
Closing this in favor of the updated PR
This introduces the submodule
blr phasebam
which phases BAM files based on the phased variants by giving them aHP
(haplotype, 1 or 2) andPS
(phase set, int) tag. Here all haplotype 1 or 2 will be the same continuous stretch of DNA as long as they are in the same phase set, where a new phase set could mean 1 and 2 has switched.Currently it seems to perform around what I've seen Longranger perform on similar data, meaning that around 50% of the final reads. Notably it does not handle phased SVs yet (including 1 bp deletions/insertions), but a quick counting shows SNVs represent ~93% of the total amount of variants so it should be using most of the information. Of course, this should probably be implemented in the future though.
It's also worth to point out the testdata was chr22 on the 191013 analysis dataset (see UPPMAX).
TL;DR:
blr phasebam
HP
andPS
tag intophase.bam
filelen(ref) != len(var)
Results
NB: The HapCUT2 file contains phasing results from full dataset, hence the high amount of variants.