AfshinLab / BLR

MIT License
5 stars 0 forks source link

Phasebam #3

Closed FrickTobias closed 4 years ago

FrickTobias commented 4 years ago

This introduces the submodule blr phasebam which phases BAM files based on the phased variants by giving them a HP (haplotype, 1 or 2) and PS (phase set, int) tag. Here all haplotype 1 or 2 will be the same continuous stretch of DNA as long as they are in the same phase set, where a new phase set could mean 1 and 2 has switched.

Currently it seems to perform around what I've seen Longranger perform on similar data, meaning that around 50% of the final reads. Notably it does not handle phased SVs yet (including 1 bp deletions/insertions), but a quick counting shows SNVs represent ~93% of the total amount of variants so it should be using most of the information. Of course, this should probably be implemented in the future though.

It's also worth to point out the testdata was chr22 on the 191013 analysis dataset (see UPPMAX).

TL;DR:

Results

NB: The HapCUT2 file contains phasing results from full dataset, hence the high amount of variants.

============================================================
STATS SUMMARY - blr.cli.phasebam
------------------------------------------------------------
HapCUT2 file: Phase blocks                             6,696
HapCUT2 file: Non-phased entries                     150,020
HapCUT2 file: Usable phased sites                  4,410,941
HapCUT2 file: Pruned entries                         120,337
HapCUT2 file: Phased SVs (not used)                    2,174
Total reads                                        4,100,134
Phased reads with molecule info                      298,691
Reads with equal support for both haplotypes          11,051
Phased reads without molecule info                   189,723
Reads not fitting called alleles                      48,676
Moleules with equal support for haplotypes             4,544
Reads with no phasing info                         1,942,739
Read phased from only mol info                     1,392,612
Total reads phased                                 2,086,361
Concordant read/mol phasing                          401,423
Discordant read/mol phasing                           71,034
Read phased from read (no mol info)                  292,326
============================================================

SNV length distribution

FrickTobias commented 4 years ago

(First has to rebase, working on that)

marcelm commented 4 years ago

This sounds a bit like whatshap haplotag, is that similar?

FrickTobias commented 4 years ago

Yes, but it's not too specific on how it handles any conflicts if the molecule phase is not the same as read phase, or what happens if one molecule has support for multiple haplotypes.

When using 10X Genomics BAM files, haplotag reads the BX tags and per default assigns reads that belong to the same read cloud to the same haplotype. This feature can be switched off using the --ignore-linked-read flag.

I also believe Pontus had looked at this a bit, but hadn't been able to run it on our data.

pontushojer commented 4 years ago

I also believe Pontus had looked at this a bit, but hadn't been able to run it on our data.

I have tried this again and now I can make it work. I will integrate this into the pipe and then add the custom phasebam script. The custom script has more flexibility for us and also works for BAM phased using a reference VCF (unlike haplotag).

FrickTobias commented 4 years ago

Closing this in favor of the updated PR