Phasebam updated - Githubissues

This is an updated version of the PR Tobias posted (#3) in which i have done some edits. Some are rather substantial so i decided to open a new PR.

The basis is the same script but include several edits

In get_phased_snvs_hapcut2_format i have factored out PhaseBlockReader for reading the hapcut2 phaseblock files and also made a class PhasedVariant to handle the variants more easily.
I changed the filters so that currently only SNVs are used for phasing.
translate_positions() have been changed to use pysam's get_aligned_pairs(). I found that in some cases variants at the edges of reads were mislabeled with haplotypes in the PR #3 script but it was hard to find the error using the get_blocks()method so I switched to correct the issue.
phase_read() now uses the base quality for scoring haplotypes.
I added an option for how to prioritise which haplotype to adding to a read when you have contradicting information from the barcode (molecule) and read level. Previously this was based on consensus meaning that no reads carrying contradicting haplotype information would be tagged. Now I have changed in to tag based on read information by default, based on that this also the case in whatshap haplotag but it could be something to test further.

I also compared our tagging to the output of whatshap haplotag and found the following:

phasebam uses the assigned molecule index (tag MI) to distribute haplotype over molecules while whatshap has to recalculate the 'read clouds' (or molecules as we refer to them). This means that some of the filtration using filterclusters is no longer effective for the whatshap tool as the filtration keeps the barcode but changes the MI tag for -1. Thus some reads within barcodes that we don't want to tag will still be tagged. One easy way to solve this would be to remove these barcodes entirely in the filterclusters step.
whatshap can handle other variants than SNVs.
whatshap have two mode to identify which haplotype a read belongs to. Possibly we should think about implementing one of them in our script.
- If a reference is supplied then realignment of the read sequence to the reference is performed.
- If no reference is given the cigar string is used to define the variant in the read.
pruned variants are still used by whatshap for tagging.

Comparison of % reads tagged

Tool	%tagged
phasebam no_filter	49.4
phasebam default	48.3
haplotag 30k	47.0
haplotag 50k	48.9

_phasebam: "nofilter" means run using --include-pruned --min-switch-error 0 --min-missmatch-error 0. "default" means run using default settings haplotag: Run using two different values for --linked-read-distance-cutoff "30k" for 30000 (same as in default blr configs) and "50k" for 50000 (haplotag default).

Really awesome job, really like the pysam module you found!

Some comments;

I'd be a bit careful regarding mislabeling, where my idea is if we do not know it is correct to label a read in one way one should be careful to do it. But of course it is better if it can prioritize it somehow but then we should probably look into the relevant reads to understand what the effects are, so things like

How many reads are affected?
Manual inspection of some randomly chosen reads.
Do they seem to be allocated to specific kinds of regions?

I think we've run into enough small problems regarding trying not to remove information so maybe it is better to instead remove all reads/information we do not think is relevant. So for me it would be OK to start removing duplicates/tags etc and instead keep a "non-stripped" file for which the later parts of the analysis can be rerun on. What do you think?

With the new pysam module and the VCF file information we should be able to incorporate some phasing from variants without a crazy amount of hassle right?

Although it would be nice to be able to run with a reference, I think it should be lower on the priority list. If you are interested in phasing something I think it would be reasonable to assume most people won't have a reference for that individual already.

I have previously not found any information about 'whatshap haplotag' using molecule information at all and is rather working with only read information, have you found out it does use this information?

@FrickTobias

I'd be a bit careful regarding mislabeling, where my idea is if we do not know it is correct to label a read in one way one should be careful to do it. But of course it is better if it can prioritize it somehow but then we should probably look into the relevant reads to understand what the effects are, so things like

How many reads are affected?

In the chr22 dataset, for about 7% of the reads having both read and molecule phase info, the phasing info is contradictory.

Manual inspection of some randomly chosen reads.

I looked into some cases where the read and molecule haplotypes were different and they made sense based on how the program currently opperates and with respect to the variants overlapped by the read for those barcodes.

Do they seem to be allocated to specific kinds of regions?

Not really. I binned the discordant read position along with the phased variants and total coverage and they all overlap pretty well (See figure below).

I think it will be hard to find a the best way to label these reads. How do you weigh read and molecule evidence against eachother? READ priority is what is preformed by whatshap haplotag but I am not sure why. On a read level this makes sense since the variants for that read is clearly matched to a particular haplotype. However the fact that the read has a different haplotype to the molecule is a sign that it either (1) was not assigned to the correct barcode/molecule or (2) we have a switch error for this variant. I am also uncertain of the actual effects this will have downstream i.e. how the haplotype is used for SV calling. Possibly we just have to try the different option out and see how they affect the results. For now however I guess the safest choice is to remove these entirely from the analysis (e.g running CONSENSUS mode). Also as it affects only ~7% of reads with both read and mol haplotype, it is a relatively small loss.

I think we've run into enough small problems regarding trying not to remove information so maybe it is better to instead remove all reads/information we do not think is relevant. So for me it would be OK to start removing duplicates/tags etc and instead keep a "non-stripped" file for which the later parts of the analysis can be rerun on. What do you think?

Yeah I think this is probably a good idea. It seams like we keep running into issues like this and then is just simpler to clear out the information altogether.

With the new pysam module and the VCF file information we should be able to incorporate some phasing from variants without a crazy amount of hassle right?

I am not entirely sure what you mean here? Do you mean to use the phased VCF rather than the HAPCUT2 file for the variants?

Although it would be nice to be able to run with a reference, I think it should be lower on the priority list. If you are interested in phasing something I think it would be reasonable to assume most people won't have a reference for that individual already.

Yeah I also think this might be too much work for now.

I have previously not found any information about 'whatshap haplotag' using molecule information at all and is rather working with only read information, have you found out it does use this information?

It does not use molecule information, rather the BX tag and a window of a desired size (much like in buldmolecules).

Alright, looks good! Although 7% is quite a lot I'm not sure what else we can do currently, other than removing the information.

What do you think about saving the stripped read names or positions to a separate file or something where we strip the information? Then we could also look into that to see if we can later correlate it to switch error rates, or at least investigate if they seem related. This way we could strengthen our idea about stripping the information.

Otherwise everything looks good!

@FrickTobias I am not sure if 7 % percent is that much in this case. Note that these are the reads with both phasing information from the molecule and the read which in turn make up about ~12% of all reads.

Regarding storing the read names, you implemented this "anomaly" file in your script which I have kept. This outputs a TSV file which has the read names of the remove reads. They are marked with mol_hap!=read_hap in the second column of this file. I guess this should be sufficient?

AfshinLab / BLR

Phasebam updated #4