AfshinLab / BLR

MIT License
4 stars 0 forks source link

Phasebam updated #4

Closed pontushojer closed 4 years ago

pontushojer commented 4 years ago

This is an updated version of the PR Tobias posted (#3) in which i have done some edits. Some are rather substantial so i decided to open a new PR.

The basis is the same script but include several edits

I also compared our tagging to the output of whatshap haplotag and found the following:

Comparison of % reads tagged

Tool %tagged
phasebam no_filter 49.4
phasebam default 48.3
haplotag 30k 47.0
haplotag 50k 48.9

_phasebam: "nofilter" means run using --include-pruned --min-switch-error 0 --min-missmatch-error 0. "default" means run using default settings haplotag: Run using two different values for --linked-read-distance-cutoff "30k" for 30000 (same as in default blr configs) and "50k" for 50000 (haplotag default).

FrickTobias commented 4 years ago

Really awesome job, really like the pysam module you found!

Some comments;

I'd be a bit careful regarding mislabeling, where my idea is if we do not know it is correct to label a read in one way one should be careful to do it. But of course it is better if it can prioritize it somehow but then we should probably look into the relevant reads to understand what the effects are, so things like

I think we've run into enough small problems regarding trying not to remove information so maybe it is better to instead remove all reads/information we do not think is relevant. So for me it would be OK to start removing duplicates/tags etc and instead keep a "non-stripped" file for which the later parts of the analysis can be rerun on. What do you think?

With the new pysam module and the VCF file information we should be able to incorporate some phasing from variants without a crazy amount of hassle right?

Although it would be nice to be able to run with a reference, I think it should be lower on the priority list. If you are interested in phasing something I think it would be reasonable to assume most people won't have a reference for that individual already.

I have previously not found any information about 'whatshap haplotag' using molecule information at all and is rather working with only read information, have you found out it does use this information?

pontushojer commented 4 years ago

@FrickTobias

I'd be a bit careful regarding mislabeling, where my idea is if we do not know it is correct to label a read in one way one should be careful to do it. But of course it is better if it can prioritize it somehow but then we should probably look into the relevant reads to understand what the effects are, so things like

  • How many reads are affected?

In the chr22 dataset, for about 7% of the reads having both read and molecule phase info, the phasing info is contradictory.

  • Manual inspection of some randomly chosen reads.

I looked into some cases where the read and molecule haplotypes were different and they made sense based on how the program currently opperates and with respect to the variants overlapped by the read for those barcodes.

  • Do they seem to be allocated to specific kinds of regions?

Not really. I binned the discordant read position along with the phased variants and total coverage and they all overlap pretty well (See figure below).

image

I think it will be hard to find a the best way to label these reads. How do you weigh read and molecule evidence against eachother? READ priority is what is preformed by whatshap haplotag but I am not sure why. On a read level this makes sense since the variants for that read is clearly matched to a particular haplotype. However the fact that the read has a different haplotype to the molecule is a sign that it either (1) was not assigned to the correct barcode/molecule or (2) we have a switch error for this variant. I am also uncertain of the actual effects this will have downstream i.e. how the haplotype is used for SV calling. Possibly we just have to try the different option out and see how they affect the results. For now however I guess the safest choice is to remove these entirely from the analysis (e.g running CONSENSUS mode). Also as it affects only ~7% of reads with both read and mol haplotype, it is a relatively small loss.

I think we've run into enough small problems regarding trying not to remove information so maybe it is better to instead remove all reads/information we do not think is relevant. So for me it would be OK to start removing duplicates/tags etc and instead keep a "non-stripped" file for which the later parts of the analysis can be rerun on. What do you think?

Yeah I think this is probably a good idea. It seams like we keep running into issues like this and then is just simpler to clear out the information altogether.

With the new pysam module and the VCF file information we should be able to incorporate some phasing from variants without a crazy amount of hassle right?

I am not entirely sure what you mean here? Do you mean to use the phased VCF rather than the HAPCUT2 file for the variants?

Although it would be nice to be able to run with a reference, I think it should be lower on the priority list. If you are interested in phasing something I think it would be reasonable to assume most people won't have a reference for that individual already.

Yeah I also think this might be too much work for now.

I have previously not found any information about 'whatshap haplotag' using molecule information at all and is rather working with only read information, have you found out it does use this information?

It does not use molecule information, rather the BX tag and a window of a desired size (much like in buldmolecules).

FrickTobias commented 4 years ago

Alright, looks good! Although 7% is quite a lot I'm not sure what else we can do currently, other than removing the information.

What do you think about saving the stripped read names or positions to a separate file or something where we strip the information? Then we could also look into that to see if we can later correlate it to switch error rates, or at least investigate if they seem related. This way we could strengthen our idea about stripping the information.

Otherwise everything looks good!

pontushojer commented 4 years ago

@FrickTobias I am not sure if 7 % percent is that much in this case. Note that these are the reads with both phasing information from the molecule and the read which in turn make up about ~12% of all reads.

Regarding storing the read names, you implemented this "anomaly" file in your script which I have kept. This outputs a TSV file which has the read names of the remove reads. They are marked with mol_hap!=read_hap in the second column of this file. I guess this should be sufficient?