bluenote-1577 / flopp

flopp is a software package for single individual haplotype phasing of polyploid organisms from long read sequencing.
33 stars 7 forks source link

Interpret flopp results #3

Closed OmarOakheart closed 2 years ago

OmarOakheart commented 2 years ago

Hi,

I'm trying to understand how to interpret results produced by flopp, I input a bam file of minion long reads and a vcf file based on short reads of the same sample (diploid):

$floppPath -b $bamFile -v $vcfFile -p $ploidy -o $outPath -t $threads

I'm a little confused by some of the results, the first few lines are:

**NC_001224.1** 1:92 1 1 0:30|1:30 0:2|1:23

2:116 1 1 0:77|1:7 0:32|1:5

3:125 1 1 0:37|1:60 0:14|1:28

4:172 1 1 0:1|1:98 1:44

Should I interpret that, for the first four heterozygous SNPs at positions 92, 116, 125 and 172 both haplotypes are the same, always the alternate allele?

In the fourth and fifth column for the second line, am I correctly understanding that for the first haplotype block (column 4), there are 77 reads supporting the reference allele (0) and only 7 reads supporting the alternate allele? If that's the case, any idea why the prediction is 1 in the 2nd column instead of 0? Some of the other lines and columns are confusing me for the same reason.

Finally, I noticed that some lines have a -1 and NAs, what does this mean? Here are some examples:

25:2535 -1 -1 0:72 0:26 66:16861 -1 -1 NA NA 149:20370 -1 -1 0:45 0:3 198:24969 -1 -1 0:10 0:2 277:45307 -1 -1 0:65 0:125 417:75412 -1 -1 0:62 0:117 421:75843 -1 -1 0:11 0:21 1:5802 -1 -1 NA NA 2:5803 -1 -1 NA NA 7:11554 -1 -1 NA NA

Best, Omar

bluenote-1577 commented 2 years ago

Hi Omar,

Thanks for raising this issue. Your interpretation is correct, and some of the results are indeed strange. What is happening here is probably something about genotyping constraints with the -v option (and possibly some bugs).

With the -v option, what happens is that the haplotypes are constrained to have the same genotype as in the vcf. For example, if your vcf states that the allele at 2:116 is 1/1, then flopp will force the output haplotypes to have the same genotype, even if the resulting mappings don't agree with the VCFs genotyping. I can see this being an issue if you used a different sample to generate the vcf and not the current sample you're haplotyping with.

I am not sure what is happening at the 277:45307 -1 -1 allele though.

What's the VCFs genotype at 2:116? Similarly 277:45307, could you provide the VCF information for this allele?

To circumvent the genotype constraints, you can use the -c option as a drop-in replacement for the -v option. This does not force the output haplotypes to have the same genotypes as in the VCF file.

Thanks,

Jim

OmarOakheart commented 2 years ago

Ah right, I understand, so flopp will take the VCF at its word even if the bam contradicts it. That's definitely what seems to be happening here:

NC_001224.1 116 . G C 4450.06 . AC=2;AF=1.00;AN=2;BaseQRankSum=2.330;DP=113;ExcessHet=0.0000;FS=6.635;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=24.57;ReadPosRankSum=-0.838;SOR=1.040 GT:AD:DP:GQ:PL 1/1:6,106:112:67:4464,67,0

My short reads and long reads were generated from the same sample, would you recommend using a VCF based on the short reads as input to flopp with the "-c" option, since (I believe) short read variant calls are usually more accurate, or would you recommend ignoring the short read data and variant calling the mapped long reads using longshot and using that as input without the "-c" option? (Or a third option I haven't thought of?)

Here are some lines with "-1":

NC_001134.8 29377 . A C 1637.06 . AC=2;AF=1.00;AN=2;DP=47;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=25.84;SOR=1.347 GT:AD:DP:GQ:PL 1/1:0,41:41:99:1651,123,0 NC_001134.8 30013 . T C 54.32 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=30.44;QD=27.16;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:66,6,0 NC_001134.8 35001 . A G 63.32 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.58;QD=31.66;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:75,6,0 NC_001134.8 35485 . C T 93.14 . AC=2;AF=1.00;AN=2;DP=4;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=22.57;QD=23.29;SOR=3.258 GT:AD:DP:GQ:PL 1/1:0,4:4:12:107,12,0 NC_001134.8 35840 . G A 52.32 . AC=2;AF=1.00;AN=2;DP=6;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=26.16;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:64,6,0

Maybe just caused by these variants existing in the short read VCF but not found in the long reads?

bluenote-1577 commented 2 years ago

Ah yes, I see why -1 -1 is being called now. So -1 is supposed to be called as the consensus allele when no reads cover the allele. NA means that no reads cover the allele. But I was confused why -1 -1 is being called when reads cover the allele.

It turns out it's because the genotypes for the variants at 29377, 30013, etc are 1/1 (in your vcf), but the long-reads are apparently all 0 at the allele; so flopp assigns -1 -1 instead (I don't think it was intended behavior on my end). The alleles are being mapped to, since the column for example 0:65 0:125 implies that the reference allele (0) is found at the site.

So yeah, if your entire VCF has genotypes 1/1 then flopp will just output 1/1 for every allele except if only the 0 allele is found at the site, for which it outputs -1 -1.

As far as variant calling I've had good experiences with longshot for long-read calling, but I'm sure short-read variant calling will perform good as well. Both should work.

If you do variant calling and then phasing with the same sample, I would use the "-v" option. The "-c" option disregards the genotyping in the vcf, so you can also try to use the "-c" option with your current VCF, which (correct me if I'm wrong) seems to generated with another sample or have genotypes mislabelled (it seems like everything is 1/1).

Also, as an aside, if you're doing diploid phasing flopp will work, but I think specially designed diploid phasers like HapCUT2 or WhatsHap may be more suitable?

OmarOakheart commented 2 years ago

I think everything in my example has been 1/1 just because this is a telomeric region and long reads are better at mapping there than short reads, so the variation is diluted in short reads and appears non-existent but long reads reveal that some positions should actually be considered heterozygous. There are SNPs labelled as 0/1 later on in the VCF file. That's a pretty interesting blind spot I'll have to keep in mind in the future though.

So that means a -1 -1 shouldn't be considered to be a break in contiguity. I'll start using the "-c" option, and I think it would be interesting in the future to compare the results between both approaches, with short reads x long reads and with only long reads.

I'm just using a diploid example here since I only wanted to test flopp using a simple case to understand how to interpret its output. Would definitely use a more specialized tool if diploid phasing was the end goal since there are a lot of tricks I'd expect diploid phasing to take advantage of.

I developed nPhase and have been working on a review of polyploid phasing methods. A discussion of some of the trends in how to approach the problem and about the performance metrics and datasets that have been used, which I've felt have been inconsistent and would benefit everyone if standardized (I enjoyed reading some of the thoughts on that in your latest paper by the way!). So I think it would be really interesting for the review to end with sort of an open call for the community to work together on maintaining and improving a gold standard benchmarking dataset and associated performance metrics. I'm finishing up a v1.0 of what that might look like and how it might function and will include flopp in that initial benchmarking set, hopefully once it's done you'll be one of the people to find that interesting and worth improving

Best, Omar

OmarOakheart commented 2 years ago

I tested with -c, and I'd just like some clarification on one of the results:

3:125 1 1 1 0:36|1:25 0:6|1:19 0:15|1:58

NC_001224.1 125 . T C 8225.36 . AC=3;AF=1.00;AN=3;BaseQRankSum=2.045;DP=204;FS=3.805;MLEAC=3;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=29.13;ReadPosRankSum=-2.353;SOR=0.594 GT:AD:DP:GQ:PL 1/1/1:6,192:198:97:8241,657,97,0

Here we have a tiny amount more support for 0 in the haplotype (0:36|1:25), but the output is still 1. Any insight into what's allowing the 0 to overcome the coverage disadvantage?

bluenote-1577 commented 2 years ago

I think everything in my example has been 1/1 just because this is a telomeric region and long reads are better at mapping there than short reads, so the variation is diluted in short reads and appears non-existent but long reads reveal that some positions should actually be considered heterozygous. There are SNPs labelled as 0/1 later on in the VCF file. That's a pretty interesting blind spot I'll have to keep in mind in the future though.

So that means a -1 -1 shouldn't be considered to be a break in contiguity. I'll start using the "-c" option, and I think it would be interesting in the future to compare the results between both approaches, with short reads x long reads and with only long reads.

I'm just using a diploid example here since I only wanted to test flopp using a simple case to understand how to interpret its output. Would definitely use a more specialized tool if diploid phasing was the end goal since there are a lot of tricks I'd expect diploid phasing to take advantage of.

I developed nPhase and have been working on a review of polyploid phasing methods. A discussion of some of the trends in how to approach the problem and about the performance metrics and datasets that have been used, which I've felt have been inconsistent and would benefit everyone if standardized (I enjoyed reading some of the thoughts on that in your latest paper by the way!). So I think it would be really interesting for the review to end with sort of an open call for the community to work together on maintaining and improving a gold standard benchmarking dataset and associated performance metrics. I'm finishing up a v1.0 of what that might look like and how it might function and will include flopp in that initial benchmarking set, hopefully once it's done you'll be one of the people to find that interesting and worth improving

Best, Omar

I see. I've played with nPhase before (and e-mailed you in the past, a while back). Thanks for your kind words! Looking forward to your review! Very excited to read about it; do let me know when it comes out :).

I tested with -c, and I'd just like some clarification on one of the results:

3:125 1 1 1 0:36|1:25 0:6|1:19 0:15|1:58

NC_001224.1 125 . T C 8225.36 . AC=3;AF=1.00;AN=3;BaseQRankSum=2.045;DP=204;FS=3.805;MLEAC=3;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=29.13;ReadPosRankSum=-2.353;SOR=0.594 GT:AD:DP:GQ:PL 1/1/1:6,192:198:97:8241,657,97,0

Here we have a tiny amount more support for 0 in the haplotype (0:36|1:25), but the output is still 1. Any insight into what's allowing the 0 to overcome the coverage disadvantage?

Ah, I think you've found a bug in the code. I think I'm able to pinpoint the error and have pushed a change; this should also improve the quality of the results slightly. Hopefully this fixes the issue; it does on my test dataset.

(If you're curious, the issue comes from forgetting to update the consensus alleles after de-duplicating reads within flopp's local blocks. So the allele in your case was 1 before de-duplication, but after removing duplicated reads it's 0, which is what the column correctly states.)

OmarOakheart commented 2 years ago

Just went through my emails, that's right, you're the reason I added the "nphase partial" mode of running! I've ended up finding it very useful for my own purposes as well so thank you again for the suggestion

That was a lightning fast bugfix, it fixed it on my end as well now that I've updated flopp.

3:125 0 1 1 0:36|1:25 0:6|1:19 0:15|1:58

bluenote-1577 commented 2 years ago

I've updated the readme with more information about interpreting flopp's results and the bugfix seemed to have worked; closing.