HudsonAlpha / fmlrc2

Apache License 2.0
43 stars 5 forks source link

Do heterozygous variants remain in long reads after correction? #12

Closed Yutang-ETH closed 3 years ago

Yutang-ETH commented 3 years ago

Dear fmlrc developers,

Thank you very much for developing such a beautiful tool for us. I am trying to use fmlrc to correct my ONT long reads (42x) with Illumina short reads (80x) from the same biological sample. Our target genome is a diploid heterozygous genome and from the short-read k-mer distribution, I can see two clear peaks, which means we have reads from both haplotypes (I also believe my long reads are from both haplotypes)

My question is does fmlrc distinguish heterozygous variants from real sequencing errors, will the heterozygous variants remain in long reads after correction? I would like to try read-based phasing later, so I would expect to have correct long reads from both haplotypes. Could you please give me any insights? Thanks very much.

Best, Yutang

holtjma commented 3 years ago

Yutang,

There is no explicit differentiation between sequencing errors and heterozygosity built into fmlrc. However, there is implicit differentiations assuming your errors are "random" or at least significantly lower representation than your real variant signal.

Assuming your variants are sufficiently represented in the data, then you basically want to find the sweet spot where the count for error k-mers is below the filtering threshold and the counts for heterozygous variant k-mers are above that threshold. The main parameters to tinker with will be:

  1. the k-mer sizes, as these influence your counts,
  2. -m - the minimum number of k-mers to be considered "real", and
  3. -f - the dynamic threshold for if you have higher than expected counts

It's entirely possible that the default values of -m 5 -f 0.1 will work here, but I haven't explicitly tested it. For more details, similar concepts have been brought up in this topic and this older topic.

Happy to answer follow up questions if those explanations need more description.

Matt

holtjma commented 3 years ago

As a follow up, since you have k-mer distributions to look at, that will be a good way to inform your choice of -m for that k-mer size.

Yutang-ETH commented 3 years ago

Hi Matt,

Interesting! Thank you very much for your explanation and dcopetti who posted the older topic was my colleague. In my k-mer distribution, the heterozygous peak is at ~40x and the homozygous peak is at ~80x, so I think -m 5 is fine for my analysis but I could increase it a bit to 10 or something. Anyway, I would like to try the default setting first. What do you think?

Yutang

holtjma commented 3 years ago

Yea, that sounds like a good initial approach to me. If you have a way to benchmark/evaluate afterwards (depends on your organism and what you know about it already), that would definitely help inform which one is "better".

Yutang-ETH commented 3 years ago

To be honest, I have no idea for benchmarking right now, but thank you very much for all your input. I will close this issue now and I may reopen it in the future when I have other questions. Cheers, Yutang

Yutang-ETH commented 3 years ago

Hi Matt,

My FMLRC2 program finished within 4 days, I think this is really quick considering that I was correcting 108 gb long reads. My colleague tried Canu correction method with the same data, he said it took like forever and he could not get the corrected reads. So I think the benchmark paper is correct, FMLRC2 does perform very well. Thank you very much for developing this masterpiece.

Since I have the corrected long reads, I thought why not I compare kmers between my short reads and the corrected long reads. From the kmer comparison plot, I think I should be able to tell whether the heterozygous variants were preserved. So, I did KAT and I got the plot.

short_corrected_long

As you can see from the plot, in the heterozygous peak (the first peak), heterozygous kmers from both haplotypes that are present in the short reads are also present in my long reads and the kmer frequency is greater than 6. I think this is true since I have 40x long reads, and I expect kmers from each haplotype to have 20x coverage. If most of the true heterozygous kmers were corrected and only one haplotype left after correction, I should see a large black area in the first peak, but this is not the case. So, I think FMLRC2 did preserve true heterozygous variants in my long reads. What do you think?

As for the read accuracy after correction, since we don't have a high-quality reference genome at this moment, I don't know how to evaluate the accuracy, but I believe the error rate is dramatically decreased, otherwise I should not obtain such a nice kmer plot.

Thanks again for developing this nice tool for us. Best wishes, Yutang

holtjma commented 3 years ago

Yutang,

If I understand the plot correctly, you're creating one of these figures against your read dataset instead of a reference genome, right?

If that's true, then I agree with what you're saying. It looks like you have a high heterozygous peak and another visible homozygous peak at about double, and the vast majority falling on k-mers that are present in your short-read set. Glad to hear it worked out for you!

Matt

Yutang-ETH commented 3 years ago

Hi Matt,

Yes, you are correct. The kat plot I made is comparing kmers between two read files. The heterozygosity is really high in our target genome, I like that but I also hate that, very complicated feeling.

Anyway, thank you very much for all your help and support. I will recommend your program to my colleagues.

Best wishes, Yutang

Yutang-ETH commented 3 years ago

Hi Matt,

Good news! I made a chromosome-level assembly with my raw ONT long reads and then I aligned the fmlrc2 error-corrected ONT long reads to the chromosome. I got some statistics from nanoplot and I could see the average percent identity is increased to 90% compared to 85% of raw ONT read alignment. This indicates fmlrc2 did improved the quality of ONT long reads. From IGV, I can see that some regions may have more than 99% identity. However, there are still some regions have many mismatches or indels, I guess these are repeats and they are difficult for fmlrc2 to correct since it's hard to build the graph with short reads, right?

Regarding whether true variants remain after correction, I mapped short reads to the chromosome and called SNPs and visualized the bam from raw ont and corrected ont. I did see good examples like this: read_error_correction_benchmark

Anyway, I think I can use fmlrc2 corrected ont long reads for further analyses. Brilliant masterpiece, Matt, thank you very much.

Best wishes, Yutang

holtjma commented 3 years ago

However, there are still some regions have many mismatches or indels, I guess these are repeats and they are difficult for fmlrc2 to correct since it's hard to build the graph with short reads, right?

Probably, there would need to be some sort of meta-analysis on the regions to see if they are low-complexity, duplicated, or some other biological phenomenon that makes correction difficult.

Glad to see that you were still able to find a spot to retain your variants, let me know if you publish your results somewhere, I'd be interested to read the paper!