HudsonAlpha / fmlrc2

Apache License 2.0
43 stars 5 forks

About self-correction of ONT #23

Closed Johnsonzcode closed 2 years ago

Johnsonzcode commented 2 years ago

Hi @holtjma,

I am wondering whether FMLRC2 could correct ONT reads using the ONT reads themselves, by applying multiple-sequence-alignment-based correction? I have 200x depth of ONT reads; half of them have QV ~13.44 (~95.47% accuracy).

holtjma commented 2 years ago

Multiple sequence alignment-based correction: The short answer is no; that would require a fundamental algorithm change, since FMLRC is k-mer based, whereas those algorithms tend to factor in full sequences (i.e., reads).
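To illustrate the distinction: a k-mer-based corrector never aligns whole reads against each other; it only asks whether each short window of a read is well supported by k-mer counts from the correction set. Below is a toy Python sketch of that idea (this is not FMLRC2's actual algorithm, which uses a BWT/FM-index and dynamic programming over multiple k sizes; all names here are made up for illustration):

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurring in the correction set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, min_count=2):
    """Naive correction: where a k-mer is unsupported, try swapping
    its last base for one that yields a well-supported k-mer."""
    read = list(read)
    for i in range(len(read) - k + 1):
        kmer = "".join(read[i:i + k])
        if counts[kmer] >= min_count:
            continue  # window is supported by the correction set
        best, best_count = None, min_count
        for b in "ACGT":
            cand_count = counts[kmer[:-1] + b]
            if cand_count > best_count:
                best, best_count = b, cand_count
        if best is not None:
            read[i + k - 1] = best
    return "".join(read)

# correction set: accurate reads covering the same locus
accurate = ["ACGTACGTTT", "CGTACGTTTG", "GTACGTTTGA", "AACGTACGTT"]
counts = kmer_counts(accurate, 5)
noisy = "ACGTACCTTT"   # one substitution error (C should be G)
print(correct_read(noisy, counts, 5))  # → ACGTACGTTT
```

Note the dependence on `min_count`: with a noisy correction set (e.g. raw ONT), true k-mers and error k-mers have overlapping count distributions, which is exactly why a low-error short-read set makes this approach work well at modest coverage.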

As for whether you can do self-correction with ONT: This probably depends on your data. In theory, you can use any dataset as the correction set, regardless of whether it's short- or long-read technology. The reason we focused on short reads as the correction set is that they were less noisy (~1% error rate), so you could have relatively low coverage and still perform correction fairly accurately. With 200x, you certainly have a lot of coverage, so it's a question of how bad the ONT error rate is.

I remember I tinkered with this once with some PacBio CLR data, but I don't remember it working particularly well at the time. I might not have had enough coverage though. If you have compute time and a method to measure the result, your best bet is just to try it.
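If you do want to try it, one way is to point the standard FMLRC2 pipeline at the ONT reads themselves as the correction set. A sketch along the lines of the commands in the fmlrc2 README (file names are placeholders; check the README for the exact invocation and options on your version):

```shell
# Build the BWT from the ONT reads themselves
# (this is where short reads would normally go)
gunzip -c ont_reads.fq.gz | awk 'NR % 4 == 2' | sort | tr NT TN \
    | ropebwt2 -LR | tr NT TN | fmlrc2-convert comp_msbwt.npy

# Correct the same ONT reads against that BWT
fmlrc2 -t 8 comp_msbwt.npy ont_reads.fa corrected_ont.fa
```

You would then need some way to measure the result, e.g. mapping corrected reads back to a reference or checking assembly metrics downstream.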

Johnsonzcode commented 2 years ago

I got that idea from the human chrY article, but the reads they sequenced contained vector sequence to ensure good alignment.


holtjma commented 2 years ago

Closing due to inactivity