Correct Nanopore long reads with Pacbio HiFi reads

Johnsonzcode commented 3 years ago

Happy to use fmlrc again and thank you for your excellent work.

Here is my problem: I want to correct Nanopore long reads, but I have no short reads in my data. Can fmlrc use long reads such as PacBio HiFi reads to correct Nanopore long reads? If so, how should I set the command parameters?

Waiting for your reply.

holtjma commented 3 years ago

Previously, the thing preventing correction via long reads in FMLRC was the high error rate. Since PacBio HiFi reads have resolved the majority of this problem, I suspect you could use them to correct against now. Building the BWT should follow the same process as for paired-end short reads, except that you will have only one read file.

We haven't tested this (though it's been on my radar for a while now), so I can't say for sure what an optimal (or typically good) set of parameters would be. Given that HiFi accuracy is quite high now (>99% IIRC) and assuming you have 30x-ish for the HiFi dataset, I would probably just start with the current defaults (k=21,59) for k-mer sizes. If you're working with diploid or polyploid data, I would also consider lowering the -m parameter (but I wouldn't recommend going below 3). Everything else I would probably leave about the same. I do recommend using -C 10 for speed purposes assuming you have a GB of RAM to spare.

If you're open to sharing results, I'm definitely curious to know how it performs in those tests.
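As a rough illustration of the suggestion above, the two steps (build a BWT from the single HiFi file, then correct the Nanopore reads) can be assembled like this. This is only a sketch under assumptions: the file names are hypothetical, the `ropebwt2` / `fmlrc2-convert` pipeline and the `-t` flag are recalled from the fmlrc2 README rather than taken from this thread, and only `-m` and `-C` are mentioned above. The k-mer sizes are left at their defaults. Check everything against `fmlrc2 -h` before running.

```python
import shlex


def build_commands(hifi_fastq_gz, nanopore_fasta, corrected_fasta,
                   bwt_path="hifi_msbwt.npy",  # hypothetical output name
                   min_count=3,                # -m: floor mentioned in the thread, untested
                   cache_size=10,              # -C: value suggested in the thread
                   threads=8):                 # -t: assumed flag, verify with `fmlrc2 -h`
    """Return (build_bwt, correct) shell command strings; nothing is executed."""
    q = shlex.quote
    # Step 1: keep only the sequence lines of the single HiFi FASTQ and build the BWT.
    # Pipeline shape is an assumption based on the fmlrc2 README.
    build_bwt = (
        f"gunzip -c {q(hifi_fastq_gz)} | awk 'NR % 4 == 2' | tr NT TN | "
        f"ropebwt2 -LR | tr NT TN | fmlrc2-convert {q(bwt_path)}"
    )
    # Step 2: correct the Nanopore reads against that BWT, default k-mer sizes.
    correct = (
        f"fmlrc2 -t {threads} -C {cache_size} -m {min_count} "
        f"{q(bwt_path)} {q(nanopore_fasta)} {q(corrected_fasta)}"
    )
    return build_bwt, correct


if __name__ == "__main__":
    for cmd in build_commands("hifi.fastq.gz", "nanopore.fa", "nanopore.corrected.fa"):
        print(cmd)
```

Printing the commands instead of running them makes it easy to review the flags first and then paste them into a shell.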

Johnsonzcode commented 3 years ago

I guess I didn't give you enough information: my HiFi dataset is about 30X and I am working with a diploid organism. My server can supply up to 250 GB of RAM. I will be happy to share the results with you if it works.

Johnsonzcode commented 3 years ago

It works, or at least it runs smoothly, but I don't know how to check the correction quality.

holtjma commented 3 years ago

Yea, that really depends on your application, and how you measure "success" in that environment. I've used ELECTOR recently, but that was for a simpler non-diploid organism. I don't think it is built to handle ploidy.

If you have some sort of "truth set" (such as Genome in a Bottle samples), you could feed the output into variant calling software and test the downstream impact. If you're doing de novo assembly, then you would probably just be looking at typical de novo stats like N50s, # of contigs, etc.

Johnsonzcode commented 3 years ago

That is to say, if I am doing a de novo assembly, I can compare the N50s and # of contigs before and after correction to check the correction quality: N50s will be longer and # of contigs will be smaller, right?

holtjma commented 3 years ago

In general, yes, that's what you're looking for, and there are usually other metrics that are less clear-cut but still informative. The original FMLRC paper has a table with a handful of these assembly statistics. You can probably find other papers on assembly or read correction that report similar statistics.
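For the before/after comparison discussed above, the two headline numbers are simple to compute directly from an assembly FASTA. The following is a minimal sketch, not part of fmlrc; the function names are made up for illustration, and it assumes a plain (uncompressed) FASTA.

```python
def contig_lengths(fasta_lines):
    """Return the length of every record in an iterable of FASTA lines."""
    lengths = []
    current = 0
    in_record = False
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if in_record:
                lengths.append(current)
            current = 0
            in_record = True
        elif line and in_record:
            current += len(line)
    if in_record:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    example = [">c1", "ACGTACGT", "ACGT", ">c2", "ACGTAC", ">c3", "AC"]
    lens = contig_lengths(example)
    print("contigs:", len(lens), "N50:", n50(lens))
```

To compare two assemblies, pass `open(path)` for each one (assembled from the uncorrected and the corrected reads) and compare the contig count and N50 side by side.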

Johnsonzcode commented 3 years ago

How do I get the error rate? If I want to judge correction quality from N50s and # of contigs, it seems I need two assemblies, one from the dataset before correction and one from the dataset after. So if I want to measure correction quality directly, "error rate" seems like a good choice.

holtjma commented 3 years ago

Quast is one that I've used in the past that I think is pretty good. Quast-LG and other variants on that page may also be more appropriate. You will get a bunch of statistics like misassemblies, N50, etc.
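If a reference or a trusted assembly is available, one direct way to approximate a per-read error rate is to align the reads before and after correction and divide each alignment's edit distance by its number of alignment columns. This is not something suggested in this thread; it is only a sketch, and it assumes the aligner writes SAM records with an `NM:i:` tag. The function name is hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_error_rate(sam_line):
    """Return NM / (aligned columns) for one SAM record, or None if unusable."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    cigar = fields[5]
    if flag & 0x4 or cigar == "*":  # unmapped read or missing CIGAR
        return None
    nm = None
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NM:i:"):
            nm = int(tag[5:])
            break
    if nm is None:
        return None
    # Alignment columns: matches/mismatches plus inserted and deleted bases.
    columns = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "MID=X")
    return nm / columns if columns else None


if __name__ == "__main__":
    record = "read1\t0\tchr1\t100\t60\t90M5I5D\t*\t0\t0\t*\t*\tNM:i:12"
    print(alignment_error_rate(record))
```

Averaging this value over primary alignments for the uncorrected and the corrected reads gives two numbers that can be compared without building two assemblies. For a diploid sample, some of the apparent errors will be real heterozygous differences.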

Johnsonzcode commented 3 years ago

Thank you so much !!!

Johnsonzcode commented 3 years ago

Sorry to post again; here is my result.

The mapping rate also improved a little (by about 0.17%).

HudsonAlpha / fmlrc2

Correct Nanopore long reads with Pacbio HiFi reads #5