About ideal short read coverage, -m -f parameters, r1 and r2 of paired-end input, 3-pass correction and Nanopore data correction

romseg commented 3 years ago

I am wondering what would be the recommended coverage of illumina paired-end short read to use for correction. In my case I have ~270x short read coverage of the genome I am intending to assemble de novo. I would guess that this is too high cov to use, so I can split it into smaller cov values, but:

I am not sure what cov value would be the best for fmlrc2
And what parameter optimization would be needed for the chosen cov value (if other than default), specially on the -m and -f parameters. I would be grateful to have your suggestions. Thanks.

holtjma commented 3 years ago

This is a complicated question that doesn't have an exact answer. For the first bullet, there is no theoretical limit on the coverage, but it can influence how you decide -m and -f (see below). When we were working with E. coli, the coverage was quite high (~1000x IIRC) and it was not an issue. We've also done some limited testing with lower coverage human data (~30x) and it was working fine there also.

For the second bullet, I can think of a handful of things that will influence the answer:

The error rates of your short reads (probably ~1%)
The ploidy (or mixture, if multi-organism like microbiome) of your data
How sensitive you want to be to low-frequency variant
The coverage of the short read dataset

Given answers to those questions, you have two levers to pull in fmlrc2 (which you already pointed out):

-m - this will change the absolute minimum count that is considered "solid" or real (default: 5)
-f - this will change the dynamic fraction minimum count that is considered "solid" or real (default: 0.1, i.e. 10%)

So for example, if you have 270x coverage short reads with a 1% error rate, a diploid organism, and you are planning to ignore low-frequency variation (i.e. de novo/mosaic variants); then on average, you're going to have 2.7x error bases per variant. This is below the absolute default of -m=5x, and well below the expected fractional of -f = 270*0.1 = 27x, so you likely would not need to change any of the default parameters.

If you know the error rate is higher OR there is some contaminant in your data you want to ignore OR you don't care about the diploid nature of your data, then you might consider raising those criteria to higher levels. This will basically create a stricter correction, and some may even call this an "overcorrection" in some ways. However, it will likely run faster (i.e. less CPU time due to fewer paths to explore) and be easier to assemble downstream due to higher consistency.

On the other hand, if you want to be very sensitive to low level variation OR if you're working with microbiome data, you might consider lowering the criteria so you don't accidentally correct true variant. I would not recommend going below -m 3 though or lowering -f to be too close to your short-read error rate as this may lead to underperformance both in the result and in run-time.

Does that help? Sorry for the wall of text, but it's all very data and use-case dependent.

romseg commented 3 years ago

It definitely helps! Thanks for the detailed explanation. It is clear.

One other detail is if it is possible to use only the R2 of a paired-end short read data for correction? My R2 sequence I believe is reverse complemented. Would this create a problem for the fmlrc2 process?

I'd like to use only R2 reads because in my case R2 is 150 bp long, whereas my R1 is 125 bp long, and since I have plenty of coverage I'd like to use only the longest short reads for correction (R2) in the hope that this would benefit the correction process, but I am not sure if this is actually true. So if I use R2 only I would end up with 135x coverage and likely default parameters would still be safe (my genome is diploid and planning to ignore low-frequency variation) Thanks!

holtjma commented 3 years ago

So the entire reads are loaded into the BWT, and assuming they are in separate files (i.e. r1 and r2 files), you could just create a BWT with one end of the pair instead of both.

However, fmlrc only performs queries on k-mers (i.e. substrings) of the reads at lengths 21 and 59 by default, and it is smart enough to look for the k-mer and it's reverse complement when performing the query. Since those k-mer lengths are quite a bit smaller than your 125 r1 reads, you will likely only hurt the performance by removing them. Not only would you be losing information from your sequencing, but it's possible you may be injecting a strand bias by removing one end of the reads.

Of course, if compute isn't a major issue, you could always just try it both ways and see how it works out.

romseg commented 3 years ago

I'm glad to know this. I will definitely use both pairs of my reads, I would not like any possibility of strand bias on my long reads. There seems to be no reason to use fewer reads and miss sequence information.

I noticed that Fmlrc2 allows 3-pass correction (e.g. k-mers 21, 59, 79), do you think it would be a good idea to add 79 to the default 21 and 59 k-mers to my correction job? What is the argument to add '21,59,79'? I tried various argument combinations by it did not work, either the job stops as 'argument wasn't expected' or only the default (21,59) was recognized. Thanks.

holtjma commented 3 years ago

To do the multiple k-mer sizes, you just include each one with a space in between like this:

fmlrc2 -k 21 59 79 ...

As for the practicality of a third pass, we don't really know. That's also something that may be worth testing, but that I don't have any evidence for or against.

Theoretically, longer k-mers will increase the accuracy of the correction by allowing you to find more unique anchors (this was shown with the second pass at 59). However, as your k-mer gets longer, you will lose evidence because either the k-mer is only partially in a read or you have a sequencing error, and at a certain point there will be no benefit to larger k-mers because of that loss.

romseg commented 3 years ago

It is very useful to know this information. Since various topics were covered in this issue I edited its title to reflect more accurately the Q/A addressed here, so that other users can see it too.

Finally and as a general outline about correction of Nanopore data, in the fmlrc paper it was mentioned that "Our method is applicable to both Pacbio SMRT sequencing and nanopore sequencing datasets, however further parameter optimization may improve its accuracy and efficiency for nanopore sequences, which exhibit a slightly different error profile than Pacbio." So I am wondering what would be those parameters/optimization that may improve Nanopore accuracy? and how accurate/efficient is fmlrc2 at correcting Nanopore vs PacBio data? Thanks so much for your help! Every answer was very valuable.

holtjma commented 3 years ago

We have done some limited testing on Nanopore data, mostly for the purpose of verifying that there wasn't some strange compatibility issue and that we were getting results that made sense. IIRC, all of the paper analysis was PacBio.

As far as CLI parameters, the main things are based on the error rates. The PacBio CLR we originally tested with had ~10-15% error rate I believe, but I'm not sure what it is for recent Nanopore sequencing. If your error rate was higher, you would want to use smaller k-mers, especially for the first pass so you can actually get some anchors to the sequence. Whereas with a lower error rate, you can make your k-mers larger because there's a higher chance of hitting them. I believe in recent benchmarks (such as this one or this other one), they just used the default settings of fmlrc without any specific tuning.

Most of the tuning that I think would be most relevant are really new features that aren't even implemented in either fmlrc or fmlrc2. This would include things like a custom error model (i.e. PacBio has higher indel than mismatch rate), a different anchoring strategy depending on the inputs, etc.

romseg commented 3 years ago

Great! In the benchmark papers Fmlrc2 performs very well for correction of Nanopore data with short reads.

With the new improvements introduced to the Nanopore basecaller starting in Guppy 3.6.0, the accuracy increased between 1 - 2% over previous versions. So I don't think the error rate of my Nanopore data is higher than ~10-15% and probably just the default settings are very suitable in my case. Thanks.

holtjma commented 3 years ago

Yep, I'm going to close this for now, but if you have other questions, feel free to re-open it. Thanks for using fmlrc!

HudsonAlpha / fmlrc2

About ideal short read coverage, -m -f parameters, r1 and r2 of paired-end input, 3-pass correction and Nanopore data correction #7