bcgsc / abyss

:microscope: Assemble large genomes using short reads
http://www.bcgsc.ca/platform/bioinfo/software/abyss

Question about the 'n' (N) parameter #258

Closed: lsterck closed this issue 6 years ago

lsterck commented 6 years ago

Hi,

I'm on the contig/scaffolding stage of my assembly, for which we can set the n parameter, which affects the number of pairs that are needed to link two sequences. I was wondering how one should actually interpret this parameter: if n=10 (the default), does this mean that I need at least 10 pairs from a single library, or 10 from all libraries considered together? Since this is set in DistanceEst, I fear the former.

What are the recommendations/interpretations for this? Thanks.

benvvalk commented 6 years ago

Hi @lsterck,

I think you are right. Since -n10 is supplied to DistanceEst, it would only generate distance estimates for contig pairs with >= 10 supporting read pair mappings. And that effectively means that you need to have >= 10 supporting read pairs from at least one of the libraries.

You can work around that behaviour by setting ${LIBNAME}_n=5 to require only 5 supporting pairs per individual library, where ${LIBNAME} is the variable name that you used for your PET/MPET library on the abyss-pe command-line. For example:

abyss-pe k=96 name=ecoli lib='mylib1 mylib2' mylib1='lib1-read1.fq lib1-read2.fq' mylib2='lib2-read1.fq lib2-read2.fq' mylib1_n=5 mylib2_n=5 N=10

would have the effect of requiring >= 5 supporting pairs for each individual library, and >= 10 supporting read pairs from all of the libraries pooled together.

sjackman commented 6 years ago

How many libraries do you have, and what are their intended fragment sizes?

lsterck commented 6 years ago

Hi @benvvalk, OK, nice to see my suspicion is correct. Ah, I was not aware of the possibility to set n for a specific library on the command line; I will give that a try. Do I understand correctly that it works both on the individual libraries and on all of them together? A related question then: I usually map the reads in a separate step. Is it then OK to just run/map some libraries with e.g. n=5 and others with n=10, or do I need to make sure these values are also passed to the downstream steps?

I think (just a suggestion) that it might be beneficial to elaborate on this a bit more in the manual, as I can imagine that other people might be struggling with this.

lsterck commented 6 years ago

@sjackman ,

I have 38 PE libraries, 10 Illumina MP (poor quality, old protocol for MP) and another 50 454 MP libraries (high quality but a very low number of reads). Some will have similar insert sizes etc., others will be quite different. The Illumina MP are about 3 and 5 kb inserts, the 454 are ~8 kb. I'm guessing you're hinting at pooling libraries together to increase the potential number of pairs? I can see if that is feasible.

benvvalk commented 6 years ago

@lsterck

Do I understand correctly that it works both on the individual libraries and on all of them together?

Yep. It is two-stage filtering.

$libname_n becomes an argument to DistanceEst, which is used for filtering libraries individually. If $libname_n is not set, it defaults to N. N is passed to abyss-scaffold, which does the second stage of filtering after the (individually filtered) evidence from all of the libraries has been pooled together.
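
Roughly, the two stages look like this under the hood (a simplified sketch only; the file names are placeholders and the real abyss-pe recipes pass more options):

# stage 1, run once per library: keep a contig pair only if it has >= $libname_n supporting pairs
DistanceEst -k96 -n5 -o mylib1-3.dist mylib1-3.hist < mylib1-3.sam

# stage 2, run once on the pooled evidence from all libraries: require >= N supporting pairs
abyss-scaffold -k96 -s1000 -n10 ecoli-6.dot mylib1-6.dist.dot mylib2-6.dist.dot > ecoli-8.path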

Yes, it is fine to use different values of $libname_n for different libraries during your external alignment process, and there is no requirement that those n values agree with the N you use for scaffolding.

benvvalk commented 6 years ago

@lsterck It is not recommended to pool libraries together if they have different insert size distributions. ABySS will do a better job of distance estimation if you keep libraries with different insert sizes separate.

sjackman commented 6 years ago

I'm guessing you're hinting at pooling libraries together to increase the potential number of pairs? I can see if that is feasible.

Yes. If the fragment size distributions are at all similar, I suggest pooling them together. Say, if the IQRs of the fragment size distributions overlap, pool them.

sjackman commented 6 years ago

ABySS doesn't support assembling 454 data. I'd suggest omitting this data for now, though perhaps it could be used to scaffold the assembly using LINKS.

sjackman commented 6 years ago

Or better, try two assemblies, with and without the 454 data, and see which performs better. You may want to try correcting the 454 reads first before feeding them to ABySS.

lsterck commented 6 years ago

$libname_n becomes an argument to DistanceEst, which is used for filtering libraries individually. If $libname_n is not set, it defaults to N. N is passed to abyss-scaffold, which does the second stage of filtering after the (individually filtered) evidence from all of the libraries has been pooled together.

How does this work for the contig stage, then? N only applies to the scaffolding stage, right?

lsterck commented 6 years ago

ABySS doesn't support assembling 454 data. I'd suggest omitting this data for now, though perhaps it could be used to scaffold the assembly using LINKS.

I can see why, but unfortunately that's the bulk of my (useful) data; the Illumina libraries are old-school MP libraries and, as far as I can see, they are a mixture of PE and MP (with the majority being PE :( ). Using LINKS separately is an excellent suggestion though. I might give that a try as well.

sjackman commented 6 years ago

You could try first assembling the 454 data with Miniasm, or Canu, or Flye, or possibly a hybrid assembly with Unicycler.

lsterck commented 6 years ago

According to what I read here (and how I understand it), if I have for instance 1000 contigs and 5000 reads aligning to them, it makes no sense to use n=10; I should rather go for something like n=5?
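
As a back-of-the-envelope check with those illustrative numbers (assuming the 5000 counts reads rather than pairs):

# 5000 reads ~= 2500 pairs spread over ~1000 contigs, i.e. only a couple of pairs per junction
echo $(( 5000 / 2 / 1000 ))   # prints 2, so a threshold of n=10 would filter out nearly all links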

sjackman commented 6 years ago

You may want to try even smaller values. Try n=3 N=1-10, and try a few different values of n.

sjackman commented 6 years ago

You may want to try smaller values of s as well. The default is 1000; try s=500 and s=200.
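
For example, combining the two suggestions above into a single abyss-pe invocation (the library names and read files are just placeholders carried over from the earlier example):

abyss-pe k=96 name=asm lib='mylib1 mylib2' \
    mylib1='lib1-read1.fq lib1-read2.fq' mylib2='lib2-read1.fq lib2-read2.fq' \
    n=3 N=1-10 s=500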

lsterck commented 6 years ago

You could try first assembling the 454 data with Miniasm, or Canu, or Flye, or possibly a hybrid assembly with Unicycler.

I have way too little coverage for that kind of approach. I think I have 1-2x of 454. I have roughly 10x of (useless) Illumina MP. I do also understand I'm stretching things here ;-)

lsterck commented 6 years ago

You may want to try smaller values of s as well. The default is 1000; try s=500 and s=200.

Been there ;) and doing that. I'm at 300 now, as you once mentioned one should (preferably) stay above 3x the k-mer size.

sjackman commented 6 years ago

Try a hybrid assembly of Illumina and 454 using Unicycler.

lsterck commented 6 years ago

Unicycler is an assembly pipeline for bacterial genomes.

Hmmm, "bacterial genomes"... think the other end of the genome size spectrum :/

sjackman commented 6 years ago

How big is your genome?

lsterck commented 6 years ago

Estimated at 25 Gb :/ Don't get me wrong, I no longer expect to get something decent out of this, but I'm starting a new project (estimated 9 Gb, with much more and more recent data), so I'm using this one more as a test case to get a feel for the parameters etc.

sjackman commented 6 years ago

Exciting! We're assembling 20 Gbp genomes using 10x Genomics Chromium linked reads. It's been a good fit for us.

lsterck commented 6 years ago

So if I have, from a previous mapping, an estimate of the number of reads that map (when abyss-map reports the numbers, are these read pairs or reads, i.e. do I divide by 2 to get the number of pairs?), I should be able to guesstimate the minimum value for n (and/or N), correct?

Are there any objections to always doing the mapping with n=1 and then being more stringent in the next steps? Along the same line, @sjackman, where you said "Try n=3 N=1-10", would this not better be n=1 N=3-10? You would expect that over all libraries the number of pairs should be higher, so you can be more stringent there?

sjackman commented 6 years ago

N is for the scaffolding stage, and ABySS can optimize that parameter, so N=1-10 tells ABySS to go ahead and do that. The contig stage n is not automatically optimized, so you'll have to pick a value or manually optimize it. I picked n=3 out of thin air.

lsterck commented 6 years ago

I was just thinking that since n is on a per-library basis and N on an all-libraries basis, you need to be more lenient with n than with N?

sjackman commented 6 years ago

foo_n is per library and both n and N are for all the data summed up.

sjackman commented 6 years ago

The driver Makefile script is a bit tangled, but here are the relevant lines:

https://github.com/bcgsc/abyss/blob/116d2e3d045de3b479f55bc5a7792dc3b4d6fbae/bin/abyss-pe#L308
https://github.com/bcgsc/abyss/blob/116d2e3d045de3b479f55bc5a7792dc3b4d6fbae/bin/abyss-pe#L311
https://github.com/bcgsc/abyss/blob/116d2e3d045de3b479f55bc5a7792dc3b4d6fbae/bin/abyss-pe#L345
https://github.com/bcgsc/abyss/blob/116d2e3d045de3b479f55bc5a7792dc3b4d6fbae/bin/abyss-pe#L704

lsterck commented 6 years ago

I ran a few tests, and it seems that when I do my mapping step (abyss-map, DistanceEst) with a certain n value and afterwards run the rest of ABySS up to the contig level, it no longer matters which value of n I provide on the abyss-pe command line; I get exactly the same contigs. What am I missing?

On the other hand, I also ran a test with n=1 (where it was n=5 before): it greatly improved my stats, roughly 3-fold. (I checked the number of reads mapped, and given the number of unitigs I would only have an average of 1.7 pairs linking unitigs, so the improvement makes sense here.)

lsterck commented 6 years ago

I dry-ran a number of parameter combinations, e.g. using this <lib>_n= option (which does indeed work as described previously here), and I do not see the n parameter change anywhere other than on the DistanceEst command line. Can that be?

sjackman commented 6 years ago

The lib_n parameter affects only DistanceEst. The n and N parameters affect SimpleGraph and abyss-scaffold respectively.
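
Schematically (a simplified sketch; the exact options and intermediate file names differ in the real pipeline):

DistanceEst    -n${mylib1_n}   # per-library filtering of distance estimates
SimpleGraph    -n$n            # contig stage
abyss-scaffold -n$N            # scaffold stage; a range such as N=1-10 is optimized automatically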

lsterck commented 6 years ago

OK, can it be that this is a 'recent' implementation? I notice that in version 2.0.2, n (as well as s) is not passed on the command line to SimpleGraph, but in the latest version it is. If so, would it then make sense to switch to the latest version even for an assembly I started with version 2.0.2, or would that not make any difference (or even have a negative effect)?

sjackman commented 6 years ago

I thought it was older, but you're right. That feature only goes back to 2.1.0. See https://github.com/bcgsc/abyss/blob/116d2e3d045de3b479f55bc5a7792dc3b4d6fbae/bin/abyss-pe#L310-L314 and https://github.com/bcgsc/abyss/commit/da79216d23e8a686fdce10a31bb6054741b2766d

Yes, you can resume an assembly started with 2.0.2 using 2.1.1 and that's fine.

lsterck commented 6 years ago

OK, apparently part of the confusion was thus version-related ;-) Otherwise I think I nearly have this down.

One last thing before I can safely close this issue: I have in the meantime run a test of the scaffolding using n=1 N=1-10, and in the scaffolding stage it does indeed try all the different combinations to find the best one. In this case it returned N=10 together with the smallest contig size (s). I understand that the smallest s will likely give the best results (simply more data to work with), but I don't get why the optimum is N=10. How come lower values for N don't end up with a more scaffolded result/stats?

n       n:300   L50     min     N75     N50     N25     E-size  max     sum     name
52.39e6 4712812 436963  300     3685    9867    19142   13370   125633  16e9    n=2 s=200
51.24e6 4006524 370459  300     4569    11759   22518   15767   160497  15.99e9 n=5 s=200
50.62e6 3621893 333202  300     5337    13272   24978   17536   182097  16.02e9 n=8 s=200
50.5e6  3552470 326088  300     5558    13663   25538   17945   182097  16.06e9 n=10 s=200

With a lower N value one should be able to link more contigs, no? So where is the trade-off in picking a larger one?

lsterck commented 6 years ago

So from version 2.1.0 I can run the mapping step with n=1 and then 'optimise' the n for SimpleGraph without redoing the mapping, right? Or is there a reason to also use a larger n for DistanceEst?

benvvalk commented 6 years ago

@lsterck The scaffolding stage optimizes N to maximize N50. Currently, there is no equivalent automatic optimization of n at the contig stage (SimpleGraph).

The general trade-off with low/high n/N is between misassemblies due to linking contigs with insufficient evidence (n/N set too low) and getting low contiguity/N50 by requiring too much evidence (n/N set too high).

I can't think of a good reason to use a larger n than n=1 for the initial DistanceEst step before SimpleGraph.

sjackman commented 6 years ago

I can't think of a good reason to use a larger n than n=1 for the initial DistanceEst step before SimpleGraph.

- to reduce the size of the files output by DistanceEst
- to use different values of n for different libraries

So from version 2.1.0 I can run the mapping step with n=1 and then 'optimise' the n for SimpleGraph without redoing the mapping, right?

Correct.

If N=10 is giving the best result, I'd suggest increasing your search range to N=1-20.

How come lower values for N don't end up with a more scaffolded result/stats?

That's a question that I'd love to know the global answer to. It's likely different for every organism and sequencing run. In short, though: noise due to mispairing (much more common with mate-pair than with paired-end reads) and due to mismapping.

lsterck commented 6 years ago

Thanks both, got it now.

And the reason I'm starting to get a few warnings of this kind:

warning: the head of 175727402+ does not match the tail of the previous contig
ACATGGAGCTAAGACA
catggagctaagACAT

is then likely due to stretching the parameters into the misassembly range.

sjackman commented 6 years ago

ACATGGAGCTAAGACA
 |||||||||||||||
 catggagctaagACAT

In this case there is a perfect overlap between the two sequences, so that looks like a bug in ABySS. The result will be that the output contig will contain the sequence

ACATGGAGCTAAGACAncatggagctaagACAT

rather than the desired

ACATGGAGCTAAGACAT

You can likely fix this issue by polishing SNVs and indels with Pilon after assembly with ABySS, which is a good idea anyway.
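
A minimal polishing sketch, assuming bwa, samtools, and Pilon are available (the file names here are placeholders):

bwa index asm-scaffolds.fa
bwa mem -t8 asm-scaffolds.fa reads_1.fq reads_2.fq | samtools sort -o aln.bam -
samtools index aln.bam
java -jar pilon.jar --genome asm-scaffolds.fa --frags aln.bam \
    --fix snps,indels --output asm-scaffolds-polished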

sjackman commented 6 years ago

If abyss-scaffold -N10 is producing the best scaffold result, I'd expect larger values of n than -n1 to produce the best SimpleGraph results. Let us know how your parameter optimization goes.

lsterck commented 6 years ago

OK, I ran the N=1-20 range and the optimum is indeed higher than 10, though not by much. Here is an extract of the parameter results:

n       n:300   L50     min     N75     N50     N25     E-size  max     sum     name
50.54e6 3571755 328125  300     5477    13534   25359   17826   182097  16.04e9 n=9 s=200
50.5e6  3552470 326088  300     5558    13663   25538   17945   182097  16.06e9 n=10 s=200
50.5e6  3556710 326231  300     5585    13688   25546   17958   182097  16.08e9 n=11 s=200
50.54e6 3578176 327873  300     5573    13643   25439   17883   182097  16.09e9 n=12 s=200
50.6e6  3612593 330932  300     5536    13542   25214   17732   191771  16.11e9 n=13 s=200

Best scaffold N50 is 13688 at n=11 s=200.

n       n:300   L50     min     N75     N50     N25     E-size  max     sum     name
50.5e6  3556710 326231  300     5585    13688   25546   17958   182097  16.08e9 n=11 s=200

Indeed, n=11 is the optimum; some of the higher values produce a larger max, but since the selection is based on N50, this fits.

What I am a bit surprised by, however, is that the range for s goes from 200 to 2000, whereas I asked for 300-1500:

abyss-scaffold -v -k85 -s300-1500 -n6-20 -g

sjackman commented 6 years ago

What I am a bit surprised by, however, is that the range for s goes from 200 to 2000, whereas I asked for 300-1500.

It always uses a 1-2-5 logarithmic scale, that is 1000, 2000, 5000, 10000, etc.
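
So a request such as -s300-1500 is presumably snapped outward to the surrounding grid points, i.e. evaluated at s=200, 500, 1000 and 2000, which would explain the 200-2000 range you observed.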

lsterck commented 6 years ago

So I finished the n optimization for the contig level. In general I got the best results with a low n for the DistanceEst part and a larger one for the SimpleGraph step. In all my tests I was 'losing' too much info if I increased n for DistanceEst; the differences when changing n for SimpleGraph were less pronounced but still apparent. In my case, n=1 for DistanceEst and n=5 for SimpleGraph gave the best N50 (N25 and N75 as well; only the max length was sometimes better for other combinations). I ran a similar analysis on a different dataset, and that gave broadly comparable results. Unless all this is related to the fact that I'm assembling some pretty huge genomes (>10 Gb) and the data coverage is lower than is typical for smaller genomes?

Nonetheless, I think this opens an option for improving the verbose output. Just as it advises on different c values, ABySS could inform about better (usually lower) DistanceEst n values? (If you know the number of genomic unitigs and the number of unitig-spanning PE reads, you can get the theoretical average, no?)

lsterck commented 6 years ago

It always uses a 1-2-5 logarithmic scale, that is 1000, 2000, 5000, 10000, etc.

So what is the minimum value then? It seems to work with 200 but nothing below that; can that be? It does say, for instance, 175 as the min in the assembly stats line, but it also reports n=X s=200 in the N optimisation table.

sjackman commented 6 years ago

Yes, it's possible that ABySS could compute the physical fragment coverage and make recommendations for the value of n based on that. It's not on our radar though.

It's not recommended to use a value of s that is much less than 3 times the value of k. It tends to cause drastic over-assembly of the genome.

benvvalk commented 6 years ago

@lsterck Thanks for sharing your results; that makes sense regarding your optimal n values at the contig stage (i.e. SimpleGraph).

benvvalk commented 6 years ago

Considering this question answered and closing the ticket.

lsterck commented 6 years ago

Considering this question answered and closing the ticket.

Agreed, and thanks for the support on this. I still hope that this 'thread' will inspire other people to have a look at this when doing an assembly with ABySS, as in my case it drastically improved the assembly result (as in orders of magnitude).

sjackman commented 6 years ago

Wow. Glad to hear it. Could you report your assembly stats (as reported by abyss-fac with your best guess of genome size -G) before and after your assembly parameter tweaking?
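
For example (the assembly file name is a placeholder; -G takes your genome-size estimate):

abyss-fac -G25e9 asm-scaffolds.fa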

lsterck commented 6 years ago

OK, I'll see what I can dig up (the work has been spread out over quite a period :/ ). I might have been a little over-enthusiastic with my previous statement ;), but there is a remarkable improvement at least.

lsterck commented 6 years ago

Hi @sjackman ,

Here is what I found. One of the first runs (likely still with an older version of ABySS, v1.9.0?):

n       n:200   L50     LG50    NG50    min     N80     N50     N20     E-size  max     sum     name
75.35e6 20.9e6  1584970 6406021 454     200     390     2393    7474    4293    67969   17.51e9 AllData_k85/ppinAll85-contigs.fa

The last one, after tweaking (minimum length to use, n for the libraries, n for the graph, ...): s=200, n_libs=1, n_graph=5

n       n:200   L50     LG50    NG50    min     N75     N50     N25     E-size  max     sum     name
27.15e6 5830078 356522  557922  9118    200     4504    14119   28798   19875   222900  19.42e9 AllData_k85/ppinAll85-contigs.fa

The estimated genome size I've put at 24 Gb (yes indeed, I don't reach that for the moment).