Hi @lsterck,
I think you are right. Since `-n10` is supplied to `DistanceEst`, it would only generate distance estimates for contig pairs with >= 10 supporting read-pair mappings. And that effectively means that you need to have >= 10 supporting read pairs from at least one of the libraries.
You can work around that behaviour by setting `${LIBNAME}_n=5` to require only 5 supporting pairs per individual library, where `${LIBNAME}` is the variable name that you used for your PET/MPET library on the `abyss-pe` command line. For example:
```sh
abyss-pe k=96 name=ecoli lib='mylib1 mylib2' mylib1='lib1-read1.fq lib1-read2.fq' mylib2='lib2-read1.fq lib2-read2.fq' mylib1_n=5 mylib2_n=5 N=10
```
would have the effect of requiring >= 5 supporting pairs for each individual library, and >= 10 supporting read pairs from all of the libraries pooled together.
How many libraries do you have, and what are their intended fragment sizes?
Hi @benvvalk, OK, nice to see my suspicion is correct. Ah, I was not aware of the possibility to set `n` for a specific lib on the command line. Will give that a try. Do I understand correctly that it works both on the individual libraries and on all of them together? A related question then: I usually map the reads in a separate step; is it then OK to just run/map some libraries with e.g. n=5 and others with n=10, or do I need to make sure these values are also passed to the downstream steps?
I think (just a suggestion) that it might be beneficial to elaborate on this a bit more in the manual, as I can imagine that other people might be struggling with this.
@sjackman,
I have 38 PE libraries, 10 Illumina MP (poor quality, old protocol for MP) and another 50 libraries of 454 MP (high quality but a very low number of reads). Some will have similar insert sizes, others will be quite different. The Illumina MP are about 3 and 5 kb inserts, the 454 are ~8 kb. I'm guessing you're hinting at pooling libs together to increase the potential number of pairs? I can see if that is feasible.
@lsterck
> Do I understand correctly that it works both on the individual libraries and on all of them together?
Yep. It is two-stage filtering. `$libname_n` becomes an argument to `DistanceEst`, which is used for filtering libraries individually. If `$libname_n` is not set, then it defaults to `N`. `N` is passed to `abyss-scaffold` and does the second stage of filtering, after the (individually filtered) evidence from all of the libraries has been pooled together.
Yes, it is fine to use different values of `$libname_n` for different libraries during your external alignment process. And there is no requirement that those `n` values agree with the `N` you use for scaffolding.
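To make the two stages concrete, something along these lines should work (library names and file names below are just placeholders): the per-library `*_n` values feed `DistanceEst`, while `N` is the pooled threshold used at scaffolding time.

```sh
# Hypothetical two-library run: per-library thresholds differ,
# and the pooled scaffolding threshold N is set independently.
abyss-pe k=96 name=asm lib='pe1 pe2' \
    pe1='pe1_1.fq pe1_2.fq' pe2='pe2_1.fq pe2_2.fq' \
    pe1_n=5 pe2_n=10 N=10
```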
@lsterck It is not recommended to pool libraries together if they have different insert size distributions. ABySS will do a better job of distance estimation if you keep libraries with different insert sizes separate.
> I'm guessing you're hinting at pooling libs together to increase the potential number of pairs? I can see if that is feasible.
Yes. If the fragment size distributions are at all similar, I suggest pooling them together. Say if the IQR of the fragment size distributions overlap, pool them.
ABySS doesn't support assembling 454 data. I'd suggest omitting this data for now, though perhaps it could be used to scaffold the assembly using LINKS.
Or better, try two assemblies, with and without the 454 data, and see which performs better. You may want to try correcting the 454 reads first before feeding them to ABySS.
> `$libname_n` becomes an argument to `DistanceEst`, which is used for filtering libraries individually. If `$libname_n` is not set, then it defaults to `N`. `N` is passed to `abyss-scaffold` and does the second stage of filtering, after the (individually filtered) evidence from all of the libraries has been pooled together.
How does this work then for the contiging stage? `N` is only applicable to the scaffolding stage, right?
> ABySS doesn't support assembling 454 data. I'd suggest omitting this data for now, though perhaps it could be used to scaffold the assembly using LINKS.
I can see why, but unfortunately that's the bulk of my (useful) data; the Illumina are old-school MP libraries, and as far as I can see it's a mixture of PE and MP (with the majority being PE :( ). Using LINKS separately is an excellent suggestion though. Might give that a try as well.
You could try first assembling the 454 data with Miniasm, or Canu, or Flye, or possibly a hybrid assembly with Unicycler.
According to what I read here (and how I understand it), if I have for instance 1000 contigs and 5000 reads aligning to them, it makes no sense to use `n=10`; I should rather go for `n=5`?
You may want to try even smaller values. Try `n=3 N=1-10`, and try a few different values of `n`.
You may want to try smaller values of `s` as well. The default is 1000; try `s=500` and `s=200`.
> You could try first assembling the 454 data with Miniasm, or Canu, or Flye, or possibly a hybrid assembly with Unicycler.

I have way too little coverage for that kind of approach. I think I have 1~2x of 454. I have roughly 10x of (useless) Illumina MP. I do also understand I'm stretching things here ;-)
> You may want to try smaller values of `s` as well. The default is 1000; try `s=500` and `s=200`.

Been there ;) and doing that. I'm at 300 now, as you once mentioned one should (preferably) stay above 3x the k-mer size.
Try a hybrid assembly of Illumina and 454 using Unicycler.
Unicycler is an assembly pipeline for bacterial genomes.
hmmm, "bacterial genomes", think other end of the genome size spectrum :/
How big is your genome?
Estimated 25 Gb :/ Don't get me wrong, I don't expect (anymore) to get something decent out of this, but I'm starting a new project (9 Gb estimated, with much more and more recent data), so I'm using this one more as a test case to get a feel for the parameters etc.
Exciting! We're assembling 20 Gbp genomes using 10x Genomics Chromium linked reads. It's been a good fit for us.
So if I have, from a previous mapping, an estimate of the number of reads that map (when abyss-map reports the numbers, is this read pairs or reads? i.e. do I divide by 2 to get the number of pairs?), I should be able to guesstimate the minimum value for n (and/or N), correct?
Are there any objections to always doing the mapping with n=1 and then being more stringent in the next steps? Along the same line, @sjackman, when you said "Try `n=3 N=1-10`", would this not better be `n=1 N=3-10`? As you would expect that over all libs the number of pairs should be higher and you can thus be more stringent?
`N` is for the scaffolding stage, and ABySS can optimize that parameter, so `N=1-10` tells ABySS to go ahead and do that.
The contiging stage `n` is not automatically optimized, so you'll have to pick a value, or manually optimize it. I picked `n=3` out of thin air.
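Lacking automatic optimization of the contig-stage `n`, one crude way to compare a few values is to give each run its own `name=` so the outputs don't collide (names and files below are hypothetical, and each run redoes the whole assembly, so this is wasteful but simple):

```sh
# Brute-force sweep of the contig-stage n; compare the contig stats afterwards.
for n in 1 2 3 5 8; do
    abyss-pe k=96 name=asm-n$n n=$n N=1-10 lib=pe1 pe1='pe1_1.fq pe1_2.fq'
done
abyss-fac asm-n*-contigs.fa
```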
I was just thinking that since `n` is on a per-library basis and `N` on an all-library basis, you need to be more lenient on `n` than on `N`?
`foo_n` is per library, and both `n` and `N` are for all the data summed up.
The driver Makefile script is a bit tangled, but here are the relevant lines:
https://github.com/bcgsc/abyss/blob/116d2e3d045de3b479f55bc5a7792dc3b4d6fbae/bin/abyss-pe#L308
https://github.com/bcgsc/abyss/blob/116d2e3d045de3b479f55bc5a7792dc3b4d6fbae/bin/abyss-pe#L311
https://github.com/bcgsc/abyss/blob/116d2e3d045de3b479f55bc5a7792dc3b4d6fbae/bin/abyss-pe#L345
https://github.com/bcgsc/abyss/blob/116d2e3d045de3b479f55bc5a7792dc3b4d6fbae/bin/abyss-pe#L704
I ran a few tests, and it seems that when I do my mapping step (abyss-map, DistanceEst) with a certain `n` value and afterwards run the rest of ABySS up to the contig level, it no longer matters which value of `n` I provide on the abyss command line; I get exactly the same set of contigs.
What am I missing?
On the other hand, I also ran a test with `n=1` (where it was `n=5` before): it greatly improved my stats, like 3-fold. (I checked the number of reads mapped, and given the number of unitigs I would only have an average of 1.7 pairs linking unitigs, so the improvement makes sense here.)
I dry-ran a number of parameter combinations, e.g. using this `<lib>_n=` option (which does indeed work as described previously here), and I do not see the `n` parameter changed anywhere other than on the DistanceEst command line. Can that be?
The `lib_n` parameter affects only `DistanceEst`. The `n` and `N` parameters affect `SimpleGraph` and `abyss-scaffold`, respectively.
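As a rough orientation (exact options and intermediate file names vary between versions, so treat this as a sketch rather than the literal commands abyss-pe runs), the parameters end up on the individual tools roughly like this:

```sh
# Per-library threshold (${lib}_n) goes to DistanceEst
DistanceEst -k96 -n1 ...
# Contig-stage threshold (n) goes to SimpleGraph (on the command line from 2.1.0)
SimpleGraph -k96 -n5 ...
# Scaffold-stage threshold (N) goes to abyss-scaffold; a range lets it optimize
abyss-scaffold -k96 -s200 -n1-10 ...
```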
OK, can it be that this is a 'recent' implementation? I notice that in version 2.0.2 `n` (as well as `s`) is not passed on the command line to `SimpleGraph`, but in the latest version it is.
If so, would it then make sense to switch to the latest version even for an assembly I started with version 2.0.2, or would that not make any difference (or even a negative one)?
I thought it was older, but you're right. That feature only goes back to 2.1.0. See https://github.com/bcgsc/abyss/blob/116d2e3d045de3b479f55bc5a7792dc3b4d6fbae/bin/abyss-pe#L310-L314 and https://github.com/bcgsc/abyss/commit/da79216d23e8a686fdce10a31bb6054741b2766d
Yes, you can resume an assembly started with 2.0.2 using 2.1.1 and that's fine.
OK, apparently some part of the confusion was thus version-based ;-) otherwise I think I nearly have this down.
One last thing before I can safely close this issue:
In the meanwhile I have run a test for the scaffolding using `n=1 N=1-10`, and in the scaffolding stage it does indeed try all the different combinations to find the best one. In this case it returned `N=10` together with the smallest minimum contig size (s=200). I understand that the smallest contig size will likely give the best results (simply more data to work with), but I don't get why the optimum is `N=10`. How come lower values of `N` don't end up with a more scaffolded result/stats?
```
n       n:300   L50     min  N75   N50    N25    E-size  max     sum      name
52.39e6 4712812 436963  300  3685  9867   19142  13370   125633  16e9     n=2 s=200
51.24e6 4006524 370459  300  4569  11759  22518  15767   160497  15.99e9  n=5 s=200
50.62e6 3621893 333202  300  5337  13272  24978  17536   182097  16.02e9  n=8 s=200
50.5e6  3552470 326088  300  5558  13663  25538  17945   182097  16.06e9  n=10 s=200
```
With a lower `N` value one should be able to link more contigs, no? So where is the trade-off in picking a larger one?
So from version 2.1.0 on, I can run the mapping step with `n=1` and then 'optimise' the `n` for SimpleGraph without re-doing the mapping, right? Or is there a reason to also use a larger `n` for DistanceEst?
@lsterck The scaffolding stage optimizes `N` to maximize the N50. Currently, there is no equivalent automatic optimization of `n` at the contig stage (`SimpleGraph`).
The general trade-off with low/high `n`/`N` is between misassemblies due to linking contigs with insufficient evidence (`n`/`N` set too low) and getting low contiguity/N50 by requiring too much evidence (`n`/`N` set too high).
I can't think of a good reason to use a larger `n` than `n=1` for the initial `DistanceEst` step before `SimpleGraph`.
> I can't think of a good reason to use a larger `n` than `n=1` for the initial `DistanceEst` step before `SimpleGraph`.

Two possible reasons:

- to reduce the size of the files output by `DistanceEst`
- to use different values of `n` for different libraries
> So from version 2.1.0 on, I can run the mapping step with `n=1` and then 'optimise' the `n` for SimpleGraph without re-doing the mapping, right?
Correct.
If N=10 is giving the best result, I'd suggest increasing your search range to `N=1-20`.
> How come lower values of `N` don't end up with a more scaffolded result/stats?
That's a question that I'd love to know the global answer to. It's likely different for every organism and sequencing run. In short though: noise due to mispairing (much more common with mate-pair than with paired-end libraries) and due to mismapping.
thanks both, got it now.
and the reason I'm starting to get a few warnings of this kind:

```
warning: the head of 175727402+ does not match the tail of the previous contig
ACATGGAGCTAAGACA
catggagctaagACAT
```

is then likely due to stretching the parameters into the misassembly range.
```
ACATGGAGCTAAGACA
 |||||||||||||||
 catggagctaagACAT
```

In this case there is a perfect overlap between the two sequences, so that looks like a bug in ABySS. The result will be that the output contig will contain the sequence `ACATGGAGCTAAGACAncatggagctaagACAT` rather than the desired `ACATGGAGCTAAGACAT`.
You can likely fix this issue by polishing SNVs and indels with Pilon after assembly with ABySS, which is a good idea anyway.
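For what it's worth, a polishing pass with Pilon looks roughly like the sketch below; the file names, the choice of aligner, and the memory setting are placeholders, so check the Pilon documentation for options that match your data.

```sh
# Align reads back to the ABySS assembly, then polish SNVs/indels with Pilon.
bwa index asm-scaffolds.fa
bwa mem -t8 asm-scaffolds.fa lib1_1.fq lib1_2.fq | samtools sort -o lib1.bam -
samtools index lib1.bam
java -Xmx32g -jar pilon.jar --genome asm-scaffolds.fa --frags lib1.bam --output asm-polished
```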
If `abyss-scaffold -N10` is producing the best scaffold result, I'd expect larger values of `n` than `-n1` to produce the best SimpleGraph results. Let us know how your parameter optimization goes.
OK, I ran the N=1-20 range and the optimum is indeed higher than 10, though not by much. Here is an extract of the parameter results:
```
50.54e6 3571755 328125 300 5477 13534 25359 17826 182097 16.04e9 n=9 s=200
50.5e6  3552470 326088 300 5558 13663 25538 17945 182097 16.06e9 n=10 s=200
50.5e6  3556710 326231 300 5585 13688 25546 17958 182097 16.08e9 n=11 s=200
50.54e6 3578176 327873 300 5573 13643 25439 17883 182097 16.09e9 n=12 s=200
50.6e6  3612593 330932 300 5536 13542 25214 17732 191771 16.11e9 n=13 s=200
Best scaffold N50 is 13688 at n=11 s=200.
n       n:300   L50    min N75  N50   N25   E-size max    sum     name
50.5e6  3556710 326231 300 5585 13688 25546 17958  182097 16.08e9 n=11 s=200
```
Indeed n=11 is the optimum; some of the higher values produce a larger max, but since the selection is based on N50, this fits.
What I am a bit surprised by, however, is that the range for s goes from 200-2000, whereas I asked for 300-1500.
```sh
abyss-scaffold -v -k85 -s300-1500 -n6-20 -g
```
> What I am a bit surprised by, however, is that the range for s goes from 200-2000, whereas I asked for 300-1500.
It always uses a 1,2,5 logarithmic scale, that is 1000, 2000, 5000, 10000, etc., so a requested range of 300-1500 gets widened to the covering values 200-2000.
So I finished the `n` optimization for the contig level.
In general I got the best results with a low `n` for the DistanceEst part and a larger one for the SimpleGraph step. In all my tests I was 'losing' too much info if I increased `n` for DistanceEst; the differences when changing `n` for SimpleGraph were less pronounced but still apparent.
In my case `n=1` for DistanceEst and `n=5` for SimpleGraph gave the best N50 (N25 and N75 as well; only the max length was sometimes better for other combinations).
I ran a similar analysis on a different dataset and that gave, in general, comparable results.
Unless all this is related to the fact that I'm assembling some pretty huge genomes (>10 Gb) and the data coverage is lower than is typical for smaller genomes?
Nonetheless, I think this opens an option for improving the verbose output. Just as it advises on different `c` values, ABySS could inform about better (usually lower) DistanceEst `n` values? (If you know the number of genomic unitigs and the number of unitig-spanning PE reads, you can get the theoretical average, no?)
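As a back-of-the-envelope illustration of that suggestion (the numbers below are invented, not from this assembly): the average linking evidence per junction is roughly the number of unitig-spanning pairs divided by the number of unitig junctions, and `n` should sit below that average.

```
average supporting pairs per junction ≈ (unitig-spanning read pairs) / (unitig junctions)
                                      ≈ 5,000,000 / 3,000,000
                                      ≈ 1.7
```

With an average of ~1.7, a default of n=10 would discard nearly all of the linking evidence.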
> It always uses a 1,2,5 logarithmic scale, that is 1000, 2000, 5000, 10000, etc.

So what is the minimum value then? It seems to work with 200 but nothing below that; can that be? It does say, for instance, 175 as the min in the assembly stats line, but it also reports `n=X s=200` in the `N` optimisation table.
Yes, it's possible that ABySS could compute the physical fragment coverage and make recommendations for the value of n based on that. It's not on our radar though.
It's not recommended to use a value of s that is much less than 3 times the value of k. It tends to cause drastic over-assembly of the genome.
@lsterck Thanks for sharing your results, and that makes sense about your optimal `n` values at the contig stage (i.e. `SimpleGraph`).
Considering this question answered and closing the ticket.
> Considering this question answered and closing the ticket.
Agreed, and thanks for the support on this. I still hope that this 'thread' will inspire other people to have a look at this when doing an assembly with ABySS, as in my case it drastically improved the assembly result (as in orders of magnitude).
Wow. Glad to hear it. Could you report your assembly stats (as reported by `abyss-fac` with your best guess of the genome size, `-G`) before and after your assembly parameter tweaking?
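In case it helps, the before/after stats can be generated with something along these lines (the genome-size estimate and the file name are placeholders):

```sh
# The NG50/LG50 columns appear when a genome-size estimate is given with -G.
abyss-fac -G24e9 asm-scaffolds.fa
```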
OK, I'll see what I can find back (the work has been spread out over quite a period :/ ), and I might have been a little over-enthusiastic with my previous statement ;) but there is a remarkable improvement at least.
Hi @sjackman,
here is what I found back. One of the first ones I ran (likely still with an older version of ABySS (v1.9.0?)):
```
n       n:200  L50     LG50    NG50 min N80 N50  N20  E-size max   sum     name
75.35e6 20.9e6 1584970 6406021 454  200 390 2393 7474 4293   67969 17.51e9 AllData_k85/ppinAll85-contigs.fa
```
The last one, after tweaking (min length to use, n for the libs, n for the graph, ...); s=200, n_libs=1, n_graph=5:
```
n       n:200   L50    LG50   NG50 min N75  N50   N25   E-size max    sum     name
27.15e6 5830078 356522 557922 9118 200 4504 14119 28798 19875  222900 19.42e9 AllData_k85/ppinAll85-contigs.fa
```
The estimated genome size I've put at 24 Gb (and yes indeed, I do not get to that for the moment).
Hi,
I'm on the contig/scaffolding stage of my assembly, for which we can set the `n` parameter, affecting the number of pairs that are needed to link two sequences. I was wondering how one should actually interpret this parameter. If `n=10` (the default), does this then mean that I need at least 10 supporting pairs from a single library, or 10 from all libraries considered together? Since this is set in DistanceEst, I fear the former. What are the recommendations/interpretations for this? Thanks.