bcgsc / abyss

:microscope: Assemble large genomes using short reads
http://www.bcgsc.ca/platform/bioinfo/software/abyss

Performance of abyss-2.0.2 on a fish genome 1.7Gbp #187

Closed mmokrejs closed 6 years ago

mmokrejs commented 6 years ago

Hi, I wonder where I could upload a few nice figures showing the performance of abyss-2.0.2 on our server. It was run with both MPI and OpenMP enabled, so that each step used whichever of the two it supports. Thanks to https://github.com/bcgsc/abyss/issues/185#issuecomment-363173408 .

abyss-pe v=-v np=104 j=104 k=128 ...

It has 3.2TB RAM and 112 CPU cores. I used only 104 CPUs. It finished after 1741 minutes wallclock time which is about 29hrs.

The read IO from our LustreFS could be much higher, although I see the 104 CPU cores ran at full speed. When I use some of the tools from the BBMap bundle ( https://sourceforge.net/projects/bbmap ) for counting k-mers or read error correction, I see much higher read IO. So this may indicate that you could still do better during the initial steps of parsing the input FASTQ files, splitting them into k-mers and counting them. But I do not have hard proof, and I did not bother to check whether abyss does only this.

Overall, I am quite happy; thank you for a nice tool. My real concern now is the precision of mate-pair mapping, but that is another story. I see other GitHub issues have been opened about this.

[Figures: CPU usage, InfiniBand usage, LustreFS usage, system load, and memory usage during the abyss-2.0.2 run]

I wanted to include some numbers, such as the number of input read pairs, how many were discarded, average contig/unitig sizes etc., but they are well hidden in the logfile. So, no numbers.

mmokrejs commented 6 years ago
cat tt_16D1C3L12-3.fa tt_16D1C3L12-4.fa \
        |PathConsensus -v --dot -k128  -p0.9  -o tt_16D1C3L12-5.path -s tt_16D1C3L12-5.fa -g tt_16D1C3L12-5.dot - tt_16D1C3L12-4.dot tt_16D1C3L12-4.path3
Reading `tt_16D1C3L12-4.dot'...
Reading `-'...
Reading `tt_16D1C3L12-4.path3'...
Read 152688 paths
Ambiguous paths: 47661
Merged:          5735
No paths:        0
Too many paths:  11353
Too complex:     5974
Dissimilar:      24599
The minimum coverage of single-end contigs is 0.855932.
The minimum coverage of merged contigs is 0.855932.
n       n:200   L50     min     N80     N50     N20     E-size  max     sum     name
3011140 1623221 81178   200     333     3934    11026   6464    104425  1.38e9  tt_16D1C3L12-6.fa
ln -sf tt_16D1C3L12-6.fa tt_16D1C3L12-contigs.fa
PathConsensus -v --dot -k128  -p0.9  -s tt_16D1C3L12-7.fa -g tt_16D1C3L12-7.dot -o tt_16D1C3L12-7.path tt_16D1C3L12-6.fa tt_16D1C3L12-6.dot tt_16D1C3L12-6.path
Reading `tt_16D1C3L12-6.dot'...
Reading `tt_16D1C3L12-6.fa'...
Reading `tt_16D1C3L12-6.path'...
Read 21315 paths
Ambiguous paths: 111852
Merged:          6930
No paths:        85606
Too many paths:  1448
Too complex:     15805
Dissimilar:      2063
The minimum coverage of single-end contigs is 1.19492.
The minimum coverage of merged contigs is 3.23423.
Consider increasing the coverage threshold parameter, c, to 3.23423.
n       n:200   L50     min     N80     N50     N20     E-size  max     sum     name
2876731 1495511 9250    200     333     22846   127840  69342   793681  1.375e9 tt_16D1C3L12-8.fa
ln -sf tt_16D1C3L12-8.fa tt_16D1C3L12-scaffolds.fa
PathOverlap --overlap -v  -k128 --dot tt_16D1C3L12-7.dot tt_16D1C3L12-7.path >tt_16D1C3L12-8.dot
Reading `tt_16D1C3L12-7.dot'...
Reading `tt_16D1C3L12-7.path'...
ln -sf tt_16D1C3L12-8.dot tt_16D1C3L12-scaffolds.dot
abyss-fac   tt_16D1C3L12-unitigs.fa tt_16D1C3L12-contigs.fa tt_16D1C3L12-scaffolds.fa |tee tt_16D1C3L12-stats.tab
n       n:500   L50     min     N80     N50     N20     E-size  max     sum     name
3835048 475698  83169   500     1260    3192    6791    4462    56412   956e6   tt_16D1C3L12-unitigs.fa
3011140 310186  43902   500     2643    6415    13450   8757    104425  1.006e9 tt_16D1C3L12-contigs.fa
2876731 189115  4029    500     9361    56647   164847  94949   793681  1.003e9 tt_16D1C3L12-scaffolds.fa
ln -sf tt_16D1C3L12-stats.tab tt_16D1C3L12-stats
tr '\t' , <tt_16D1C3L12-stats.tab >tt_16D1C3L12-stats.csv
abyss-tabtomd tt_16D1C3L12-stats.tab >tt_16D1C3L12-stats.md
mmokrejs commented 6 years ago

Here are some statistics from runs of BBMap_36.86/stats.sh in=tt_16D1C3L12-?.fa on the FASTA files:

$ for f in tt_16D1C3L12-?.stats; do echo $f; cat $f; done
tt_16D1C3L12-3.stats
A   C   G   T   N   IUPAC   Other   GC  GC_stdev
0.3002  0.2001  0.2002  0.2995  0.0000  0.0000  0.0000  0.4003  0.0996

Main genome scaffold total:             3835048
Main genome contig total:               3835048
Main genome scaffold sequence total:    1651.869 MB
Main genome contig sequence total:      1651.869 MB     0.000% gap
Main genome scaffold N/L50:             280737/938
Main genome contig N/L50:               280737/938
Main genome scaffold N/L90:             2578210/142
Main genome contig N/L90:               2578210/142
Max scaffold length:                    56.412 KB
Max contig length:                      56.412 KB
Number of scaffolds > 50 KB:            1
% main genome in scaffolds > 50 KB:     0.00%

Minimum     Number          Number          Total           Total           Scaffold
Scaffold    of              of              Scaffold        Contig          Contig  
Length      Scaffolds       Contigs         Length          Length          Coverage
--------    --------------  --------------  --------------  --------------  --------
    All          3,835,048       3,835,048   1,651,869,106   1,651,869,106   100.00%
    100          3,835,048       3,835,048   1,651,869,106   1,651,869,106   100.00%
    250          1,733,198       1,733,198   1,333,132,564   1,333,132,564   100.00%
    500            475,698         475,698     956,024,894     956,024,894   100.00%
   1 KB            267,415         267,415     813,134,944     813,134,944   100.00%
 2.5 KB            116,134         116,134     571,167,119     571,167,119   100.00%
   5 KB             38,065          38,065     299,800,293     299,800,293   100.00%
  10 KB              6,544           6,544      88,583,607      88,583,607   100.00%
  25 KB                125             125       3,756,838       3,756,838   100.00%
  50 KB                  1               1          56,412          56,412   100.00%

tt_16D1C3L12-6.stats
A   C   G   T   N   IUPAC   Other   GC  GC_stdev
0.3001  0.2001  0.2002  0.2995  0.0018  0.0001  0.0000  0.4004  0.1019

Main genome scaffold total:             3011140
Main genome contig total:               3045311
Main genome scaffold sequence total:    1580.472 MB
Main genome contig sequence total:      1577.608 MB     0.181% gap
Main genome scaffold N/L50:             110528/2.838 KB
Main genome contig N/L50:               120623/2.592 KB
Main genome scaffold N/L90:             1850453/161
Main genome contig N/L90:               1891146/160
Max scaffold length:                    104.458 KB
Max contig length:                      66.423 KB
Number of scaffolds > 50 KB:            53
% main genome in scaffolds > 50 KB:     0.19%

Minimum     Number          Number          Total           Total           Scaffold
Scaffold    of              of              Scaffold        Contig          Contig  
Length      Scaffolds       Contigs         Length          Length          Coverage
--------    --------------  --------------  --------------  --------------  --------
    All          3,011,140       3,045,311   1,580,471,679   1,577,607,982    99.82%
    100          3,011,140       3,045,311   1,580,471,679   1,577,607,982    99.82%
    250          1,395,308       1,429,479   1,330,002,425   1,327,138,728    99.78%
    500            310,246         344,286   1,008,724,822   1,005,877,736    99.72%
   1 KB            172,172         206,152     915,193,035     912,365,691    99.69%
 2.5 KB            122,397         150,934     821,859,910     819,446,210    99.71%
   5 KB             61,833          80,024     605,874,187     604,315,054    99.74%
  10 KB             20,351          28,293     316,990,308     316,304,244    99.78%
  25 KB              1,545           2,473      49,196,477      49,116,291    99.84%
  50 KB                 53             101       3,074,739       3,071,242    99.89%
 100 KB                  1               2         104,458         104,425    99.97%

tt_16D1C3L12-7.stats
A   C   G   T   N   IUPAC   Other   GC  GC_stdev
0.3107  0.1897  0.1892  0.3104  0.0012  0.0036  0.0000  0.3789  0.0719

Main genome scaffold total:             772
Main genome contig total:               777
Main genome scaffold sequence total:    0.607 MB
Main genome contig sequence total:      0.606 MB    0.119% gap
Main genome scaffold N/L50:             187/1.089 KB
Main genome contig N/L50:               187/1.089 KB
Main genome scaffold N/L90:             576/371
Main genome contig N/L90:               582/370
Max scaffold length:                    3.685 KB
Max contig length:                      3.685 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%

Minimum     Number          Number          Total           Total           Scaffold
Scaffold    of              of              Scaffold        Contig          Contig  
Length      Scaffolds       Contigs         Length          Length          Coverage
--------    --------------  --------------  --------------  --------------  --------
    All                772             777         607,189         606,469    99.88%
    100                772             777         607,189         606,469    99.88%
    250                761             766         604,828         604,108    99.88%
    500                430             435         485,414         484,694    99.85%
   1 KB                205             208         324,445         323,991    99.86%
 2.5 KB                 14              15          40,391          40,247    99.64%

tt_16D1C3L12-8.stats
A   C   G   T   N   IUPAC   Other   GC  GC_stdev
0.3001  0.2002  0.2003  0.2994  0.0776  0.0001  0.0000  0.4005  0.1032

Main genome scaffold total:             2876731
Main genome contig total:               3015816
Main genome scaffold sequence total:    1703.964 MB
Main genome contig sequence total:      1571.653 MB     7.765% gap
Main genome scaffold N/L50:             11981/19.507 KB
Main genome contig N/L50:               108612/2.736 KB
Main genome scaffold N/L90:             1639503/171
Main genome contig N/L90:               1867631/160
Max scaffold length:                    873.806 KB
Max contig length:                      90.766 KB
Number of scaffolds > 50 KB:            5362
% main genome in scaffolds > 50 KB:     37.88%

Minimum     Number          Number          Total           Total           Scaffold
Scaffold    of              of              Scaffold        Contig          Contig  
Length      Scaffolds       Contigs         Length          Length          Coverage
--------    --------------  --------------  --------------  --------------  --------
    All          2,876,731       3,015,816   1,703,964,196   1,571,652,579    92.24%
    100          2,876,731       3,015,816   1,703,964,196   1,571,652,579    92.24%
    250          1,268,766       1,407,851   1,454,714,840   1,322,403,223    90.90%
    500            189,163         328,132   1,135,210,242   1,002,913,588    88.35%
   1 KB             52,595         191,516   1,042,703,436     910,423,098    87.31%
 2.5 KB             32,095         169,028   1,006,465,234     874,336,179    86.87%
   5 KB             23,821         159,541     978,974,988     846,946,092    86.51%
  10 KB             17,880         148,430     935,974,349     809,391,570    86.48%
  25 KB              9,965         122,063     807,426,899     701,763,980    86.91%
  50 KB              5,362          93,327     645,398,771     565,717,262    87.65%
 100 KB              2,428          61,150     440,176,973     389,086,844    88.39%
 250 KB                382          17,275     130,980,817     117,113,064    89.41%
 500 KB                 28           2,206      17,115,130      15,343,274    89.65%
$
sjackman commented 6 years ago

My real concern now is the precision of mate-pair mapping but that is another story.

You can try aligner=bwamem if you prefer. The default is using abyss-map. If you do, I'd be curious to see the results of abyss-fac on both assemblies. Note that you don't have to rerun the MPI stage. You can resume the assembly from after the unitigs stage. Let us know if you'd like help with that.

sjackman commented 6 years ago

So this may indicate that you could still do better during the initial steps of parsing the input FASTQ files, splitting them into k-mers and counting them.

Do you know what proportion of time is spent loading and counting k-mers vs the total run time of ABYSS-P?

mmokrejs commented 6 years ago

I started the same MPI-enabled job, but with k=64, a few hours ago. At the moment the logfile keeps saying something like:

0: Read 84700000 reads. 0: Hash load: 57122315 / 268435456 = 0.213 using 2.71 GB
1: Read 84600000 reads. 1: Hash load: 56263364 / 268435456 = 0.21 using 2.67 GB
0: Read 84800000 reads. 0: Hash load: 57187818 / 268435456 = 0.213 using 2.71 GB
1: Read 84700000 reads. 1: Hash load: 56327180 / 268435456 = 0.21 using 2.67 GB

These two workers, "0" and "1", appear because there are two paired-end dataset files; I remember the logfile of the k=128 attempt mentioning them later. The RAM values are probably per single thread, so are they supposed to be multiplied by 104 in my case? See the actually used RAM in the figures below.
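The arithmetic behind those log lines can be checked with a short sketch. It reproduces the reported hash load for worker 0 and shows what the total would be if the per-worker figure did apply to all 104 processes. That multiplication is only the assumption raised in the question above; the thread does not confirm it.

```python
# Values copied from the "0:" log lines above.
entries = 57_122_315    # occupied hash slots reported by worker 0
buckets = 268_435_456   # hash table size reported by worker 0 (2**28)

load = entries / buckets
print(f"hash load: {load:.3f}")

# ASSUMPTION (unconfirmed): every one of the 104 processes uses about as
# much memory as worker 0 reports.
per_process_gb = 2.71
processes = 104
estimate_gb = per_process_gb * processes
print(f"estimated total: {estimate_gb:.2f} GB")
```

The memory chart remains the authoritative figure; this estimate is only a plausibility check against it.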

Here are runtime graphs since the job started 3.5hrs ago:

[Figures: CPU usage, LustreFS usage, and memory usage during hash loading]

Please add timing information to your logs: either the current time or the time since the job started, which would maybe be preferred (SPAdes also prints the time since the job started). Also, I would be happy if you prefixed each logged line with "Info: " or "Debug: ". It would be much easier to grep through the file. It is tough to cut down the number of lines once verbose logging is enabled. Even better if each line included the name of the method/tool in action, like:

Info: abyss-map: ...
Info: DistanceEst: ...

To answer your question, I think this hash-filling phase finished when the memory usage was at its maximum, so it ran from 2/5/2018 19:33 to 2/6/2018 12:00, when the disk read IO went to zero (see the LustreFS chart in the original post).

sjackman commented 6 years ago

To answer your question, I think this hash-filling phase finished when the memory usage was at its maximum, so it ran from 2/5/2018 19:33 to 2/6/2018 12:00, when the disk read IO went to zero (see the LustreFS chart in the original post).

So how long was sequence loading (from start until peak memory) and how long was the total run time of ABYSS-P?

mmokrejs commented 6 years ago

So how long was sequence loading (from start until peak memory) and how long was the total run time of ABYSS-P?

987 minutes vs. 1741 minutes, wallclock.

sjackman commented 6 years ago

So loading is 57% of run time, which is a significant portion. Thanks for this helpful info.
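For reference, the 57% figure follows directly from the timestamps and the total wallclock time quoted in this thread. A minimal sketch, assuming the dates are written month/day/year:

```python
from datetime import datetime

# Start and end of the hash-filling (loading) phase, as quoted in the thread.
# ASSUMPTION: the dates are month/day/year.
FMT = "%m/%d/%Y %H:%M"
load_start = datetime.strptime("2/5/2018 19:33", FMT)
load_end = datetime.strptime("2/6/2018 12:00", FMT)

load_minutes = (load_end - load_start).total_seconds() / 60
total_minutes = 1741  # total wallclock time reported for the run

fraction = load_minutes / total_minutes
print(f"loading: {load_minutes:.0f} min of {total_minutes} min = {fraction:.0%}")
```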

benvvalk commented 6 years ago

@mmokrejs Thank you for taking the time to post your performance benchmarks. It is very interesting for us.

mmokrejs commented 6 years ago

You are welcome. Would you please comment on how to interpret the numbers from the PathConsensus step in https://github.com/bcgsc/abyss/issues/187#issuecomment-363741128 ? I mean, what do they tell me? How should I change the k-mer size, or should I bother with a more thorough cleanup of paired-end contaminants from the mate-pair reads? Do they say something about the complexity of the genome or the number of alleles? Or do they reveal the number of somewhat error-free contigs/unitigs, while future scaffolding work will be just a gamble trying to order these golden pieces into some series? I am probably never going to open the .dot or .path files, but is there a summary message somewhere that would teach me something? I am sorry for the naive questions.

sjackman commented 6 years ago

To make inferences about the repeat complexity and heterozygosity of your genome, I recommend using ntCard and GenomeScope.

To optimize ABySS, I recommend trying different values of k and N, which make the biggest difference to the contiguity of the assembly. N only affects the very final stage of scaffolding, and so can be optimized quite quickly if you avoid rerunning the entire pipeline.

mmokrejs commented 6 years ago

To optimize ABySS, I recommend trying different values of k and N, which make the biggest difference to the contiguity of the assembly. N only affects the very final stage of scaffolding, and so can be optimized quite quickly if you avoid rerunning the entire pipeline.

Would you be more explicit about which command you mean, or simply how to achieve what you describe? I figured out you are talking about abyss-pe -N $int ... scaffolds, but should I delete the ${name}-scaffolds.fa files, or will it just create new ones with a higher number?

sjackman commented 6 years ago

The command is abyss-pe N=10 … Grep the abyss-scaffold command out of your log file. Run it with different values of -n. The last line it reports is NG50. Select your favourite value of -n, say the one that maximizes NG50. Rerun abyss-scaffold with your favourite value of -n. Run abyss-pe --dry-run N=xxx … with your favourite value of N to confirm that it will only rerun the last few commands of the pipeline, then rerun abyss-pe N=xxx … to run those commands.

Note that abyss-scaffold takes the option -n10 whereas abyss-pe takes the option N=10

You could of course just run the entire pipeline abyss-pe N=10 … with different values of N, but the above, although more complicated, is much faster.
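Comparing the runs can be scripted once the output of each abyss-scaffold run has been saved. The sketch below is only an illustration: it assumes each run's output ends with a summary line of the form "Best scaffold N50 is … at s=…", as in the outputs posted in this thread, and the helper names best_n50 and pick_n are hypothetical. Note that this line reports N50, so substitute another metric if you prefer to optimize NG50.

```python
import re

# Summary line as it appears in the abyss-scaffold outputs posted in this
# thread, e.g. "Best scaffold N50 is 11461 at s=1000."
BEST_LINE = re.compile(r"Best scaffold N50 is (\d+) at s=(\d+)")

def best_n50(output):
    """Return (N50, s) from the last summary line of one run, or None."""
    matches = BEST_LINE.findall(output)
    if not matches:
        return None
    n50, s = matches[-1]
    return int(n50), int(s)

def pick_n(outputs_by_n):
    """Return the -n value with the highest N50; ties go to the smallest n."""
    scored = {n: best_n50(text) for n, text in outputs_by_n.items()}
    scored = {n: v for n, v in scored.items() if v is not None}
    if not scored:
        return None
    # max() keeps the first maximal item, so sorting the keys breaks ties
    # in favour of the smallest n.
    return max(sorted(scored), key=lambda n: scored[n][0])

# Two summary lines of the kind reported in this thread.
runs = {
    10: "Best scaffold N50 is 11461 at s=1000.",
    12: "Best scaffold N50 is 10655 at s=1000.",
}
print(pick_n(runs))
```

The chosen value would then be passed to abyss-pe as N=…, as described above.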

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

mmokrejs commented 6 years ago
$ grep abyss-scaffold abyss_ecc.101473.log
abyss-scaffold -v -k128 -s1000-10000 -n10 ...

so I ran

abyss-scaffold -v -k128 -s100-10000 -n 3 -G 1267403131 ...
Reading `tt_16D1C3L12__abyss_128-6.dot'...
V=5049442 E=8506670 E/V=1.68
Degree: ▃█▅_
        01234
0: 18% 1: 42% 2-4: 37% 5+: 2.9% max: 910
Reading `HFYJ5AFXX.5kb.lmp-6.dist.dot'...
V=5049442 E=8762331 E/V=1.74
Degree: ▂█▅▁
        01234
0: 16% 1: 42% 2-4: 39% 5+: 3% max: 910
Reading `HFYJ5AFXX.8kb.lmp-6.dist.dot'...
V=5049442 E=8878485 E/V=1.76
Degree: ▂█▅▁
        01234
0: 16% 1: 42% 2-4: 39% 5+: 3% max: 910
Reading `HWFNLBCXY.lmp-6.dist.dot'...

...

n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2520923 2520923 105820  75262   3924    128 256 2874    9085    7139    157470  1.473e9 s=10000

Removed 4721356 vertices.
Removed 313393 edges.
V=328086 E=364252 E/V=1.11
Degree: ▂█▃
        01234
0: 17% 1: 57% 2-4: 26% 5+: 0.024% max: 27
Removed 264 cyclic edges.
V=328086 E=363988 E/V=1.11
Degree: ▂█▃
        01234
0: 18% 1: 57% 2-4: 26% 5+: 0.023% max: 27
Added 1432 edges to ambiguous vertices.
Removed 7372 tips.
V=313342 E=352108 E/V=1.12
Degree: ▁█▃
        01234
0: 16% 1: 57% 2-4: 26% 5+: 0.021% max: 25
Cleared 2245 ambiguous vertices.
Removed 131 ambiguous vertices.
V=313080 E=340742 E/V=1.09
Degree: ▂█▂
        01234
0: 17% 1: 58% 2-4: 25% 5+: 0.0042% max: 21
Removed 75120 transitive edges.
V=313080 E=265622 E/V=0.848
Degree: ▁█
        01234
0: 17% 1: 80% 2-4: 2.1% 5+: 0.0026% max: 20
Removed 3531 tips.
V=306018 E=258560 E/V=0.845
Degree: ▁█
        01234
0: 17% 1: 82% 2-4: 1.4% 5+: 0.002% max: 20
Removed 7081 vertices in bubbles.
V=298934 E=248126 E/V=0.83
Degree: ▁█
        01234
0: 17% 1: 82% 2-4: 0.28% 5+: 0.002% max: 20
Removed 138 weak edges.
V=298934 E=247988 E/V=0.83
Degree: ▁█
        01234
0: 17% 1: 82% 2-4: 0.23% 5+: 0.002% max: 20
Assembled 144940 contigs in 22216 scaffolds.
V=298934 E=247988 E/V=0.83
Degree: ▁█
        01234
0: 17% 1: 82% 2-4: 0.23% 5+: 0.002% max: 20
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2401997 2401997 15073   8978    23050   128 256 11461   100807  55075   904172  1.467e9 s=1000
Best scaffold N50 is 11461 at s=1000.

So what does the above really tell me? For example, does 0: 18% 1: 42% 2-4: 37% 5+: 2.9% max: 910 mean that 42% of links exist due to only 2-4 mate-pairs?

Aha, you probably wanted me to show the non-verbose output.

$ abyss-scaffold -k128 -s100-10000 -n 2 -G 1267403131 -g tt_16D1C3L12__abyss_128-6.path.dot  tt_16D1C3L12__abyss_128-6.dot HFYJ5AFXX.5kb.lmp-6.dist.dot HFYJ5AFXX.8kb.lmp-6.dist.dot HWFNLBCXY.lmp-6.dist.dot HFYJ5AFXX.5kb.unknown-6.dist.dot HFYJ5AFXX.8kb.unknown-6.dist.dot HWFNLBCXY.2.unknown-6.dist.dot HFYJ5AFXX.5kb.fragments-6.dist.dot HFYJ5AFXX.8kb.fragments-6.dist.dot HWFNLBCXY.2.fragments-6.dist.dot >tt_16D1C3L12__abyss_128-6.path
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2396903 2396903 15550   9207    22270   128 256 10926   99549   53585   905173  1.467e9 s=100
2396959 2396959 15547   9207    22273   128 256 10931   99549   53568   905173  1.467e9 s=200
2397391 2397391 15487   9181    22373   128 256 10992   99500   53778   905146  1.467e9 s=500
2401997 2401997 15073   8978    23050   128 256 11461   100807  55075   904172  1.467e9 s=1000
2416139 2416139 17217   10037   20187   128 256 9310    91419   49894   722459  1.468e9 s=2000
2495803 2495803 81007   50578   3906    128 256 2865    24673   15378   258793  1.471e9 s=5000
2520923 2520923 105820  75262   3924    128 256 2874    9085    7139    157470  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2401997 2401997 15073   8978    23050   128 256 11461   100807  55075   904172  1.467e9 s=1000
Best scaffold N50 is 11461 at s=1000.
$ abyss-scaffold -k128 -s100-10000 -n 3 -G 1267403131 -g tt_16D1C3L12__abyss_128-6.path.dot  tt_16D1C3L12__abyss_128-6.dot HFYJ5AFXX.5kb.lmp-6.dist.dot HFYJ5AFXX.8kb.lmp-6.dist.dot HWFNLBCXY.lmp-6.dist.dot HFYJ5AFXX.5kb.unknown-6.dist.dot HFYJ5AFXX.8kb.unknown-6.dist.dot HWFNLBCXY.2.unknown-6.dist.dot HFYJ5AFXX.5kb.fragments-6.dist.dot HFYJ5AFXX.8kb.fragments-6.dist.dot HWFNLBCXY.2.fragments-6.dist.dot >tt_16D1C3L12__abyss_128-6.path
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2396903 2396903 15550   9207    22270   128 256 10926   99549   53585   905173  1.467e9 s=100
2396959 2396959 15547   9207    22273   128 256 10931   99549   53568   905173  1.467e9 s=200
2397391 2397391 15487   9181    22373   128 256 10992   99500   53778   905146  1.467e9 s=500
2401997 2401997 15073   8978    23050   128 256 11461   100807  55075   904172  1.467e9 s=1000
2416139 2416139 17217   10037   20187   128 256 9310    91419   49894   722459  1.468e9 s=2000
2495803 2495803 81007   50578   3906    128 256 2865    24673   15378   258793  1.471e9 s=5000
2520923 2520923 105820  75262   3924    128 256 2874    9085    7139    157470  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2401997 2401997 15073   8978    23050   128 256 11461   100807  55075   904172  1.467e9 s=1000
Best scaffold N50 is 11461 at s=1000.
$ abyss-scaffold -k128 -s100-10000 -n 5 -G 1267403131 -g tt_16D1C3L12__abyss_128-6.path.dot  tt_16D1C3L12__abyss_128-6.dot HFYJ5AFXX.5kb.lmp-6.dist.dot HFYJ5AFXX.8kb.lmp-6.dist.dot HWFNLBCXY.lmp-6.dist.dot HFYJ5AFXX.5kb.unknown-6.dist.dot HFYJ5AFXX.8kb.unknown-6.dist.dot HWFNLBCXY.2.unknown-6.dist.dot HFYJ5AFXX.5kb.fragments-6.dist.dot HFYJ5AFXX.8kb.fragments-6.dist.dot HWFNLBCXY.2.fragments-6.dist.dot >tt_16D1C3L12__abyss_128-6.path
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2396903 2396903 15550   9207    22270   128 256 10926   99549   53585   905173  1.467e9 s=100
2396959 2396959 15547   9207    22273   128 256 10931   99549   53568   905173  1.467e9 s=200
2397391 2397391 15487   9181    22373   128 256 10992   99500   53778   905146  1.467e9 s=500
2401997 2401997 15073   8978    23050   128 256 11461   100807  55075   904172  1.467e9 s=1000
2416139 2416139 17217   10037   20187   128 256 9310    91419   49894   722459  1.468e9 s=2000
2495803 2495803 81007   50578   3906    128 256 2865    24673   15378   258793  1.471e9 s=5000
2520923 2520923 105820  75262   3924    128 256 2874    9085    7139    157470  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2401997 2401997 15073   8978    23050   128 256 11461   100807  55075   904172  1.467e9 s=1000
Best scaffold N50 is 11461 at s=1000.
$ abyss-scaffold -k128 -s100-10000 -n 7 -G 1267403131 -g tt_16D1C3L12__abyss_128-6.path.dot  tt_16D1C3L12__abyss_128-6.dot HFYJ5AFXX.5kb.lmp-6.dist.dot HFYJ5AFXX.8kb.lmp-6.dist.dot HWFNLBCXY.lmp-6.dist.dot HFYJ5AFXX.5kb.unknown-6.dist.dot HFYJ5AFXX.8kb.unknown-6.dist.dot HWFNLBCXY.2.unknown-6.dist.dot HFYJ5AFXX.5kb.fragments-6.dist.dot HFYJ5AFXX.8kb.fragments-6.dist.dot HWFNLBCXY.2.fragments-6.dist.dot >tt_16D1C3L12__abyss_128-6.path
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2396903 2396903 15550   9207    22270   128 256 10926   99549   53585   905173  1.467e9 s=100
2396959 2396959 15547   9207    22273   128 256 10931   99549   53568   905173  1.467e9 s=200
2397391 2397391 15487   9181    22373   128 256 10992   99500   53778   905146  1.467e9 s=500
2401997 2401997 15073   8978    23050   128 256 11461   100807  55075   904172  1.467e9 s=1000
2416139 2416139 17217   10037   20187   128 256 9310    91419   49894   722459  1.468e9 s=2000
2495803 2495803 81007   50578   3906    128 256 2865    24673   15378   258793  1.471e9 s=5000
2520923 2520923 105820  75262   3924    128 256 2874    9085    7139    157470  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2401997 2401997 15073   8978    23050   128 256 11461   100807  55075   904172  1.467e9 s=1000
Best scaffold N50 is 11461 at s=1000.
$ abyss-scaffold -k128 -s100-10000 -n 10 -G 1267403131 -g tt_16D1C3L12__abyss_128-6.path.dot  tt_16D1C3L12__abyss_128-6.dot HFYJ5AFXX.5kb.lmp-6.dist.dot HFYJ5AFXX.8kb.lmp-6.dist.dot HWFNLBCXY.lmp-6.dist.dot HFYJ5AFXX.5kb.unknown-6.dist.dot HFYJ5AFXX.8kb.unknown-6.dist.dot HWFNLBCXY.2.unknown-6.dist.dot HFYJ5AFXX.5kb.fragments-6.dist.dot HFYJ5AFXX.8kb.fragments-6.dist.dot HWFNLBCXY.2.fragments-6.dist.dot >tt_16D1C3L12__abyss_128-6.path
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2396903 2396903 15550   9207    22270   128 256 10926   99549   53585   905173  1.467e9 s=100
2396959 2396959 15547   9207    22273   128 256 10931   99549   53568   905173  1.467e9 s=200
2397391 2397391 15487   9181    22373   128 256 10992   99500   53778   905146  1.467e9 s=500
2401997 2401997 15073   8978    23050   128 256 11461   100807  55075   904172  1.467e9 s=1000
2416139 2416139 17217   10037   20187   128 256 9310    91419   49894   722459  1.468e9 s=2000
2495803 2495803 81007   50578   3906    128 256 2865    24673   15378   258793  1.471e9 s=5000
2520923 2520923 105820  75262   3924    128 256 2874    9085    7139    157470  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2401997 2401997 15073   8978    23050   128 256 11461   100807  55075   904172  1.467e9 s=1000
Best scaffold N50 is 11461 at s=1000.

I do not see any difference from the -n 10 setting, which was used by default (according to the log file):

$ abyss-scaffold --help
Usage: abyss-scaffold -k<kmer> [OPTION]... FASTA|OVERLAP DIST...
Scaffold contigs using the distance estimate graph.

 Arguments:

  FASTA    contigs in FASTA format
  OVERLAP  the contig overlap graph
  DIST     estimates of the distance between contigs

 Options:

  -n, --npairs=N        minimum number of pairs [0]
  -s, --seed-length=N   minimum contig length [200]
          or -s N0-N1   Find the value of s in [N0,N1]
                        that maximizes the scaffold N50.

I was close to saying it makes no sense to continue with abyss-pe reusing the same value 10, but I realized from the log file that it was run without N=... altogether. But is the default value 10 also set inside abyss-pe?

I backed up previous scaffolds

$ myprefix="tt_16D1C3L12__abyss_128"; for f in ${myprefix}-7.fa ${myprefix}-7.dot ${myprefix}-7.path ${myprefix}-8.fa ${myprefix}-stats.tab ${myprefix}-stats.csv ${myprefix}-stats.md; do cp -p "$f" "${f}.ori"; done

but I had better wait for your answer first. Thank you.

sjackman commented 6 years ago

So what does the above really tell me? For example, does 0: 18% 1: 42% 2-4: 37% 5+: 2.9% max: 910 mean that 42% of links exist due to only 2-4 mate-pairs?

No, that's a histogram of vertex degree, that is, the number of edges incident to each vertex.
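To make that concrete, here is a toy sketch that bins vertex degrees into the same classes as the log line (0, 1, 2-4, 5+). It is an illustration rather than ABySS's own code: it treats every edge as undirected and counts it at both endpoints, which is an assumption about how the degree is tallied, and the function name degree_summary is hypothetical.

```python
from collections import Counter

def degree_summary(vertices, edges):
    """Percentage of vertices per degree class, plus the maximum degree."""
    # Start every vertex at degree 0 so isolated vertices are counted too.
    degree = Counter({v: 0 for v in vertices})
    for u, v in edges:
        # ASSUMPTION: each edge counts once at each of its two endpoints.
        degree[u] += 1
        degree[v] += 1

    counts = {"0": 0, "1": 0, "2-4": 0, "5+": 0}
    for d in degree.values():
        if d == 0:
            counts["0"] += 1
        elif d == 1:
            counts["1"] += 1
        elif d <= 4:
            counts["2-4"] += 1
        else:
            counts["5+"] += 1

    total = len(degree) or 1  # avoid dividing by zero for an empty graph
    summary = {k: 100 * c / total for k, c in counts.items()}
    summary["max"] = max(degree.values(), default=0)
    return summary

# Toy graph: "a" touches three edges, "e" touches none.
toy = degree_summary("abcde", [("a", "b"), ("a", "c"), ("a", "d")])
print(toy)
```

In this toy graph, one vertex of five has no edges (20%), three have exactly one (60%), and one falls in the 2-4 class (20%). The percentages describe vertices, not mate-pair support.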

sjackman commented 6 years ago

But is the default value 10 also set inside abyss-pe?

Yes, the default is N=10.

Interesting that for this data set all values of n between 2 and 10 seem to perform equally well. I'd suggest trying larger values of n. Increase n until the result changes. If it gets better, great! If it gets worse, stick with N=10.

mmokrejs commented 6 years ago

Yes, the values are just worse.

$ abyss-scaffold -k128 -s100-10000 -n 12 -G 1267403131 ...
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2401155 2401155 16729   10026   20831   128 256 10424   89879   48904   589188  1.467e9 s=100
2401191 2401191 16731   10027   20831   128 256 10424   89879   48889   589188  1.467e9 s=200
2401401 2401401 16707   10016   20872   128 256 10448   89985   49053   590846  1.467e9 s=500
2405031 2405031 16507   9914    21198   128 256 10655   90308   49316   589963  1.467e9 s=1000
2418661 2418661 18767   11032   18634   128 256 8721    82174   45305   606894  1.468e9 s=2000
2496590 2496590 81794   51342   3907    128 256 2865    23767   14777   258793  1.471e9 s=5000
2521062 2521062 105959  75373   3925    128 256 2874    9085    7047    157470  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2405031 2405031 16507   9914    21198   128 256 10655   90308   49316   589963  1.467e9 s=1000
Best scaffold N50 is 10655 at s=1000.
$ abyss-scaffold -k128 -s100-10000 -n 15 -G 1267403131 ...
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2406177 2406177 18502   11187   19034   128 256 9619    80263   44058   586841  1.467e9 s=100
2406201 2406201 18502   11187   19034   128 256 9621    80263   44059   586841  1.467e9 s=200
2406357 2406357 18496   11187   19049   128 256 9630    80124   44104   586841  1.467e9 s=500
2409057 2409057 18449   11165   19094   128 256 9677    80092   44039   590638  1.467e9 s=1000
2422104 2422104 20912   12376   16884   128 256 7976    73112   40264   588866  1.468e9 s=2000
2497589 2497589 82753   52311   3908    128 256 2866    22557   14084   258793  1.472e9 s=5000
2521197 2521197 106094  75508   3925    128 256 2874    9085    6961    157470  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2409057 2409057 18449   11165   19094   128 256 9677    80092   44039   590638  1.467e9 s=1000
Best scaffold N50 is 9677 at s=1000.
$ abyss-scaffold -k128 -s100-10000 -n 17 -G 1267403131 ...
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2408657 2408657 19586   11905   18045   128 256 9230    75095   41104   543238  1.467e9 s=100
2408685 2408685 19585   11905   18045   128 256 9232    75095   41104   543238  1.467e9 s=200
2408812 2408812 19585   11906   18047   128 256 9232    75065   41149   542938  1.467e9 s=500
2411200 2411200 19566   11904   18100   128 256 9241    74783   41012   541994  1.467e9 s=1000
2424032 2424032 22157   13175   15972   128 256 7590    68550   37632   540451  1.468e9 s=2000
2498128 2498128 83292   52850   3908    128 256 2866    22022   13703   258793  1.472e9 s=5000
2521269 2521269 106166  75580   3925    128 256 2874    9085    6924    157470  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2411200 2411200 19566   11904   18100   128 256 9241    74783   41012   541994  1.467e9 s=1000
Best scaffold N50 is 9241 at s=1000.
$ abyss-scaffold -k128 -s100-10000 -n 20 -G 1267403131 ...
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2411397 2411397 20979   12799   16854   128 256 8720    69078   38113   517862  1.467e9 s=100
2411430 2411430 20976   12799   16854   128 256 8722    69078   38113   517862  1.467e9 s=200
2411537 2411537 20977   12798   16855   128 256 8722    69046   38155   517862  1.467e9 s=500
2413830 2413830 20979   12805   16899   128 256 8713    68998   38038   523230  1.468e9 s=1000
2426391 2426391 23717   14144   14929   128 256 7129    63386   34974   521458  1.468e9 s=2000
2498885 2498885 84049   53607   3908    128 256 2866    21209   13237   258793  1.472e9 s=5000
2521373 2521373 106270  75684   3925    128 256 2874    9085    6869    146598  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2411430 2411430 20976   12799   16854   128 256 8722    69078   38113   517862  1.467e9 s=200
Best scaffold N50 is 8722 at s=200.
$ abyss-scaffold -k128 -s100-10000 -n 23 -G 1267403131 ...
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2414407 2414407 22313   13668   15955   128 256 8278    64078   35839   524054  1.468e9 s=100
2414436 2414436 22313   13668   15955   128 256 8277    64078   35839   524054  1.468e9 s=200
2414531 2414531 22315   13671   15961   128 256 8277    64048   35833   524054  1.468e9 s=500
2416422 2416422 22344   13686   15934   128 256 8243    63914   35690   523230  1.468e9 s=1000
2428552 2428552 25168   15070   14101   128 256 6773    59272   32728   521458  1.468e9 s=2000
2499642 2499642 84806   54364   3908    128 256 2866    20492   12782   258793  1.472e9 s=5000
2521459 2521459 106356  75770   3925    128 256 2874    9085    6820    146598  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2414407 2414407 22313   13668   15955   128 256 8278    64078   35839   524054  1.468e9 s=100
Best scaffold N50 is 8278 at s=100.
$ abyss-scaffold -k128 -s100-10000 -n 25 -G 1267403131 ...
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2416621 2416621 23270   14279   15369   128 256 7977    61560   34179   489346  1.468e9 s=100
2416641 2416641 23270   14279   15369   128 256 7977    61560   34179   489346  1.468e9 s=200
2416716 2416716 23274   14280   15369   128 256 7976    61553   34177   489346  1.468e9 s=500
2418320 2418320 23313   14305   15342   128 256 7949    61394   34035   489346  1.468e9 s=1000
2430063 2430063 26159   15701   13599   128 256 6568    56777   31347   484560  1.468e9 s=2000
2500101 2500101 85265   54823   3908    128 256 2866    20117   12512   258793  1.472e9 s=5000
2521507 2521507 106404  75818   3925    128 256 2874    9085    6792    146598  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2416621 2416621 23270   14279   15369   128 256 7977    61560   34179   489346  1.468e9 s=100
Best scaffold N50 is 7977 at s=100.
$ abyss-scaffold -k128 -s100-10000 -n 28 -G 1267403131 ...
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2420026 2420026 24634   15126   14467   128 256 7540    57644   32379   489346  1.468e9 s=100
2420042 2420042 24634   15124   14468   128 256 7540    57644   32379   489346  1.468e9 s=200
2420101 2420101 24636   15127   14467   128 256 7540    57644   32380   489346  1.468e9 s=500
2421285 2421285 24671   15147   14446   128 256 7524    57459   32270   489346  1.468e9 s=1000
2432260 2432260 27527   16550   12951   128 256 6276    53538   29739   484560  1.468e9 s=2000
2500728 2500728 85892   55418   3909    128 256 2866    19568   12223   258793  1.472e9 s=5000
2521566 2521566 106463  75877   3925    128 256 2874    9085    6755    146598  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2420026 2420026 24634   15126   14467   128 256 7540    57644   32379   489346  1.468e9 s=100
Best scaffold N50 is 7540 at s=100.
$ abyss-scaffold -k128 -s100-10000 -n 30 -G 1267403131 ...
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2422212 2422212 25556   15712   13998   128 256 7245    55383   31216   438641  1.468e9 s=100
2422227 2422227 25556   15712   13998   128 256 7245    55383   31216   438641  1.468e9 s=200
2422278 2422278 25558   15713   13998   128 256 7244    55372   31216   438641  1.468e9 s=500
2423258 2423258 25591   15731   13982   128 256 7226    55303   31148   438641  1.468e9 s=1000
2433673 2433673 28412   17114   12591   128 256 6089    51685   28736   388962  1.468e9 s=2000
2501115 2501115 86279   55805   3909    128 256 2866    19255   12008   258793  1.472e9 s=5000
2521610 2521610 106507  75921   3925    128 256 2874    9085    6730    146598  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2422212 2422212 25556   15712   13998   128 256 7245    55383   31216   438641  1.468e9 s=100
Best scaffold N50 is 7245 at s=100.
$ abyss-scaffold -k128 -s100-10000 -n 33 -G 1267403131 ...
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2425580 2425580 27035   16622   13246   128 256 6801    52194   29646   438641  1.468e9 s=100
2425594 2425594 27035   16622   13246   128 256 6801    52194   29646   438641  1.468e9 s=200
2425630 2425630 27036   16622   13247   128 256 6802    52194   29645   438641  1.468e9 s=500
2426410 2426410 27075   16639   13235   128 256 6776    52084   29591   438641  1.468e9 s=1000
2435904 2435904 29822   17988   12016   128 256 5803    48650   27440   388962  1.468e9 s=2000
2501703 2501703 86832   56373   3910    128 256 2867    18759   11695   258793  1.472e9 s=5000
2521681 2521681 106578  75992   3925    128 256 2874    9085    6692    146598  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2425630 2425630 27036   16622   13247   128 256 6802    52194   29645   438641  1.468e9 s=500
Best scaffold N50 is 6802 at s=500.
$ abyss-scaffold -k128 -s100-10000 -n 37 -G 1267403131 ...
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2429647 2429647 28902   17756   12459   128 256 6337    48756   27712   438641  1.468e9 s=100
2429660 2429660 28902   17756   12459   128 256 6337    48756   27712   438641  1.468e9 s=200
2429692 2429692 28904   17757   12460   128 256 6337    48756   27710   438641  1.468e9 s=500
2430252 2430252 28931   17770   12444   128 256 6329    48739   27668   438641  1.468e9 s=1000
2438755 2438755 31606   19064   11393   128 256 5380    45863   25907   388962  1.469e9 s=2000
2502349 2502349 87478   57019   3910    128 256 2867    18213   11346   258793  1.472e9 s=5000
2521775 2521775 106672  76086   3925    128 256 2874    9086    6632    146598  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2429647 2429647 28902   17756   12459   128 256 6337    48756   27712   438641  1.468e9 s=100
Best scaffold N50 is 6337 at s=100.
$ abyss-scaffold -k128 -s100-10000 -n 42 -G 1267403131 ...
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2434921 2434921 31449   19232   11473   128 256 5646    45012   25712   434149  1.468e9 s=100
2434934 2434934 31449   19232   11473   128 256 5646    45012   25712   434149  1.468e9 s=200
2434958 2434958 31447   19232   11473   128 256 5647    45012   25712   434149  1.468e9 s=500
2435302 2435302 31472   19242   11471   128 256 5638    44967   25701   434149  1.468e9 s=1000
2442448 2442448 33990   20405   10651   128 256 4804    42721   24346   388962  1.469e9 s=2000
2503007 2503007 88136   57677   3910    128 256 2867    17748   11016   258793  1.472e9 s=5000
2521861 2521861 106758  76172   3925    128 256 2874    9086    6583    146598  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2434958 2434958 31447   19232   11473   128 256 5647    45012   25712   434149  1.468e9 s=500
Best scaffold N50 is 5647 at s=500.
$ abyss-scaffold -k128 -s100-10000 -n 46 -G 1267403131 ...
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2439093 2439093 33648   20462   10772   128 256 5035    42248   24208   371546  1.469e9 s=100
2439102 2439102 33648   20462   10772   128 256 5035    42248   24208   371546  1.469e9 s=200
2439119 2439119 33647   20465   10771   128 256 5037    42248   24208   371546  1.469e9 s=500
2439337 2439337 33668   20471   10767   128 256 5029    42248   24200   371546  1.469e9 s=1000
2445432 2445432 36041   21509   10118   128 256 4366    40536   23173   304925  1.469e9 s=2000
2503463 2503463 88592   58133   3910    128 256 2867    17437   10801   258793  1.472e9 s=5000
2521913 2521913 106810  76224   3925    128 256 2874    9086    6547    137016  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2439119 2439119 33647   20465   10771   128 256 5037    42248   24208   371546  1.469e9 s=500
Best scaffold N50 is 5037 at s=500.
$ abyss-scaffold -k128 -s100-10000 -n 50 -G 1267403131 ...
warning: Removed 5458 invalid edges.
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2443246 2443246 36057   21690   10157   128 256 4443    39781   22920   344974  1.469e9 s=100
2443254 2443254 36057   21690   10157   128 256 4443    39781   22920   344974  1.469e9 s=200
2443270 2443270 36057   21690   10157   128 256 4444    39781   22920   344974  1.469e9 s=500
2443397 2443397 36071   21696   10156   128 256 4440    39767   22913   344974  1.469e9 s=1000
2448397 2448397 38210   22592   9586    128 256 4040    38564   22158   343577  1.469e9 s=2000
2503831 2503831 88960   58501   3910    128 256 2867    17171   10644   258793  1.472e9 s=5000
2521961 2521961 106858  76272   3925    128 256 2874    9086    6526    137016  1.473e9 s=10000
n   n:100   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
2443270 2443270 36057   21690   10157   128 256 4444    39781   22920   344974  1.469e9 s=500
Best scaffold N50 is 4444 at s=500.
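To compare the runs above at a glance, the sweep can be condensed to one line per value of -n. This is only a sketch: it assumes the log interleaves the "$ abyss-scaffold ... -n N ..." command lines with the "Best scaffold N50 is X at s=Y." lines exactly as pasted here, and the function name summarise_sweep is made up for illustration.

```sh
# Summarise an abyss-scaffold parameter sweep: one output line per -n value.
# Assumption: command lines and "Best scaffold N50" lines alternate as in
# the log pasted above. Reads the named files, or stdin if none are given.
summarise_sweep() {
    awk '
        # Remember the -n value from the most recent abyss-scaffold command.
        /abyss-scaffold/ {
            for (i = 1; i < NF; i++)
                if ($i == "-n") n = $(i + 1)
        }
        # Report the best N50 and the s at which it was reached.
        /^Best scaffold N50 is/ {
            s = $NF
            sub(/^s=/, "", s)   # drop the leading "s="
            sub(/\.$/, "", s)   # drop the trailing full stop
            printf "n=%s\ts=%s\tN50=%s\n", n, s, $5
        }
    ' "$@"
}
```

Called as summarise_sweep on a saved copy of the log (the file name is up to you), it prints tab-separated n, s and N50 values, which makes the steady drop in N50 with increasing n easier to see.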
mmokrejs commented 6 years ago
$ abyss-pe FIXMATE_OPTIONS=--qname N=10 ...
make: Nothing to be done for 'pe-bam'.
make: Nothing to be done for 'mp-bam'.
PathConsensus -v --dot -k128  -p0.9  -s tt_16D1C3L12__abyss_128-7.fa -g tt_16D1C3L12__abyss_128-7.dot -o tt_16D1C3L12__abyss_128-7.path tt_16D1C3L12__abyss_128-6.fa tt_16D1C3L12__abyss_128-6.dot tt_16D1C3L12__abyss_128-6.path
Reading `tt_16D1C3L12__abyss_128-6.dot'...
Reading `tt_16D1C3L12__abyss_128-6.fa'...
Reading `tt_16D1C3L12__abyss_128-6.path'...
Read 27839 paths
Ambiguous paths: 74909
Merged:          5953
No paths:        57621
Too many paths:  1236
Too complex:     8452
Dissimilar:      1647
cat tt_16D1C3L12__abyss_128-6.fa tt_16D1C3L12__abyss_128-7.fa \
    |MergeContigs -v  -k128 -o tt_16D1C3L12__abyss_128-8.fa - tt_16D1C3L12__abyss_128-7.dot tt_16D1C3L12__abyss_128-7.path
Reading `tt_16D1C3L12__abyss_128-7.dot'...
Read 5051014 vertices. Using 697 MB of memory.
Reading `-'...
Read 2525507 sequences. Using 2.52 GB of memory.
Reading `tt_16D1C3L12__abyss_128-7.path'...
Read 31677 paths. Using 2.52 GB of memory.
warning: the head of 7574481+ does not match the tail of the previous contig
AAATAACGACTGTTGGGATTTACTAAAGACGCGCAATTGATCATTAGTGCTGAAAAGGTGTGGTCTACACTGTAAAACCTAACAGTTAAATCATCTCAAACCATTTAAGGAAATCGGTTGCCTTAAA
ggttgccttAAAccgTttaagttttaAAMSACWKTTGaGKAcTKTgaACTWRAGtaAYGYGCAATtaTGcAcTYATatttaagTAGTGtgaaCTKAAAtatARGTGcataacTGcRTMTgACWCaat
7445495+ 675N 7420534- 7574481+ 7543010-
The minimum coverage of single-end contigs is 1.22115.
The minimum coverage of merged contigs is 3.81982.
Consider increasing the coverage threshold parameter, c, to 3.81982.
n   n:200   L50 min N80 N50 N20 E-size  max sum name
2429899 1478929 24851   200 340 8509    44500   25427   346751  1.329e9 tt_16D1C3L12__abyss_128-8.fa
time user=0.00s system=3.31s elapsed=127.47s cpu=2% memory=4 job=
time user=107.04s system=42.52s elapsed=154.07s cpu=97% memory=2189 job=
ln -sf tt_16D1C3L12__abyss_128-8.fa tt_16D1C3L12__abyss_128-scaffolds.fa
PathOverlap --overlap -v  -k128 --dot tt_16D1C3L12__abyss_128-7.dot tt_16D1C3L12__abyss_128-7.path >tt_16D1C3L12__abyss_128-8.dot
Reading `tt_16D1C3L12__abyss_128-7.dot'...
Reading `tt_16D1C3L12__abyss_128-7.path'...
ln -sf tt_16D1C3L12__abyss_128-8.dot tt_16D1C3L12__abyss_128-scaffolds.dot
abyss-fac   tt_16D1C3L12__abyss_128-unitigs.fa tt_16D1C3L12__abyss_128-contigs.fa tt_16D1C3L12__abyss_128-scaffolds.fa |tee tt_16D1C3L12__abyss_128-stats.tab
n   n:500   L50 min N80 N50 N20 E-size  max sum name
3282085 479704  87706   500 1203    2936    6202    4112    57779   921.7e6 tt_16D1C3L12__abyss_128-unitigs.fa
2524721 320183  47968   500 2439    5684    11904   7801    84004   972.2e6 tt_16D1C3L12__abyss_128-contigs.fa
2429899 237545  11119   500 2982    19882   57895   34719   346751  970e6   tt_16D1C3L12__abyss_128-scaffolds.fa
time user=21.22s system=7.00s elapsed=28.88s cpu=97% memory=4 job=
time user=0.00s system=0.01s elapsed=28.88s cpu=0% memory=0 job=
tr '\t' , <tt_16D1C3L12__abyss_128-stats.tab >tt_16D1C3L12__abyss_128-stats.csv
abyss-tabtomd tt_16D1C3L12__abyss_128-stats.tab >tt_16D1C3L12__abyss_128-stats.md

Something went wrong I guess:

--- old.stats   2018-03-30 18:46:41.000000000 +0200
+++ new.stats   2018-03-30 18:46:27.000000000 +0200
@@ -1,9 +1,9 @@
 n  n:500   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
 7278490    540152  103434  200909  1417    500 1008    2430    5164    3450    42126   905.4e6 tt_16D1C3L12__abyss_128-1.fa
 3606504    540152  103434  200909  1417    500 1008    2430    5164    3450    42126   905.4e6 tt_16D1C3L12__abyss_128-2.fa
 3282085    479704  87706   162732  1801    500 1203    2936    6202    4112    57779   921.7e6 tt_16D1C3L12__abyss_128-3.fa
 8089   0   0   0   0   0   0   0   0   0   0   0   tt_16D1C3L12__abyss_128-4.fa
 5252   0   0   0   0   0   0   0   0   0   0   0   tt_16D1C3L12__abyss_128-5.fa
 2524721    320183  47968   79411   3892    500 2439    5684    11904   7801    84004   972.2e6 tt_16D1C3L12__abyss_128-6.fa
-1093   563 182 563 503 503 784 1227    1854    1371    3813    624124  tt_16D1C3L12__abyss_128-7.fa
-2383961    195687  4463    8913    23272   500 7332    48678   144350  83684   910150  968.8e6 tt_16D1C3L12__abyss_128-8.fa
+786    374 122 374 503 503 755 1174    1755    1287    2911    396799  tt_16D1C3L12__abyss_128-7.fa
+2429899    237545  11119   21569   10231   500 2982    19882   57895   34719   346751  970e6   tt_16D1C3L12__abyss_128-8.fa

Given that abyss-pe --help outputs the help text of GNU make, I am a bit lost as to how to pass the parameter c down to resolve this:

The minimum coverage of single-end contigs is 1.22115.
The minimum coverage of merged contigs is 3.81982.
Consider increasing the coverage threshold parameter, c, to 3.81982.
sjackman commented 6 years ago
abyss-pe help
man abyss-pe

https://github.com/bcgsc/abyss/#assembly-parameters

abyss-pe c=4 …

ABySS picks the values for c and e automatically based on the data. It reports much earlier in the log the values that it picked.

mmokrejs commented 6 years ago

Aha, meanwhile I blindly tried:

$ abyss-pe FIXMATE_OPTIONS=--qname N=10 c=3.81982 v=-v np=104 j=104 k=128  ...
make: Nothing to be done for 'pe-bam'.
make: Nothing to be done for 'mp-bam'.
PathConsensus -v --dot -k128  -p0.9  -s tt_16D1C3L12__abyss_128-7.fa -g tt_16D1C3L12__abyss_128-7.dot -o tt_16D1C3L12__abyss_128-7.path tt_16D1C3L12__abyss_128-6.fa tt_16D1C3L12__abyss_128-6.dot tt_16D1C3L12__abyss_128-6.path
Reading `tt_16D1C3L12__abyss_128-6.dot'...
Reading `tt_16D1C3L12__abyss_128-6.fa'...
Reading `tt_16D1C3L12__abyss_128-6.path'...
Read 27839 paths
Ambiguous paths: 74909
Merged:          5953
No paths:        57621
Too many paths:  1236
Too complex:     8452
Dissimilar:      1647
cat tt_16D1C3L12__abyss_128-6.fa tt_16D1C3L12__abyss_128-7.fa \
    |MergeContigs -v  -k128 -o tt_16D1C3L12__abyss_128-8.fa - tt_16D1C3L12__abyss_128-7.dot tt_16D1C3L12__abyss_128-7.path
Reading `tt_16D1C3L12__abyss_128-7.dot'...
Read 5051014 vertices. Using 697 MB of memory.
Reading `-'...
Read 2525507 sequences. Using 2.52 GB of memory.
Reading `tt_16D1C3L12__abyss_128-7.path'...
Read 31677 paths. Using 2.52 GB of memory.
warning: the head of 7574481+ does not match the tail of the previous contig
AAATAACGACTGTTGGGATTTACTAAAGACGCGCAATTGATCATTAGTGCTGAAAAGGTGTGGTCTACACTGTAAAACCTAACAGTTAAATCATCTCAAACCATTTAAGGAAATCGGTTGCCTTAAA
ggttgccttAAAccgTttaagttttaAAMSACWKTTGaGKAcTKTgaACTWRAGtaAYGYGCAATtaTGcAcTYATatttaagTAGTGtgaaCTKAAAtatARGTGcataacTGcRTMTgACWCaat
7445495+ 675N 7420534- 7574481+ 7543010-
The minimum coverage of single-end contigs is 1.22115.
The minimum coverage of merged contigs is 3.81982.
Consider increasing the coverage threshold parameter, c, to 3.81982.
n   n:200   L50 min N80 N50 N20 E-size  max sum name
2429899 1478929 24851   200 340 8509    44500   25427   346751  1.329e9 tt_16D1C3L12__abyss_128-8.fa
time user=0.00s system=100.41s elapsed=567.83s cpu=17% memory=4 job=
time user=105.20s system=483.96s elapsed=589.66s cpu=99% memory=2189 job=
ln -sf tt_16D1C3L12__abyss_128-8.fa tt_16D1C3L12__abyss_128-scaffolds.fa
PathOverlap --overlap -v  -k128 --dot tt_16D1C3L12__abyss_128-7.dot tt_16D1C3L12__abyss_128-7.path >tt_16D1C3L12__abyss_128-8.dot
Reading `tt_16D1C3L12__abyss_128-7.dot'...
Reading `tt_16D1C3L12__abyss_128-7.path'...
ln -sf tt_16D1C3L12__abyss_128-8.dot tt_16D1C3L12__abyss_128-scaffolds.dot
abyss-fac   tt_16D1C3L12__abyss_128-unitigs.fa tt_16D1C3L12__abyss_128-contigs.fa tt_16D1C3L12__abyss_128-scaffolds.fa |tee tt_16D1C3L12__abyss_128-stats.tab
n   n:500   L50 min N80 N50 N20 E-size  max sum name
3282085 479704  87706   500 1203    2936    6202    4112    57779   921.7e6 tt_16D1C3L12__abyss_128-unitigs.fa
2524721 320183  47968   500 2439    5684    11904   7801    84004   972.2e6 tt_16D1C3L12__abyss_128-contigs.fa
2429899 237545  11119   500 2982    19882   57895   34719   346751  970e6   tt_16D1C3L12__abyss_128-scaffolds.fa
time user=21.07s system=5.81s elapsed=27.97s cpu=96% memory=4 job=
time user=0.00s system=0.00s elapsed=27.97s cpu=0% memory=0 job=
ln -sf tt_16D1C3L12__abyss_128-stats.tab tt_16D1C3L12__abyss_128-stats
tr '\t' , <tt_16D1C3L12__abyss_128-stats.tab >tt_16D1C3L12__abyss_128-stats.csv
abyss-tabtomd tt_16D1C3L12__abyss_128-stats.tab >tt_16D1C3L12__abyss_128-stats.md

If the issue is that I did not round the value, please fix the warning message, or round on the fly.

Yeah, thanks for the note on where to find the help text. I knew that we had already discussed that, but I forgot the syntax. Still, maybe the logged text message could be improved?

Anyway, after retrying with

$ rm  tt_16D1C3L12__abyss_128-7.fa  tt_16D1C3L12__abyss_128-7.dot  tt_16D1C3L12__abyss_128-8.fa  tt_16D1C3L12__abyss_128-scaffolds.fa  tt_16D1C3L12__abyss_128-scaffolds.dot  tt_16D1C3L12__abyss_128-8.dot  tt_16D1C3L12__abyss_128-stats.tab  tt_16D1C3L12__abyss_128-stats.md  tt_16D1C3L12__abyss_128-stats.csv  tt_16D1C3L12__abyss_128-stats
$ abyss-pe FIXMATE_OPTIONS=--qname N=10 c=4 ...

I am getting the same results, though.

sjackman commented 6 years ago

c can take any real value. e can take only integer values. You would need to start the assembly over from the very beginning to change the values of e and c, which are parameters of the unitig assembler. I don't usually change these parameters, and use the default values.

mmokrejs commented 6 years ago

OK, I suspected a bit that this is about the contigging/unitigging step, so please improve the message. I am fine with rerunning the whole assembly, as it does make sense to throw away contigs with low coverage.

In my opinion, the manpage is hard to understand for someone who is not an assembler developer:

       c      minimum mean k-mer coverage of a unitig [sqrt(median)]

       e      minimum erosion k-mer coverage [round(sqrt(median))]

       E      minimum erosion k-mer coverage per strand [1 if sqrt(median) > 2 else 0]
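Read literally, the bracketed defaults are formulas in the median k-mer coverage. A minimal sketch of that arithmetic follows; the function name abyss_defaults is made up, and printing c to three significant digits is an assumption about how ABySS formats the value, not something the manpage states.

```sh
# Reproduce the documented defaults for c, e and E from a median k-mer coverage.
# The formulas are the ones quoted from the manpage above; the output format
# (three significant digits for c) is an assumption.
abyss_defaults() {
    awk -v median="$1" 'BEGIN {
        root = sqrt(median)
        c = root                    # c: sqrt(median)
        e = int(root + 0.5)         # e: round(sqrt(median))
        E = (root > 2) ? 1 : 0      # E: 1 if sqrt(median) > 2 else 0
        printf "c=%.3g e=%d E=%d\n", c, e, E
    }'
}

abyss_defaults 40     # prints: c=6.32 e=6 E=1
abyss_defaults 100    # prints: c=10 e=10 E=1
```

So for a median k-mer coverage of 40 the defaults work out to c=6.32, e=6 and E=1, and for a median of 100 to c=10, e=10 and E=1.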
sjackman commented 6 years ago

ABySS detects suitable values for c and e and does discard unitigs with low coverage. You shouldn't need to change these values. If you go back to the unitig log file, you can find out what values ABySS chose for these parameters.

mmokrejs commented 6 years ago

So what line should I grep for in the abyss-pe v=-v log file?

sjackman commented 6 years ago
Loaded 8243644 k-mer
Hash load: 8243644 / 33554432 = 0.246 using 2.93 GB
Minimum k-mer coverage is 21
Coverage: 21    Reconstruction: 210077
Coverage: 10    Reconstruction: 214189
Coverage: 10    Reconstruction: 214189
Using a coverage threshold of 10...
The median k-mer coverage is 100
The reconstruction is 214189
The k-mer coverage threshold is 10
Setting parameter e (erode) to 10
Setting parameter E (erodeStrand) to 1
Setting parameter c (coverage) to 10
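One way to pull just these lines out of a long verbose log is a grep like the one below. It is a sketch only: the patterns are copied from the excerpt above, the function name show_coverage_params is made up, and the log file name depends on how you captured the abyss-pe output.

```sh
# Print the coverage parameters ABySS reported in a verbose log.
# The patterns are taken from the excerpt above. Pass your own log file,
# or pipe the log in on stdin.
show_coverage_params() {
    grep -E 'median k-mer coverage|k-mer coverage threshold|Setting parameter [eEc] ' "$@"
}

# Example (hypothetical file name): show_coverage_params abyss-pe.log
```

For the excerpt above this keeps the median coverage line, the threshold line and the three "Setting parameter" lines.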
mmokrejs commented 6 years ago
Minimum k-mer coverage is 10
0: Coverage: 10 Reconstruction: 1237153695
0: Coverage: 6.4        Reconstruction: 1285143208
0: Coverage: 6.32       Reconstruction: 1285143208
Using a coverage threshold of 6...
The median k-mer coverage is 40
The reconstruction is 1285143208
The k-mer coverage threshold is 6.32
Setting parameter e (erode) to 6
Setting parameter E (erodeStrand) to 1
Setting parameter c (coverage) to 6.32
Finding adjacent k-mer...
mmokrejs commented 6 years ago
$ abyss-pe FIXMATE_OPTIONS=--qname N=10 c=4 v=-v np=104 j=104 k=128 ...
...
warning: -c,--coverage was specified, but -e,--erode was not specified
Previously, the default was -e2 (or --erode=2).
ABySS 2.0.3
ABYSS-P -k128 -q3 -c4 -v ...

So the original message, Consider increasing the coverage threshold parameter, c, to 3.81982, was incomplete and did not tell me that I am supposed to alter the e value too.

When abyss-pe detects c=4 in its input, it would be wise for it to print the values it ended up using. @sjackman Please make the messages clearer. I do not mind that abyss is not a fully automated solution, but the messages to the user should not state just half of the truth. Thank you.

sjackman commented 6 years ago

Since ABySS autodetected and used c=6.32, I wouldn't recommend decreasing it to c=4.

mmokrejs commented 6 years ago

But didn't the output from MergeContigs say to increase the threshold to increase strictness?

The minimum coverage of single-end contigs is 1.22115.
The minimum coverage of merged contigs is 3.81982.
Consider increasing the coverage threshold parameter, c, to 3.81982.

Then the messages printed in step 7 of the assembly (https://github.com/bcgsc/abyss/issues/187#issuecomment-377565339) are just wrong.

sjackman commented 6 years ago

Yep, it is pretty much wrong. The message was originally written to be printed after stage 3 of the assembly. The same tool, MergeContigs, was later also used at stages 6 and 8 of the assembly and still prints that message, though it is not accurate at those later stages.

mmokrejs commented 6 years ago

Thank you for the clarification, now I am starting to understand.

Meanwhile the new job from https://github.com/bcgsc/abyss/issues/187#issuecomment-377696140 with abyss-pe ... N=10 c=4 ... progressed and reported:

Loaded 11630646625 k-mer. At least 837 GB of RAM is required.
Minimum k-mer coverage is 10
0: Coverage: 10 Reconstruction: 1237153695
0: Coverage: 6.4        Reconstruction: 1285143208
0: Coverage: 6.32       Reconstruction: 1285143208
Using a coverage threshold of 6...
The median k-mer coverage is 40
The reconstruction is 1285143208
The k-mer coverage threshold is 6.32
Setting parameter e (erode) to 6
Setting parameter E (erodeStrand) to 1
Finding adjacent k-mer...

The line Setting parameter c (coverage) to 6.32 is missing, either because of the 2.0.2 to 2.0.3 upgrade or because of some logic behind it.

Here is more from the abyss-pe ... N=10 c=4 ... job progressing now:

$ abyss-fac -G 1267403131 tt_16D1C3L12__abyss_128-[0-9].fa
n   n:500   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
8420759 540881  104722  206957  1371    500 1000    2388    5040    3376    42131   897.4e6 tt_16D1C3L12__abyss_128-1.fa
3990013 540881  104722  206957  1371    500 1000    2388    5040    3376    42131   897.4e6 tt_16D1C3L12__abyss_128-2.fa
3659188 478583  87983   165909  1758    500 1197    2906    6126    4072    57779   914.4e6 tt_16D1C3L12__abyss_128-3.fa

and here is the original result with autodetected values (c=6.32, cannot say what the N= really was):

$ abyss-fac -G 1267403131 tt_16D1C3L12__abyss_128-[0-9].fa
n   n:500   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
7278490 540152  103434  200909  1417    500 1008    2430    5164    3450    42126   905.4e6 tt_16D1C3L12__abyss_128-1.fa
3606504 540152  103434  200909  1417    500 1008    2430    5164    3450    42126   905.4e6 tt_16D1C3L12__abyss_128-2.fa
3282085 479704  87706   162732  1801    500 1203    2936    6202    4112    57779   921.7e6 tt_16D1C3L12__abyss_128-3.fa

I stopped this abyss-pe ... N=10 c=4 ... job and started a new one with just abyss-pe ... N=10 ....

mmokrejs commented 6 years ago

Interestingly, the job with abyss-pe ... N=10 ... needed 1TB of RAM, but the results are exactly the same as before:

$ abyss-fac -G 1267403131 tt_16D1C3L12__abyss_128-[0-9].fa
n   n:500   L50 LG50    NG50    min N80 N50 N20 E-size  max sum name
7278490 540152  103434  200909  1417    500 1008    2430    5164    3450    42126   905.4e6 tt_16D1C3L12__abyss_128-1.fa
3606507 540152  103434  200909  1417    500 1008    2430    5164    3450    42126   905.4e6 tt_16D1C3L12__abyss_128-2.fa
3282088 479704  87706   162732  1801    500 1203    2936    6202    4112    57779   921.7e6 tt_16D1C3L12__abyss_128-3.fa
8089    0   0   0   0   0   0   0   0   0   0   0   tt_16D1C3L12__abyss_128-4.fa
5248    0   0   0   0   0   0   0   0   0   0   0   tt_16D1C3L12__abyss_128-5.fa
2524714 320183  47968   79412   3892    500 2439    5684    11904   7801    84004   972.2e6 tt_16D1C3L12__abyss_128-6.fa
1093    563 182 563 503 503 784 1227    1854    1371    3813    624124  tt_16D1C3L12__abyss_128-7.fa
2383939 195685  4463    8913    23272   500 7332    48678   144350  83684   910150  968.8e6 tt_16D1C3L12__abyss_128-8.fa

N=10 is the default, as stated in https://github.com/bcgsc/abyss/issues/187#issuecomment-377417061, so the above is not a surprise, except for the higher memory footprint. But maybe that has to do with the upgrade to abyss-2.0.3, which is designed to have a larger footprint.

sjackman commented 6 years ago

The line Setting parameter c (coverage) to 6.32 is missing. Either due to 2.0.2 to 2.0.3 upgrade or due to some logic behind.

The message is missing because you have specified c=4, so it's using the specified c=4 and not using c=6.32.

lsterck commented 5 years ago

Apologies for reviving an old thread, but my recent experience has taught me to also 'play with/optimise' the n for the mapping step. @mmokrejs, I was roughly in the same situation as you are/were (i.e. seeing very little difference when changing N for scaffolding/contig building). After several tries I figured out that abyss most likely already filters out way too much data at the DistanceEst step, and that data consequently never reaches the contigging/scaffolding step (see this thread https://github.com/bcgsc/abyss/issues/258 for more info).

sjackman commented 5 years ago

Yes, you can set n to 1 for DistanceEst to retain everything, so that you have more control at later steps.