dcposch opened this issue 6 years ago
i tried dat share with just a few hundred wiki articles, and it went thru them in a few seconds, reporting ~500KB/sec throughput

dat share with 2000 wiki articles slows down to ~150KB/sec throughput, as reported above

the original dat share run, on ~300k articles, has slowed down to 8KB/sec

@mafintosh feross says you know why?
@feross @mafintosh update: even with a small number of entries per directory, it still grinds to a halt after a few hundred MB :/
Any way you could try this on a Mac as well?
@mafintosh sure -- i just ran it on Mac using the profiler

it looks like the issue might be in append-tree

i narrowed it down a bit by adding debug() statements. turns out, for every file dat adds, it waits for process.nextTick 100s (maybe 1000s) of times:
dat-node IMPORT put: /A/Alan_Bennett.html +63ms
append-tree cache hit, waiting for next tick: 227 +1ms
append-tree cache hit, waiting for next tick: 1 +0ms
append-tree cache hit, waiting for next tick: 2 +0ms
append-tree cache hit, waiting for next tick: 3 +0ms
append-tree cache hit, waiting for next tick: 4 +0ms
append-tree cache hit, waiting for next tick: 5 +0ms
append-tree cache hit, waiting for next tick: 6 +0ms
append-tree cache hit, waiting for next tick: 7 +0ms
append-tree cache hit, waiting for next tick: 8 +0ms
append-tree cache hit, waiting for next tick: 9 +0ms
append-tree cache hit, waiting for next tick: 10 +0ms
append-tree cache hit, waiting for next tick: 11 +0ms
append-tree cache hit, waiting for next tick: 12 +0ms
append-tree cache hit, waiting for next tick: 13 +0ms
append-tree cache hit, waiting for next tick: 14 +0ms
append-tree cache hit, waiting for next tick: 15 +0ms
append-tree cache hit, waiting for next tick: 16 +0ms
append-tree cache hit, waiting for next tick: 17 +0ms
append-tree cache hit, waiting for next tick: 18 +0ms
append-tree cache hit, waiting for next tick: 19 +0ms
append-tree cache hit, waiting for next tick: 20 +0ms
append-tree cache hit, waiting for next tick: 21 +0ms
append-tree cache hit, waiting for next tick: 22 +0ms
append-tree cache hit, waiting for next tick: 23 +0ms
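The scale of those deferrals is easy to model outside dat. This sketch (none of it is append-tree code; the names are made up for illustration) shows that a chain of n cached lookups, each deferring its callback with process.nextTick, costs n separate microtask turns:

```javascript
// Minimal sketch: each "cache hit" defers its callback with process.nextTick,
// so n sequential lookups cost n event-loop microtask turns.
function lookup(cache, seq, cb) {
  const cached = cache.get(seq)
  if (cached) return process.nextTick(cb, null, cached) // one deferred turn per hit
  cb(new Error('miss: ' + seq))
}

// Walk seq 0..n-1 one after another, as a sequential tree traversal would.
function walk(cache, n, done) {
  let ticks = 0
  ;(function next(i) {
    if (i === n) return done(null, ticks)
    lookup(cache, i, function (err) {
      if (err) return done(err)
      ticks++ // this callback ran in its own nextTick turn
      next(i + 1)
    })
  })(0)
}

const cache = new Map()
for (let i = 0; i < 1000; i++) cache.set(i, { seq: i })

walk(cache, 1000, function (err, ticks) {
  console.log('deferred turns:', ticks) // 1000 turns for 1000 cached reads
})
```

Each turn is cheap on its own, but paying hundreds of them per imported file adds up to the per-file latency visible in the log above.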
slightly modified append-tree:
Tree.prototype._getAndDecode = function (seq, opts, cb) {
  if (opts && opts.cached) opts.wait = false
  var self = this
  var cached = this._cache && this._cache.get(seq)
  if (cached) {
    debug('cache hit, waiting for next tick: ' + seq)
    return nextTick(cb, null, cached, seq)
  }
  debug('cache miss: ' + seq)
  this.feed.get(seq, opts, function (err, value) {
    if (err) return cb(err)
    var node = new Node(messages.Node.decode(value), seq)
    if (self._cache) self._cache.set(seq, node)
    cb(null, node, seq)
  })
}
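Since the value is already in memory on a cache hit, the nextTick above exists only to keep the callback asynchronous. A hypothetical mitigation (my own sketch, not the actual append-tree fix; all names here are invented) is to resolve hits synchronously inside an internal traversal and defer only once at the public API boundary:

```javascript
// Hypothetical variant: deliver cache hits synchronously during an internal
// traversal, and defer exactly once at the public entry point, instead of
// paying one process.nextTick per cached node.
function makeGetter(cache, fetch) {
  function getSync(seq, cb) {
    const cached = cache.get(seq)
    if (cached) return cb(null, cached) // synchronous on hit: no scheduling cost
    fetch(seq, function (err, value) {
      if (err) return cb(err)
      cache.set(seq, value)
      cb(null, value)
    })
  }
  // Public API: one deferral total, however many cached nodes a
  // traversal built on getSync ends up touching.
  return function get(seq, cb) {
    process.nextTick(getSync, seq, cb)
  }
}
```

The catch is that calling back synchronously on a hit is exactly the "Zalgo" hazard (callbacks that are sometimes sync, sometimes async) that the nextTick guards against, so the durable win has to come from touching fewer nodes per operation, not just from cheaper scheduling.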
this looks bad

after further investigation, i found out that deep in the dat internals, the thing that actually records a newly added file to the hypercore log is append-tree._put():

this function runs in O(n) time, appending a log entry of size O(n), if the folder contains n files

this means that adding n files to a folder, one after another, takes O(n^2) time and produces a hypercore feed with O(n) entries but taking up O(n^2) total space
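A back-of-the-envelope model of that analysis (the 20-bytes-per-name figure is an assumption, chosen only to make the totals concrete):

```javascript
// Model of the reported behavior: if the index entry written for a folder
// lists all files currently in it, the i-th add writes an entry of roughly
// i * bytesPerName bytes, so adding files one by one writes
// 1 + 2 + ... + n = n(n+1)/2 names' worth of data in total: O(n^2) space
// (and time) for n files.
function totalIndexBytes(n, bytesPerName) {
  let total = 0
  for (let i = 1; i <= n; i++) total += i * bytesPerName
  return total
}

console.log(totalIndexBytes(1000, 20)) // ~10 MB of index data for 1000 files
console.log(totalIndexBytes(2000, 20)) // ~40 MB: doubling n quadruples the total
```

That quadrupling-on-doubling is the signature of the throughput numbers above: each run slows down the bigger the folder gets.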
...
cc @mafintosh
running dat share with added debug logs in append-tree to demonstrate the issue:

the folder contains ~2500 files totalling 30MB
$ rm -rf .dat && DEBUG=append-tree node ~/code/dat/bin/cli.js share --watch=false
dat v13.9.2
Created new dat in /Users/dc/sample2/.dat
dat://b1ea4e0c8b72573bac844f5d8d50c31645a29434891aa2559a4381ae153090f8
Sharing dat: (stats disabled)
Checking for file updates...
Ctrl+C to Exit
append-tree appending node to hypercore, 58 bytes +0ms
append-tree appending node to hypercore, 65 bytes +10ms
append-tree appending node to hypercore, 60 bytes +2ms
append-tree appending node to hypercore, 71 bytes +3ms
append-tree appending node to hypercore, 71 bytes +2ms
append-tree appending node to hypercore, 81 bytes +3ms
append-tree appending node to hypercore, 71 bytes +6ms
append-tree appending node to hypercore, 72 bytes +3ms
...
append-tree appending node to hypercore, 439 bytes +6ms
append-tree appending node to hypercore, 425 bytes +5ms
append-tree appending node to hypercore, 419 bytes +5ms
dat v13.9.2
Created new dat in /Users/dc/sample2/.dat
dat://b1ea4e0c8b72573bac844f5d8d50c31645a29434891aa2559a4381ae153090f8
Sharing dat: (stats disabled)
Metadata created for 357 of 1110 files (2.1 MB/s)
(Calculating file count...)
ADD: A/Albanian.html (2.0 KB)
Ctrl+C to Exit
append-tree appending node to hypercore, 433 bytes +8ms
append-tree appending node to hypercore, 439 bytes +4ms
append-tree appending node to hypercore, 433 bytes +6ms
append-tree appending node to hypercore, 432 bytes +4ms
append-tree appending node to hypercore, 428 bytes +4ms
...
append-tree appending node to hypercore, 1352 bytes +11ms
append-tree appending node to hypercore, 1368 bytes +10ms
append-tree appending node to hypercore, 1355 bytes +9ms
append-tree appending node to hypercore, 1353 bytes +9ms
append-tree appending node to hypercore, 1360 bytes +8ms
append-tree appending node to hypercore, 1356 bytes +11ms
append-tree appending node to hypercore, 1360 bytes +11ms
append-tree appending node to hypercore, 1359 bytes +9ms
append-tree appending node to hypercore, 1362 bytes +8ms
append-tree appending node to hypercore, 1359 bytes +9ms
...
append-tree appending node to hypercore, 2122 bytes +16ms
append-tree appending node to hypercore, 2118 bytes +16ms
append-tree appending node to hypercore, 2119 bytes +16ms
append-tree appending node to hypercore, 2124 bytes +16ms
append-tree appending node to hypercore, 2133 bytes +16ms
append-tree appending node to hypercore, 2125 bytes +15ms
append-tree appending node to hypercore, 2137 bytes +17ms
append-tree appending node to hypercore, 2123 bytes +16ms
dat v13.9.2
Created new dat in /Users/dc/sample2/.dat
dat://b1ea4e0c8b72573bac844f5d8d50c31645a29434891aa2559a4381ae153090f8
Sharing dat: (stats disabled)
Creating metadata for 2067 files (799 KB/s)
[=========================================-] 100%
ADD: A/Alzonne.html (1.9 KB)
Ctrl+C to Exit
append-tree appending node to hypercore, 2148 bytes +21ms
append-tree appending node to hypercore, 2140 bytes +16ms
append-tree appending node to hypercore, 2128 bytes +15ms
append-tree appending node to hypercore, 57 bytes +9ms
append-tree appending node to hypercore, 56 bytes +2ms
DCs-MacBook:sample2 dc$
adding n files takes O(n^2) time, O(n^2) space
Hi, just bumping this one after 18 months. Is this something that can be fixed, or is it fundamental? I'm just getting started with dat and was super pumped until I hit this. Honestly I'm still super pumped about dat, but this one makes one intended use infeasible. Here's hoping there is still hope for this issue. Thanks all!
There's a new version of dat coming that improves performance quite a bit. Here's the new module that enables this: https://github.com/mafintosh/hypertrie
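For intuition on why a trie helps, here's a toy model (my own simplification, not hypertrie's actual on-disk format; the fanout and byte sizes are assumptions): a flat folder listing rewrites O(n) names per put, while a fixed-fanout trie rewrites only the O(log n) nodes on the path to the key:

```javascript
// Toy model comparing per-put write sizes for a flat folder index vs a
// fixed-fanout trie index over n files.
function flatEntryBytes(n, bytesPerName) {
  return n * bytesPerName // entry re-lists every file in the folder
}

function trieEntryBytes(n, fanout, bytesPerPointer) {
  // One rewritten node per level on the root-to-key path; each node holds
  // `fanout` child pointers.
  const depth = Math.max(1, Math.ceil(Math.log(n) / Math.log(fanout)))
  return depth * fanout * bytesPerPointer
}

console.log(flatEntryBytes(300000, 20))    // 6,000,000 bytes per put
console.log(trieEntryBytes(300000, 16, 8)) // a few hundred bytes per put
```

With per-put cost roughly constant instead of growing with n, total index size drops from O(n^2) back to roughly O(n log n).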
Great. I don't want to hijack the thread, but I'm also new to DAT, and I'm only using the command line. I would like to test the above module, or the devel version of dat that includes it. The install docs don't seem to cover installing from git, or how to install modules. Does the module supersede/enhance core functionality so it can be installed on top of my current install? If someone points me to some directions, I can really exercise this small-file use case and provide feedback. Thanks!
Current timeline is to release the next dat cli with this new update by the end of the year. Until then, you can use hyperdrive-daemon which has most of the basic functionality. Cc @andrewosh
@okdistribute any update on that release? I have similar data with lots of files I'd like to share, so I've been waiting for this issue to get closed for a while. Thanks
@Koeng101 @andrewosh is ironing out the last of the bugs in hyperdrive-daemon, IIRC. It should be good to start testing if you're okay with potentially needing to rebuild your archives in the future. 😁
@feross and i are creating a demo of Simple English Wikipedia hosted on dat

setup

we are calling dat share on one file, total 1.2GB

results

dat share-ing the dump (1 big file) is fast

dat share-ing the articles (small files) runs very slowly. we are calling dat share on ~300k files, totalling 3GB:

[... runs at about ~150KB/s, hasn't finished yet ...]

(notice that stat-ing all 300k files only takes <4 seconds, so that's not the bottleneck...)

tldr; dat share throughput is 1000x less with small files. these files are about 10KB on average, and dat share is processing just a couple of them per second