dat-ecosystem / dat

:floppy_disk: peer-to-peer sharing & live synchronization of files via command line
https://dat.foundation

`dat share` is slow with lots of files #915

Open dcposch opened 6 years ago

dcposch commented 6 years ago

@feross and I are creating a demo of Simple English Wikipedia hosted on dat

dat share-ing the dump (1 big file) is fast

setup

$ ls -lha
total 1.2G
-rw-rw-r-- 1 ubuntu ubuntu 1.2G Jan  4  2017 wikipedia_en_simple_all_2017-01.zim
[...]

results

we are calling dat share on one file, 1.2 GB total:

$ dat share
[... runs at ~150 MB/s, finishes in < 10 seconds ...]

dat share-ing the articles (small files) runs very slowly

$ du -sh *
252K    -
1.7G    A
1.3G    I
36K M
$ time ls -lha A I/m | wc -l
291382
real    0m3.740s

(notice that stat-ing all 300k files only takes <4 seconds, so that's not the bottleneck...)

we are calling dat share on ~300k files, totalling 3GB:


[... runs at ~150 KB/s, hasn't finished yet ...]

tl;dr: dat share throughput is 1000x lower with small files

these files are about 10KB on average, and dat share is processing just a couple of them per second
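(for scale: at the ~10 KB average, 150 KB/sec works out to only ~15 files/sec, versus the equivalent of thousands of files per second at the 150 MB/sec the single-file case sustains)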

dcposch commented 6 years ago

update: it gets much slower over time

I tried dat share with just a few hundred wiki articles, and it went through them in a few seconds, reporting ~500KB/sec throughput

dat share with 2000 wiki articles slows down to ~150KB/sec throughput, as reported above

the original dat share run, on ~300k articles, has slowed down to 8KB/sec:

[screenshot: dat share progress at 8 KB/sec]
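(a minimal repro sketch for anyone without the wiki dump, using a hypothetical helper script that isn't from the thread: generate a few thousand ~10 KB files, then run dat share in that directory)

// make-smallfiles.js -- hypothetical repro helper, not part of dat
// usage: node make-smallfiles.js <dir> <count>, then `cd <dir> && dat share`
var fs = require('fs')
var path = require('path')

var dir = process.argv[2] || 'smallfiles'
var n = parseInt(process.argv[3] || '2000', 10)

fs.mkdirSync(dir, { recursive: true })
for (var i = 0; i < n; i++) {
  // ~10 KB per file, roughly the average article size reported above
  fs.writeFileSync(path.join(dir, 'article-' + i + '.html'), 'x'.repeat(10 * 1024))
}
console.log('wrote ' + n + ' files to ' + dir)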

dcposch commented 6 years ago

@mafintosh feross says you know why?

dcposch commented 6 years ago

@feross @mafintosh

update

even with a small number of entries per directory, it still grinds to a halt after a few hundred MB :/

[screenshot: dat share progress]

mafintosh commented 6 years ago

Any way you could try this on a Mac as well?


dcposch commented 6 years ago

@mafintosh sure -- I just ran it on Mac using the profiler

it looks like the issue might be in append-tree

[screenshot: CPU profile pointing at append-tree]
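(the exact profiler setup isn't shown in the thread; one way to get a similar profile is Node's built-in V8 profiler, sketched here)

$ node --prof ~/code/dat/bin/cli.js share --watch=false
$ node --prof-process isolate-*.log > profile.txt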

dcposch commented 6 years ago

update: the issue almost certainly involves append-tree

I narrowed it down a bit by adding debug() statements. It turns out that, for every file dat adds, it waits for process.nextTick hundreds (maybe thousands) of times:

log

  dat-node IMPORT put: /A/Alan_Bennett.html +63ms
  append-tree cache hit, waiting for next tick: 227 +1ms
  append-tree cache hit, waiting for next tick: 1 +0ms
  append-tree cache hit, waiting for next tick: 2 +0ms
  append-tree cache hit, waiting for next tick: 3 +0ms
  append-tree cache hit, waiting for next tick: 4 +0ms
  append-tree cache hit, waiting for next tick: 5 +0ms
  append-tree cache hit, waiting for next tick: 6 +0ms
  append-tree cache hit, waiting for next tick: 7 +0ms
  append-tree cache hit, waiting for next tick: 8 +0ms
  append-tree cache hit, waiting for next tick: 9 +0ms
  append-tree cache hit, waiting for next tick: 10 +0ms
  append-tree cache hit, waiting for next tick: 11 +0ms
  append-tree cache hit, waiting for next tick: 12 +0ms
  append-tree cache hit, waiting for next tick: 13 +0ms
  append-tree cache hit, waiting for next tick: 14 +0ms
  append-tree cache hit, waiting for next tick: 15 +0ms
  append-tree cache hit, waiting for next tick: 16 +0ms
  append-tree cache hit, waiting for next tick: 17 +0ms
  append-tree cache hit, waiting for next tick: 18 +0ms
  append-tree cache hit, waiting for next tick: 19 +0ms
  append-tree cache hit, waiting for next tick: 20 +0ms
  append-tree cache hit, waiting for next tick: 21 +0ms
  append-tree cache hit, waiting for next tick: 22 +0ms
  append-tree cache hit, waiting for next tick: 23 +0ms

code

slightly modified append-tree:


Tree.prototype._getAndDecode = function(seq, opts, cb) {
  if (opts && opts.cached) opts.wait = false

  var self = this
  var cached = this._cache && this._cache.get(seq)
  if (cached) {
    // already decoded: defer via nextTick so the callback stays async
    debug('cache hit, waiting for next tick: ' + seq)
    return nextTick(cb, null, cached, seq)
  }

  debug('cache miss: ' + seq)
  this.feed.get(seq, opts, function(err, value) {
    if (err) return cb(err)
    // decode the protobuf-encoded node and memoize it for later lookups
    var node = new Node(messages.Node.decode(value), seq)
    if (self._cache) self._cache.set(seq, node)
    cb(null, node, seq)
  })
}
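(side note, not from the thread: the nextTick deferral itself is cheap, as the sanity check below shows, so the cost is the sheer number of _getAndDecode calls per put, not the deferral)

// sanity check: scheduling 100k process.nextTick callbacks takes milliseconds
function ticks (n, done) {
  if (n === 0) return done()
  process.nextTick(ticks, n - 1, done)
}

var start = Date.now()
ticks(100000, function () {
  console.log('100k nextTicks took', Date.now() - start, 'ms')
})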
dcposch commented 6 years ago

adding n files takes O(n^2) time, O(n^2) space

this looks bad

after further investigation, I found out that deep in the dat internals, the thing that actually records a newly added file to the hypercore log is append-tree._put():

this function runs in O(n) time, appending a log entry of size O(n), if the folder contains n files

this means that adding n files to a folder, one after another, takes O(n^2) time and produces a hypercore feed with O(n) entries but taking up O(n^2) total space
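(a back-of-envelope model of the blowup; the per-reference byte constant is made up rather than append-tree's actual encoding, with ~0.85 bytes/file/entry eyeballed from the log below)

// model: putting file k appends an index entry that references all k files so far
function totalIndexBytes (n, bytesPerRef) {
  var total = 0
  for (var k = 1; k <= n; k++) total += k * bytesPerRef // entry k is O(k)
  return total // = bytesPerRef * n * (n + 1) / 2, i.e. O(n^2)
}

console.log(Math.round(totalIndexBytes(2500, 0.85)))   // ~2.7 MB of index for 2.5k files
console.log(Math.round(totalIndexBytes(300000, 0.85))) // ~38 GB for a 300k-file folder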

...

cc @mafintosh

log

running dat share with added debug logs in append-tree to demonstrate the issue:

the folder contains ~2500 files totalling 30MB

$ rm -rf .dat && DEBUG=append-tree node ~/code/dat/bin/cli.js share --watch=false
dat v13.9.2
Created new dat in /Users/dc/sample2/.dat
dat://b1ea4e0c8b72573bac844f5d8d50c31645a29434891aa2559a4381ae153090f8
Sharing dat: (stats disabled)
Checking for file updates...
Ctrl+C to Exit
  append-tree appending node to hypercore, 58 bytes +0ms
  append-tree appending node to hypercore, 65 bytes +10ms
  append-tree appending node to hypercore, 60 bytes +2ms
  append-tree appending node to hypercore, 71 bytes +3ms
  append-tree appending node to hypercore, 71 bytes +2ms
  append-tree appending node to hypercore, 81 bytes +3ms
  append-tree appending node to hypercore, 71 bytes +6ms
  append-tree appending node to hypercore, 72 bytes +3ms
...
  append-tree appending node to hypercore, 439 bytes +6ms
  append-tree appending node to hypercore, 425 bytes +5ms
  append-tree appending node to hypercore, 419 bytes +5ms
dat v13.9.2
Created new dat in /Users/dc/sample2/.dat
dat://b1ea4e0c8b72573bac844f5d8d50c31645a29434891aa2559a4381ae153090f8
Sharing dat: (stats disabled)

Metadata created for 357 of 1110 files (2.1 MB/s)
(Calculating file count...)
ADD: A/Albanian.html (2.0 KB)

Ctrl+C to Exit
  append-tree appending node to hypercore, 433 bytes +8ms
  append-tree appending node to hypercore, 439 bytes +4ms
  append-tree appending node to hypercore, 433 bytes +6ms
  append-tree appending node to hypercore, 432 bytes +4ms
  append-tree appending node to hypercore, 428 bytes +4ms
...
  append-tree appending node to hypercore, 1352 bytes +11ms
  append-tree appending node to hypercore, 1368 bytes +10ms
  append-tree appending node to hypercore, 1355 bytes +9ms
  append-tree appending node to hypercore, 1353 bytes +9ms
  append-tree appending node to hypercore, 1360 bytes +8ms
  append-tree appending node to hypercore, 1356 bytes +11ms
  append-tree appending node to hypercore, 1360 bytes +11ms
  append-tree appending node to hypercore, 1359 bytes +9ms
  append-tree appending node to hypercore, 1362 bytes +8ms
  append-tree appending node to hypercore, 1359 bytes +9ms
...
  append-tree appending node to hypercore, 2122 bytes +16ms
  append-tree appending node to hypercore, 2118 bytes +16ms
  append-tree appending node to hypercore, 2119 bytes +16ms
  append-tree appending node to hypercore, 2124 bytes +16ms
  append-tree appending node to hypercore, 2133 bytes +16ms
  append-tree appending node to hypercore, 2125 bytes +15ms
  append-tree appending node to hypercore, 2137 bytes +17ms
  append-tree appending node to hypercore, 2123 bytes +16ms
dat v13.9.2
Created new dat in /Users/dc/sample2/.dat
dat://b1ea4e0c8b72573bac844f5d8d50c31645a29434891aa2559a4381ae153090f8
Sharing dat: (stats disabled)

Creating metadata for 2067 files (799 KB/s)
[=========================================-] 100%
ADD: A/Alzonne.html (1.9 KB)

Ctrl+C to Exit
  append-tree appending node to hypercore, 2148 bytes +21ms
  append-tree appending node to hypercore, 2140 bytes +16ms
  append-tree appending node to hypercore, 2128 bytes +15ms
  append-tree appending node to hypercore, 57 bytes +9ms
  append-tree appending node to hypercore, 56 bytes +2ms
DCs-MacBook:sample2 dc$
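(reading the log against the analysis above: the appended entries grow from 58 bytes for the first file to ~2.1 KB by around file 2500, i.e. a bit under one byte of index per file per entry, which is the linear growth described. Extrapolated to the ~300k-article dump, the final entries alone would be ~250 KB each and the metadata feed would total tens of gigabytes, consistent with the throughput collapse reported earlier. Back-of-envelope only, assuming the growth stays linear.)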
gustafson commented 5 years ago

> adding n files takes O(n^2) time, O(n^2) space
>
> this looks bad
>
> after further investigation, I found out that deep in the dat internals, the thing that actually records a newly added file to the hypercore log is append-tree._put():
>
> this function runs in O(n) time, appending a log entry of size O(n), if the folder contains n files

Hi, just bumping this one after 18 months. Is this something that can be fixed, or is it fundamental? I'm just getting started with dat and was super pumped until I hit this. Honestly, I'm still super pumped about dat, but this one makes one intended use infeasible. Here's hoping there is still hope for this issue. Thanks all!

pfrazee commented 5 years ago

There's a new version of dat coming that improves performance quite a bit

RangerMauve commented 5 years ago

Here's the new module that enables this: https://github.com/mafintosh/hypertrie
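For reference, a minimal sketch of the hypertrie API as described in its README (the path and value here are made up); the relevant difference is that the index is a hash array mapped trie, so a put should touch roughly O(log n) entries instead of append-tree's O(n):

var hypertrie = require('hypertrie')

var db = hypertrie('./trie.db', { valueEncoding: 'json' })

// each put rewrites only the trie nodes along one path, not the whole index
db.put('A/Alan_Bennett.html', { size: 2048 }, function (err) {
  if (err) throw err
  db.get('A/Alan_Bennett.html', function (err, node) {
    if (err) throw err
    console.log(node.key, '->', node.value)
  })
})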

gustafson commented 5 years ago

Great. I don't want to hijack the thread, but I'm also new to dat and only using the command line. I'd like to test the above module, or the development version of dat that includes it. The install docs don't seem to cover installing from git or how to install modules. Does the module supersede/enhance core functionality, so it can be installed on top of my current install? If someone points me to some directions, I can really exercise this small-file use case and provide feedback. Thanks!

okdistribute commented 5 years ago

Current timeline is to release the next dat CLI with this new update by the end of the year. Until then, you can use hyperdrive-daemon, which has most of the basic functionality. Cc @andrewosh

Koeng101 commented 4 years ago

@okdistribute any update on that release? I have similar data with lots of files I'd like to share, and have been waiting for this issue to get closed for a while. Thanks!

RangerMauve commented 4 years ago

@Koeng101 @andrewosh is ironing out the last of the bugs in hyperdrive-daemon, IIRC. It should be good to start testing if you're okay with potentially needing to rebuild your archives in the future. 😁