TritonDataCenter / manta-mlive

Manta liveness checker
Mozilla Public License 2.0

intermittent failures with Node v0.10 #10

Open davepacheco opened 6 years ago

davepacheco commented 6 years ago

With a fresh clone using Node 0.10.48 on OS X El Capitan (10.11.6), I see these errors:

$ ./bin/mlive -s 2 -S 15
will stomp over directory at "/dap/stor/mlive"
will test for liveness: reads, writes
assuming 2 metadata shards
using base paths: /dap/stor/mlive
time between requests: 50 ms
maximum outstanding requests: 100
environment:
    MANTA_USER = dap
    MANTA_KEY_ID = 56:f3:e1:56:3d:e6:f7:83:a9:ce:19:5d:62:ba:5c:1f
    MANTA_URL = https://172.26.5.11
    MANTA_TLS_INSECURE = 1
skipping directory tree setup
2018-03-13T18:45:50.069Z: reads okay, writes okay (6/6 ok since start)
2018-03-13T18:46:05.094Z: all okay (150/150 ok since 2018-03-13T18:45:50.069Z)
2018-03-13T18:46:14.110Z: read okay, write partial_outage (82/83 ok since 2018-03-13T18:46:05.094Z)
2018-03-13T18:46:15.113Z: all partial_outage (14/17 ok since 2018-03-13T18:46:14.110Z)
2018-03-13T18:46:17.119Z: all okay (26/30 ok since 2018-03-13T18:46:15.113Z)
2018-03-13T18:46:18.119Z: all partial_outage (11/14 ok since 2018-03-13T18:46:17.119Z)
2018-03-13T18:46:21.127Z: read partial_outage, write okay (31/40 ok since 2018-03-13T18:46:18.119Z)
2018-03-13T18:46:22.127Z: all okay (11/11 ok since 2018-03-13T18:46:21.127Z)
2018-03-13T18:46:23.127Z: read partial_outage, write okay (9/10 ok since 2018-03-13T18:46:22.127Z)
2018-03-13T18:46:24.132Z: all partial_outage (9/11 ok since 2018-03-13T18:46:23.127Z)
2018-03-13T18:46:25.133Z: read partial_outage, write okay (10/11 ok since 2018-03-13T18:46:24.132Z)
2018-03-13T18:46:26.133Z: read okay, write partial_outage (8/9 ok since 2018-03-13T18:46:25.133Z)
2018-03-13T18:46:28.139Z: all okay (20/21 ok since 2018-03-13T18:46:26.133Z)
2018-03-13T18:46:29.141Z: read partial_outage, write okay (9/10 ok since 2018-03-13T18:46:28.139Z)
2018-03-13T18:46:32.152Z: all okay (30/32 ok since 2018-03-13T18:46:29.141Z)
2018-03-13T18:46:34.155Z: all partial_outage (24/27 ok since 2018-03-13T18:46:32.152Z)
2018-03-13T18:46:35.157Z: read okay, write partial_outage (8/10 ok since 2018-03-13T18:46:34.155Z)
2018-03-13T18:46:37.160Z: all partial_outage (19/23 ok since 2018-03-13T18:46:35.157Z)
2018-03-13T18:46:38.162Z: all okay (9/9 ok since 2018-03-13T18:46:37.160Z)
2018-03-13T18:46:39.163Z: read okay, write partial_outage (9/10 ok since 2018-03-13T18:46:38.162Z)
2018-03-13T18:46:40.165Z: read partial_outage, write okay (9/11 ok since 2018-03-13T18:46:39.163Z)
2018-03-13T18:46:41.166Z: read okay, write partial_outage (8/10 ok since 2018-03-13T18:46:40.165Z)
2018-03-13T18:46:42.167Z: all partial_outage (9/12 ok since 2018-03-13T18:46:41.166Z)
2018-03-13T18:46:43.168Z: all okay (9/9 ok since 2018-03-13T18:46:42.167Z)
2018-03-13T18:46:44.170Z: all partial_outage (10/13 ok since 2018-03-13T18:46:43.168Z)
2018-03-13T18:46:45.173Z: read partial_outage, write okay (9/11 ok since 2018-03-13T18:46:44.170Z)
2018-03-13T18:46:46.174Z: all okay (17/17 ok since 2018-03-13T18:46:45.173Z)
2018-03-13T18:46:47.177Z: all partial_outage (9/12 ok since 2018-03-13T18:46:46.174Z)
2018-03-13T18:46:48.181Z: all okay (10/10 ok since 2018-03-13T18:46:47.177Z)
2018-03-13T18:46:49.184Z: read okay, write partial_outage (11/12 ok since 2018-03-13T18:46:48.181Z)

During that same period, using a build of the repo from quite a while ago, and the same node binary, I saw no issues:

$ ./bin/mlive -s 2 -S 15
will stomp over directory at "/dap/stor/mlive"
will test for liveness: reads, writes
assuming 2 metadata shards
using base paths: /dap/stor/mlive
time between requests: 50 ms
maximum outstanding requests: 100
environment:
    MANTA_USER = dap
    MANTA_KEY_ID = 56:f3:e1:56:3d:e6:f7:83:a9:ce:19:5d:62:ba:5c:1f
    MANTA_URL = https://172.26.5.11
    MANTA_TLS_INSECURE = 1
skipping directory tree setup
2018-03-13T18:44:11.117Z: reads okay, writes okay (7/7 ok since start)
2018-03-13T18:44:26.149Z: all okay (291/291 ok since 2018-03-13T18:44:11.117Z)
2018-03-13T18:44:41.177Z: all okay (287/287 ok since 2018-03-13T18:44:26.149Z)
2018-03-13T18:44:56.211Z: all okay (289/289 ok since 2018-03-13T18:44:41.177Z)
2018-03-13T18:45:11.234Z: all okay (290/290 ok since 2018-03-13T18:44:56.211Z)
2018-03-13T18:45:26.249Z: all okay (288/288 ok since 2018-03-13T18:45:11.234Z)
2018-03-13T18:45:41.286Z: all okay (290/290 ok since 2018-03-13T18:45:26.249Z)
2018-03-13T18:45:56.315Z: all okay (288/288 ok since 2018-03-13T18:45:41.286Z)
2018-03-13T18:46:11.341Z: all okay (289/289 ok since 2018-03-13T18:45:56.315Z)
2018-03-13T18:46:26.378Z: all okay (289/289 ok since 2018-03-13T18:46:11.341Z)
2018-03-13T18:46:41.400Z: all okay (284/284 ok since 2018-03-13T18:46:26.378Z)
2018-03-13T18:46:56.427Z: all okay (289/289 ok since 2018-03-13T18:46:41.400Z)
2018-03-13T18:47:11.460Z: all okay (288/288 ok since 2018-03-13T18:46:56.427Z)
2018-03-13T18:47:26.487Z: all okay (286/286 ok since 2018-03-13T18:47:11.460Z)

Running with a more verbose log level, we see that the failures look like this:

[2018-03-13T19:55:29.571Z] TRACE: mlive/MantaClient/72625 on zathras.local: Request failed
    Error: socket hang up
        at createHangUpError (http.js:1535:15)
        at CleartextStream.socketCloseListener (http.js:1585:23)
        at CleartextStream.emit (events.js:117:20)
        at tls.js:743:10
        at process._tickCallback (node.js:458:13)

A similar problem is also reproducible on SmartOS with Node 0.10, but not with Node v4. And although we have recently upgraded load balancer images in "master", this problem appears to affect load balancer versions as old as September of last year, which predates those changes. That suggests a client-side problem -- maybe an updated dependency? node-manta is the only dependency that seems likely to affect this behavior.

The old build that shows no problems has these dependencies:

$ npm ls
mlive@1.0.0 /Users/dap/work/tools/manta-mlive
├── assert-plus@0.1.5 invalid
├─┬ bunyan@1.2.3 invalid
│ ├─┬ dtrace-provider@0.3.1
│ │ └── nan@1.3.0
│ └─┬ mv@2.0.3
│   ├─┬ mkdirp@0.5.0
│   │ └── minimist@0.0.8
│   ├── ncp@0.6.0
│   └── rimraf@2.2.8
├── extsprintf@1.2.1 invalid
├─┬ manta@1.4.6 invalid
│ ├── backoff@2.3.0
│ ├─┬ bunyan@0.22.1
│ │ ├── dtrace-provider@0.2.8
│ │ └── mv@0.0.5
│ ├── clone@0.1.11
│ ├─┬ cmdln@1.3.1
│ │ └── assert-plus@0.1.3
│ ├─┬ dashdash@1.3.2
│ │ └── assert-plus@0.1.2
│ ├── extsprintf@1.0.2
│ ├── hogan.js@2.0.0
│ ├─┬ http-signature@0.10.0
│ │ ├── asn1@0.1.11
│ │ ├── assert-plus@0.1.2
│ │ └── ctype@0.5.2
│ ├─┬ jsprim@0.5.0
│ │ └── json-schema@0.2.2
│ ├── lstream@0.0.3
│ ├── mime@1.2.11
│ ├── moment@2.4.0
│ ├── node-uuid@1.4.1
│ ├── once@1.3.0
│ ├── path-platform@0.0.1
│ ├─┬ progbar@1.0.0
│ │ ├─┬ readable-stream@1.0.33
│ │ │ ├── core-util-is@1.0.1
│ │ │ ├── inherits@2.0.1
│ │ │ ├── isarray@0.0.1
│ │ │ └── string_decoder@0.10.31
│ │ └── sprintf@0.1.5
│ ├─┬ readable-stream@1.1.9
│ │ ├── core-util-is@1.0.1
│ │ └── debuglog@0.0.2
│ ├─┬ restify@2.8.1
│ │ ├── csv@0.3.7
│ │ ├── deep-equal@0.0.0
│ │ ├── dtrace-provider@0.2.8
│ │ ├── escape-regexp-component@1.0.2
│ │ ├── formidable@1.0.14
│ │ ├── keep-alive-agent@0.0.1
│ │ ├── lru-cache@2.3.1
│ │ ├── negotiator@0.3.0
│ │ ├── qs@0.6.6
│ │ ├── semver@2.2.1
│ │ ├── spdy@1.19.3
│ │ ├── tunnel-agent@0.4.0
│ │ └── verror@1.3.7
│ ├── showdown@0.3.1
│ ├─┬ ssh-agent@0.2.2
│ │ ├── ctype@0.5.0
│ │ └── posix-getopt@1.0.0
│ ├─┬ tar@0.1.18
│ │ ├── block-stream@0.0.7
│ │ ├─┬ fstream@0.1.31
│ │ │ ├── graceful-fs@3.0.5
│ │ │ ├─┬ mkdirp@0.5.0
│ │ │ │ └── minimist@0.0.8
│ │ │ └── rimraf@2.2.8
│ │ └── inherits@2.0.1
│ ├─┬ vasync@1.4.0
│ │ ├─┬ jsprim@0.3.0
│ │ │ ├── extsprintf@1.0.0
│ │ │ ├── json-schema@0.2.2
│ │ │ └── verror@1.3.3
│ │ └─┬ verror@1.1.0
│ │   └── extsprintf@1.0.0
│ ├── verror@1.3.6
│ └─┬ watershed@0.3.0
│   ├── dtrace-provider@0.2.8
│   └── readable-stream@1.0.2
├── posix-getopt@1.1.0 invalid
├─┬ vasync@1.6.2 invalid
│ └─┬ verror@1.1.0
│   └── extsprintf@1.0.0
└─┬ verror@1.6.0
  └── extsprintf@1.2.0

npm ERR! invalid: assert-plus@0.1.5 /Users/dap/work/tools/manta-mlive/node_modules/assert-plus
npm ERR! invalid: bunyan@1.2.3 /Users/dap/work/tools/manta-mlive/node_modules/bunyan
npm ERR! invalid: extsprintf@1.2.1 /Users/dap/work/tools/manta-mlive/node_modules/extsprintf
npm ERR! invalid: manta@1.4.6 /Users/dap/work/tools/manta-mlive/node_modules/manta
npm ERR! invalid: posix-getopt@1.1.0 /Users/dap/work/tools/manta-mlive/node_modules/posix-getopt
npm ERR! invalid: vasync@1.6.2 /Users/dap/work/tools/manta-mlive/node_modules/vasync
npm ERR! not ok code 0

It shows a bunch of "invalid" errors because I've apparently pulled down updates without rebuilding, so my packages technically no longer satisfy package.json. My fresh clone looks like this:

$ npm ls
mlive@1.0.0 /Users/dap/work/manta-mlive-latest
├── assert-plus@1.0.0
├─┬ bunyan@1.8.12
│ ├─┬ dtrace-provider@0.8.6
│ │ └── nan@2.9.2
│ ├── moment@2.21.0
│ ├─┬ mv@2.1.1
│ │ ├─┬ mkdirp@0.5.1
│ │ │ └── minimist@0.0.8
│ │ ├── ncp@2.0.0
│ │ └─┬ rimraf@2.4.5
│ │   └─┬ glob@6.0.4
│ │     ├─┬ inflight@1.0.6
│ │     │ └── wrappy@1.0.2
│ │     ├── inherits@2.0.3
│ │     ├─┬ minimatch@3.0.4
│ │     │ └─┬ brace-expansion@1.1.11
│ │     │   ├── balanced-match@1.0.0
│ │     │   └── concat-map@0.0.1
│ │     ├─┬ once@1.4.0
│ │     │ └── wrappy@1.0.2
│ │     └── path-is-absolute@1.0.1
│ └── safe-json-stringify@1.1.0
├── extsprintf@1.4.0
├─┬ manta@4.5.0
│ ├── backoff@2.3.0
│ ├── clone@0.1.19
│ ├─┬ cmdln@4.1.2
│ │ └── fuzzyset.js@0.0.1
│ ├── dashdash@1.14.1
│ ├── hogan.js@2.0.0
│ ├─┬ jsprim@1.4.1
│ │ ├── extsprintf@1.3.0
│ │ └── json-schema@0.2.3
│ ├─┬ lomstream@1.1.0
│ │ ├── assert-plus@0.1.5
│ │ ├── extsprintf@1.3.0
│ │ └─┬ vstream@0.1.0
│ │   └── extsprintf@1.2.0
│ ├── lstream@0.0.4
│ ├── mime@1.2.11
│ ├── moment@2.13.0
│ ├─┬ once@1.3.3
│ │ └── wrappy@1.0.2
│ ├── path-platform@0.0.1
│ ├─┬ progbar@1.2.0
│ │ ├── assert-plus@0.1.5
│ │ ├─┬ readable-stream@1.0.34
│ │ │ ├── core-util-is@1.0.2
│ │ │ ├── inherits@2.0.3
│ │ │ ├── isarray@0.0.1
│ │ │ └── string_decoder@0.10.31
│ │ └── sprintf@0.1.5
│ ├─┬ readable-stream@1.1.14
│ │ ├── core-util-is@1.0.2
│ │ ├── inherits@2.0.3
│ │ ├── isarray@0.0.1
│ │ └── string_decoder@0.10.31
│ ├─┬ restify-clients@1.4.1
│ │ ├─┬ backoff@2.5.0
│ │ │ └── precond@0.2.3
│ │ ├─┬ dtrace-provider@0.7.1
│ │ │ └── nan@2.9.2
│ │ ├── keep-alive-agent@0.0.1
│ │ ├── lodash@4.17.5
│ │ ├─┬ lru-cache@4.1.2
│ │ │ ├── pseudomap@1.0.2
│ │ │ └── yallist@2.1.2
│ │ ├── mime@1.6.0
│ │ ├── node-uuid@1.4.8
│ │ ├─┬ restify-errors@3.1.0
│ │ │ ├── assert-plus@0.2.0
│ │ │ └── lodash@3.10.1
│ │ ├── semver@5.5.0
│ │ └── tunnel-agent@0.4.3
│ ├─┬ showdown@1.4.4
│ │ └─┬ yargs@3.32.0
│ │   ├── camelcase@2.1.1
│ │   ├─┬ cliui@3.2.0
│ │   │ ├─┬ strip-ansi@3.0.1
│ │   │ │ └── ansi-regex@2.1.1
│ │   │ └── wrap-ansi@2.1.0
│ │   ├── decamelize@1.2.0
│ │   ├─┬ os-locale@1.4.0
│ │   │ └─┬ lcid@1.0.0
│ │   │   └── invert-kv@1.0.0
│ │   ├─┬ string-width@1.0.2
│ │   │ ├── code-point-at@1.1.0
│ │   │ ├─┬ is-fullwidth-code-point@1.0.0
│ │   │ │ └── number-is-nan@1.0.1
│ │   │ └─┬ strip-ansi@3.0.1
│ │   │   └── ansi-regex@2.1.1
│ │   ├── window-size@0.1.4
│ │   └── y18n@3.2.1
│ ├─┬ smartdc-auth@2.5.6
│ │ ├── clone@0.1.5
│ │ ├─┬ dashdash@1.10.1
│ │ │ └── assert-plus@0.1.5
│ │ ├── http-signature@1.2.0
│ │ ├── once@1.3.0
│ │ ├─┬ sshpk@1.14.1
│ │ │ ├── asn1@0.2.3
│ │ │ ├── bcrypt-pbkdf@1.0.1
│ │ │ ├── dashdash@1.14.1
│ │ │ ├── ecc-jsbn@0.1.1
│ │ │ ├── getpass@0.1.7
│ │ │ ├── jsbn@0.1.1
│ │ │ └── tweetnacl@0.14.5
│ │ ├─┬ sshpk-agent@1.6.1
│ │ │ ├─┬ mooremachine@2.2.0
│ │ │ │ ├── assert-plus@0.2.0
│ │ │ │ └─┬ dtrace-provider@0.8.6
│ │ │ │   └── nan@2.9.2
│ │ │ ├─┬ readable-stream@2.3.5
│ │ │ │ ├── core-util-is@1.0.2
│ │ │ │ ├── inherits@2.0.3
│ │ │ │ ├── isarray@1.0.0
│ │ │ │ ├── process-nextick-args@2.0.0
│ │ │ │ ├── safe-buffer@5.1.1
│ │ │ │ ├── string_decoder@1.0.3
│ │ │ │ └── util-deprecate@1.0.2
│ │ │ └─┬ sshpk@1.13.2
│ │ │   ├── asn1@0.2.3
│ │ │   ├── bcrypt-pbkdf@1.0.1
│ │ │   ├── dashdash@1.14.1
│ │ │   ├── ecc-jsbn@0.1.1
│ │ │   ├── getpass@0.1.7
│ │ │   ├── jsbn@0.1.1
│ │ │   └── tweetnacl@0.14.5
│ │ └─┬ vasync@1.4.3
│ │   ├─┬ jsprim@0.3.0
│ │   │ ├── extsprintf@1.0.0
│ │   │ ├── json-schema@0.2.2
│ │   │ └── verror@1.3.3
│ │   └─┬ verror@1.1.0
│ │     └── extsprintf@1.0.0
│ ├── strsplit@1.0.0
│ ├─┬ tar@2.2.1
│ │ ├── block-stream@0.0.9
│ │ ├─┬ fstream@1.0.11
│ │ │ ├── graceful-fs@4.1.11
│ │ │ ├─┬ mkdirp@0.5.1
│ │ │ │ └── minimist@0.0.8
│ │ │ └─┬ rimraf@2.6.2
│ │ │   └─┬ glob@7.1.2
│ │ │     ├── fs.realpath@1.0.0
│ │ │     ├─┬ inflight@1.0.6
│ │ │     │ └── wrappy@1.0.2
│ │ │     ├─┬ minimatch@3.0.4
│ │ │     │ └─┬ brace-expansion@1.1.11
│ │ │     │   ├── balanced-match@1.0.0
│ │ │     │   └── concat-map@0.0.1
│ │ │     └── path-is-absolute@1.0.1
│ │ └── inherits@2.0.3
│ ├── uuid@2.0.3
│ └─┬ watershed@0.3.4
│   ├─┬ dtrace-provider@0.8.6
│   │ └── nan@2.9.2
│   └── readable-stream@1.0.2
├── posix-getopt@1.2.0
├─┬ vasync@1.6.4
│ └─┬ verror@1.6.0
│   └── extsprintf@1.2.0
└─┬ verror@1.10.0
  └── core-util-is@1.0.2

I spent some time trying different node-manta versions, and I found that the problem was reliably reproducible with node-manta 3.0.0, but not the previous version (2.0.7).

davepacheco commented 6 years ago

In a brief run, here's the distribution of haproxy status codes and termination states found:

$ pbpaste | awk '{print $9, $13 }' | sort | uniq -c
   1 -1 CH--
  68 200 ----
  57 204 ----

That's 68+57 successful GETs and PUTs (200 and 204 status codes) with normal terminations, plus one problematic entry. Note, though, that this run (which was directed at a single load balancer) saw far more than one error on the client side:

$ LOG_LEVEL=trace ./bin/mlive -s 2 -S 15 2>log.out
2018-03-13T20:04:01.679Z: reads okay, writes okay (6/6 ok since start)
2018-03-13T20:04:16.710Z: all okay (155/155 ok since 2018-03-13T20:04:01.679Z)
2018-03-13T20:04:26.724Z: all partial_outage (98/105 ok since 2018-03-13T20:04:16.710Z)
2018-03-13T20:04:27.726Z: all okay (6/6 ok since 2018-03-13T20:04:26.724Z)
2018-03-13T20:04:28.728Z: read partial_outage, write okay (10/11 ok since 2018-03-13T20:04:27.726Z)
2018-03-13T20:04:29.729Z: all okay (8/8 ok since 2018-03-13T20:04:28.728Z)
2018-03-13T20:04:30.731Z: all partial_outage (9/11 ok since 2018-03-13T20:04:29.729Z)
2018-03-13T20:04:31.736Z: read okay, write partial_outage (7/8 ok since 2018-03-13T20:04:30.731Z)
2018-03-13T20:04:34.739Z: all okay (28/31 ok since 2018-03-13T20:04:31.736Z)
2018-03-13T20:04:35.741Z: read partial_outage, write okay (7/10 ok since 2018-03-13T20:04:34.739Z)
2018-03-13T20:04:36.746Z: read okay, write partial_outage (11/13 ok since 2018-03-13T20:04:35.741Z)

$ grep -c 'socket hang up' log.out
319

Unfortunately, the interesting haproxy log entry can't be parsed by haplog due to joyent/node-haproxy-log#1, but here's the full entry:

2018-03-13T20:04:37+00:00 710f85a9-6f13-4fcf-8f3f-a03c4a508714 haproxy[657137]: ::ffff:172.24.2.24:51143 [13/Mar/2018:20:04:36.746] https secure_api/be0 15/0/1/-1/392 -1 0 - - CH-- 20/4/19/5/0 0/0 "PUT /dap/stor/mlive/mlive_10/obj HTTP/1.1"

The haproxy documentation explains that termination state CH-- means:

 CH   The client aborted while waiting for the server to start responding.
      It might be the server taking too long to respond or the client
      clicking the 'Stop' button too fast.

and specifically, on the two separate characters:

    C : the TCP session was unexpectedly aborted by the client.
    ...
    H : the proxy was waiting for complete, valid response HEADERS from the
        server (HTTP only).

That's certainly strange -- haproxy was waiting on the server, but the client aborted. And we can see that the error happened at 20:04:37, while the TCP connection was only accepted at 20:04:36.746, so there would be no reason for a timeout.

davepacheco commented 6 years ago

From the backend name "be0" in the haproxy log entry, I found the corresponding muskie log. Here's a summary of messages emitted by muskie for that run:

[root@380920d9 (webapi) ~]$ grep 172.24.2.24 /var/log/muskie.log | json -ga msg | sort | uniq -c 
     86 100-continue sent
      1 authorizing...
     81 handled: 200
     59 handled: 204
     27 handled: 408

Of note: there were 86 100-continue responses sent, which matches the 59 204s plus 27 408s (59 + 27 = 86). That accounts for all of the uploads. What about the timing of these?

[root@380920d9 (webapi) ~]$ grep 172.24.2.24 /var/log/muskie.log | grep -v 'handled: 200' | json -ga time msg
2018-03-13T20:04:00.906Z 100-continue sent
2018-03-13T20:04:01.264Z handled: 204
2018-03-13T20:04:01.339Z 100-continue sent
2018-03-13T20:04:01.614Z handled: 204
2018-03-13T20:04:02.098Z 100-continue sent
2018-03-13T20:04:02.424Z 100-continue sent
2018-03-13T20:04:02.484Z handled: 204
2018-03-13T20:04:02.744Z handled: 204
2018-03-13T20:04:02.862Z 100-continue sent
2018-03-13T20:04:03.176Z handled: 204
2018-03-13T20:04:03.424Z 100-continue sent
2018-03-13T20:04:03.698Z handled: 204
2018-03-13T20:04:04.082Z 100-continue sent
2018-03-13T20:04:04.406Z handled: 204
2018-03-13T20:04:04.828Z 100-continue sent
2018-03-13T20:04:05.081Z 100-continue sent
2018-03-13T20:04:05.136Z handled: 204
2018-03-13T20:04:05.368Z handled: 204
2018-03-13T20:04:05.532Z 100-continue sent
2018-03-13T20:04:05.766Z handled: 204
2018-03-13T20:04:06.150Z 100-continue sent
2018-03-13T20:04:06.426Z handled: 204
2018-03-13T20:04:06.852Z 100-continue sent
2018-03-13T20:04:07.060Z 100-continue sent
2018-03-13T20:04:07.104Z handled: 204
2018-03-13T20:04:07.438Z handled: 204
2018-03-13T20:04:07.914Z 100-continue sent
2018-03-13T20:04:07.930Z 100-continue sent
2018-03-13T20:04:08.234Z handled: 204
2018-03-13T20:04:08.256Z handled: 204
2018-03-13T20:04:08.692Z 100-continue sent
2018-03-13T20:04:08.936Z handled: 204
2018-03-13T20:04:09.209Z 100-continue sent
2018-03-13T20:04:09.526Z handled: 204
2018-03-13T20:04:10.065Z 100-continue sent
2018-03-13T20:04:10.424Z handled: 204
2018-03-13T20:04:10.522Z 100-continue sent
2018-03-13T20:04:10.844Z handled: 204
2018-03-13T20:04:11.566Z 100-continue sent
2018-03-13T20:04:11.658Z 100-continue sent
2018-03-13T20:04:11.708Z 100-continue sent
2018-03-13T20:04:12.076Z handled: 204
2018-03-13T20:04:12.705Z 100-continue sent
2018-03-13T20:04:13.057Z 100-continue sent
2018-03-13T20:04:13.446Z handled: 204
2018-03-13T20:04:13.605Z 100-continue sent
2018-03-13T20:04:13.994Z 100-continue sent
2018-03-13T20:04:14.170Z 100-continue sent
2018-03-13T20:04:14.255Z handled: 204
2018-03-13T20:04:14.438Z 100-continue sent
2018-03-13T20:04:14.609Z 100-continue sent
2018-03-13T20:04:14.807Z handled: 204
2018-03-13T20:04:14.886Z handled: 204
2018-03-13T20:04:15.216Z 100-continue sent
2018-03-13T20:04:15.566Z handled: 204
2018-03-13T20:04:15.932Z 100-continue sent
2018-03-13T20:04:16.364Z handled: 204
2018-03-13T20:04:16.508Z 100-continue sent
2018-03-13T20:04:16.886Z 100-continue sent
2018-03-13T20:04:17.232Z 100-continue sent
2018-03-13T20:04:17.306Z handled: 204
2018-03-13T20:04:17.555Z 100-continue sent
2018-03-13T20:04:17.650Z 100-continue sent
2018-03-13T20:04:17.886Z handled: 204
2018-03-13T20:04:18.056Z handled: 204
2018-03-13T20:04:18.365Z 100-continue sent
2018-03-13T20:04:19.346Z 100-continue sent
2018-03-13T20:04:19.656Z 100-continue sent
2018-03-13T20:04:19.801Z 100-continue sent
2018-03-13T20:04:20.097Z 100-continue sent
2018-03-13T20:04:20.111Z handled: 204
2018-03-13T20:04:20.553Z 100-continue sent
2018-03-13T20:04:20.791Z 100-continue sent
2018-03-13T20:04:20.876Z handled: 204
2018-03-13T20:04:21.114Z 100-continue sent
2018-03-13T20:04:21.487Z handled: 204
2018-03-13T20:04:21.791Z 100-continue sent
2018-03-13T20:04:21.869Z 100-continue sent
2018-03-13T20:04:22.192Z 100-continue sent
2018-03-13T20:04:22.237Z handled: 204
2018-03-13T20:04:22.523Z handled: 204
2018-03-13T20:04:22.543Z 100-continue sent
2018-03-13T20:04:22.942Z 100-continue sent
2018-03-13T20:04:23.344Z handled: 204
2018-03-13T20:04:23.484Z 100-continue sent
2018-03-13T20:04:23.767Z handled: 204
2018-03-13T20:04:23.932Z 100-continue sent
2018-03-13T20:04:24.176Z handled: 204
2018-03-13T20:04:24.251Z 100-continue sent
2018-03-13T20:04:24.644Z 100-continue sent
2018-03-13T20:04:24.702Z 100-continue sent
2018-03-13T20:04:24.887Z handled: 204
2018-03-13T20:04:24.986Z handled: 204
2018-03-13T20:04:25.385Z 100-continue sent
2018-03-13T20:04:25.392Z 100-continue sent
2018-03-13T20:04:25.724Z handled: 204
2018-03-13T20:04:25.807Z handled: 204
2018-03-13T20:04:26.173Z 100-continue sent
2018-03-13T20:04:26.281Z 100-continue sent
2018-03-13T20:04:26.432Z 100-continue sent
2018-03-13T20:04:26.578Z handled: 204
2018-03-13T20:04:26.797Z handled: 204
2018-03-13T20:04:27.007Z 100-continue sent
2018-03-13T20:04:27.327Z handled: 204
2018-03-13T20:04:27.767Z 100-continue sent
2018-03-13T20:04:28.227Z 100-continue sent
2018-03-13T20:04:28.257Z 100-continue sent
2018-03-13T20:04:28.586Z handled: 204
2018-03-13T20:04:28.598Z handled: 204
2018-03-13T20:04:29.021Z 100-continue sent
2018-03-13T20:04:29.022Z 100-continue sent
2018-03-13T20:04:29.273Z 100-continue sent
2018-03-13T20:04:29.398Z handled: 204
2018-03-13T20:04:29.546Z handled: 204
2018-03-13T20:04:29.818Z 100-continue sent
2018-03-13T20:04:30.221Z 100-continue sent
2018-03-13T20:04:30.377Z 100-continue sent
2018-03-13T20:04:30.500Z 100-continue sent
2018-03-13T20:04:30.704Z handled: 204
2018-03-13T20:04:30.874Z handled: 204
2018-03-13T20:04:31.263Z 100-continue sent
2018-03-13T20:04:31.264Z 100-continue sent
2018-03-13T20:04:31.586Z handled: 204
2018-03-13T20:04:31.877Z 100-continue sent
2018-03-13T20:04:32.112Z 100-continue sent
2018-03-13T20:04:32.245Z 100-continue sent
2018-03-13T20:04:32.427Z handled: 204
2018-03-13T20:04:32.596Z handled: 204
2018-03-13T20:04:33.329Z 100-continue sent
2018-03-13T20:04:33.387Z 100-continue sent
2018-03-13T20:04:33.567Z 100-continue sent
2018-03-13T20:04:33.757Z handled: 204
2018-03-13T20:04:33.866Z handled: 204
2018-03-13T20:04:34.194Z 100-continue sent
2018-03-13T20:04:34.251Z 100-continue sent
2018-03-13T20:04:34.467Z handled: 204
2018-03-13T20:04:34.546Z handled: 204
2018-03-13T20:04:35.216Z 100-continue sent
2018-03-13T20:04:35.505Z handled: 204
2018-03-13T20:04:35.603Z 100-continue sent
2018-03-13T20:04:35.748Z 100-continue sent
2018-03-13T20:04:36.206Z handled: 204
2018-03-13T20:04:36.591Z 100-continue sent
2018-03-13T20:04:37.006Z 100-continue sent
2018-03-13T20:04:37.003Z authorizing...
2018-03-13T20:04:37.228Z 100-continue sent
2018-03-13T20:04:56.569Z handled: 408
2018-03-13T20:04:56.664Z handled: 408
2018-03-13T20:04:57.707Z handled: 408
2018-03-13T20:04:58.610Z handled: 408
2018-03-13T20:04:59.172Z handled: 408
2018-03-13T20:05:01.510Z handled: 408
2018-03-13T20:05:02.236Z handled: 408
2018-03-13T20:05:03.368Z handled: 408
2018-03-13T20:05:04.351Z handled: 408
2018-03-13T20:05:04.658Z handled: 408
2018-03-13T20:05:05.101Z handled: 408
2018-03-13T20:05:05.793Z handled: 408
2018-03-13T20:05:06.796Z handled: 408
2018-03-13T20:05:07.547Z handled: 408
2018-03-13T20:05:09.255Z handled: 408
2018-03-13T20:05:11.175Z handled: 408
2018-03-13T20:05:12.772Z handled: 408
2018-03-13T20:05:14.026Z handled: 408
2018-03-13T20:05:14.819Z handled: 408
2018-03-13T20:05:15.223Z handled: 408
2018-03-13T20:05:16.268Z handled: 408
2018-03-13T20:05:16.881Z handled: 408
2018-03-13T20:05:18.333Z handled: 408
2018-03-13T20:05:20.608Z handled: 408
2018-03-13T20:05:21.592Z handled: 408
2018-03-13T20:05:22.010Z handled: 408
2018-03-13T20:05:22.233Z handled: 408

On the client side, "socket hang-up" errors begin at 20:04:10.158Z and continue until 20:04:37.035Z, which is just shy of when I killed the client, at a rate of about 10 errors per second. In retrospect, the "CH--" termination state may be a red herring: that entry corresponds to the time I killed the client, and we had plenty of client errors before that point.

While digging through the client log, I also noticed errors like this one:

[2018-03-13T20:04:25.797Z] DEBUG: mlive/MantaClient/73073 on zathras.local: put: error (req_id=1089a055-df9c-42fa-857f-1b233db7c130, path=/dap/stor/mlive/mlive_22/obj)
    ConnectTimeoutError: connect timeout after 4000ms
        at connectTimeout [as _onTimeout] (/Users/dap/work/manta-mlive-testing/node_modules/manta/node_modules/restify-clients/lib/HttpClient.js:142:23)
        at Timer.listOnTimeout [as ontimeout] (timers.js:121:15)

Here are the client messages related to that request:

$ grep 1089a055-df9c-42fa-857f-1b233db7c130 log.out  | json -ga time msg
2018-03-13T20:04:06.286Z put: entered
2018-03-13T20:04:06.288Z request sent
2018-03-13T20:04:10.792Z request sent
2018-03-13T20:04:15.793Z request sent
2018-03-13T20:04:21.795Z request sent
2018-03-13T20:04:25.797Z put: error

Only one muskie instance reports anything related to this request id, and it's an UploadTimeoutError at 20:04:59.883Z:

[2018-03-13T20:04:59.883Z]  INFO: muskie/HttpServer/136473 on 204ac483-7e7e-4083-9ea2-c9ea22f459fd (/opt/smartdc/muskie/lib/audit.js:300 in audit): handled: 408 (audit=true, _audit=true, operation=putobject, billable_operation=PUT, bytesTransferred=53687091200, logicalRemoteAddress=172.24.2.24, reqHeaderLength=788, resHeaderLength=337, latency=45374, prevObjectId=0114b2ff-5ab6-48da-a5c8-afc1ce6674a6, objectId=2c2317a6-cf8e-4c32-dce8-f1b4f4f20def, entryShard=tcp://2.moray.staging.joyent.us:2020, parentShard=tcp://2.moray.staging.joyent.us:2020, route=putobject, latencyToFirstByte=374, req.owner=bc8cd146-fecb-11e1-bd8a-bb6f54b49808)
    PUT /dap/stor/mlive/mlive_22/obj HTTP/1.1
    accept: application/json
    content-type: application/x-tgif
    expect: 100-continue
    x-request-id: 1089a055-df9c-42fa-857f-1b233db7c130
    transfer-encoding: chunked
    date: Tue, 13 Mar 2018 20:04:06 GMT
    authorization: Signature keyId="/dap/keys/56:f3:e1:56:3d:e6:f7:83:a9:ce:19:5d:62:ba:5c:1f",algorithm="rsa-sha1",headers="date",signature="XrUUvC6bmhCzt1RPzdLLb08McrJtffiDCIHO7ArDQhd7KIAVxXt+NrXzAThPkH6mPm1TRSaQSfWZP+oQ0aHgJ1rGDMqPL2tUCVuGhWRyBvlAJYW5RoWFftT/SGFoEj8uEiYO84SyGOiNuW5dv0kNR3Aao15DGshQ3T0Sl56gs4v7Bxl0B+pzRBaKdkNZL0JWM1rsBcIW73rghHDv1hG/WXLBB0yFZX1qzWcE3eJs3o4JkPTYv9GDgBCd/us4l9yxNW+97Yobq1t7OFLI9Yu+mLbAiG0U4X9MWx+ogFxKgntOaQ5ZbUA43tM6JebqyikqoiQOFm+fhSq5zI1jQZFBlw=="
    user-agent: restify/1.3.3 (x64-darwin; v8/3.14.5.11; OpenSSL/1.0.1u) node/0.10.48
    accept-version: ~1.0
    host: 172.26.5.11
    connection: keep-alive
    --
    HTTP/1.1 408 Request Time-out
    etag: 0114b2ff-5ab6-48da-a5c8-afc1ce6674a6
    last-modified: Tue, 13 Mar 2018 20:04:05 GMT
    connection: close
    content-type: application/json
    content-length: 71
    content-md5: ncgxc5d7/Q5/2NeDgNTTDQ==
    date: Tue, 13 Mar 2018 20:04:59 GMT
    server: Manta
    x-request-id: 1089a055-df9c-42fa-857f-1b233db7c130
    x-response-time: 45374
    x-server-name: 204ac483-7e7e-4083-9ea2-c9ea22f459fd
    --
    UploadTimeoutError: request took too long to send data
        at CheckStream.<anonymous> (/opt/smartdc/muskie/lib/obj.js:572:18)
        at CheckStream.g (events.js:180:16)
        at CheckStream.emit (events.js:92:17)
        at CheckStream.onTimeoutHandler (/opt/smartdc/muskie/lib/check_stream.js:47:10)
        at Timer.listOnTimeout [as ontimeout] (timers.js:121:15)
    --
    sharksContacted: [
      {
        "shark": "1.stor.staging.joyent.us",
        "result": null,
        "timeToFirstByte": 6,
        "timeTotal": null,
        "_startTime": 1520971454872
      },
      {
        "shark": "2.stor.staging.joyent.us",
        "result": null,
        "timeToFirstByte": 5,
        "timeTotal": null,
        "_startTime": 1520971454874
      }
    ]
    --
    req.caller: {
      "login": "dap",
      "uuid": "bc8cd146-fecb-11e1-bd8a-bb6f54b49808",
      "groups": [],
      "user": null
    }
    --
    req.timers: {
      "earlySetup": 35,
      "parseDate": 27,
      "parseQueryString": 27,
      "handler-3": 164,
      "checkIfPresigned": 5,
      "enforceSSL": 8,
      "ensureDependencies": 5,
      "_authSetup": 7,
      "preSignedUrl": 4,
      "checkAuthzScheme": 5,
      "parseAuthTokenHandler": 39,
      "signatureHandler": 133,
      "parseKeyId": 228,
      "loadCaller": 540,
      "verifySignature": 1253,
      "parseHttpAuthToken": 8,
      "loadOwner": 7135,
      "getActiveRoles": 55,
      "gatherContext": 33,
      "setup": 851,
      "getMetadata": 223092,
      "storageContext": 35,
      "authorize": 856,
      "checkIfMatch": 31,
      "checkIfNoneMatch": 12,
      "checkIfModified": 35,
      "checkIfUnmodified": 12,
      "ensureNotRoot": 28,
      "parseArguments": 1286,
      "ensureNotDirectory": 19,
      "ensureParent": 624,
      "enforceDirectoryCount": 124746,
      "findSharks": 1619,
      "startSharkStreams": 8429,
      "sharkStreams": 45003064
    }

The Date header appears to have been constructed for the first request attempt at 20:04:06, and latencyToFirstByte is only 374ms -- again, not enough to suggest a client timeout.

There are some other log entries in the muskie log for the same path that don't have request ids. (That seems like a bug.) It's hard to be sure they're for the same request, but one has latency=373, which suggests it is the same request. (If either side logged the TCP port used, that would help here, too.) On the other hand, it's 7 seconds later than it should be:

[2018-03-13T20:04:14.881Z]  INFO: muskie/136473 on 204ac483-7e7e-4083-9ea2-c9ea22f459fd (/opt/smartdc/muskie/lib/obj.js:598 in sharkStreams): 100-continue sent (method=PUT, path=/dap/stor/mlive/mlive_22/obj, remoteAddress=172.24.2.24, latency=373, audit_100=true)

So for this request, we have:

20:04:06.286Z client reports: put: entered
20:04:06.288Z client reports: request sent
20:04:10.792Z client reports: request sent
20:04:14.881Z server reports: 100-continue (possibly for the same request?)
20:04:15.793Z client reports: request sent
20:04:21.795Z client reports: request sent
20:04:25.797Z client reports: put: error (ConnectTimeoutError after 4s)
20:04:59.883Z server reports: 408 response (time out)

Things I still don't understand:

It's interesting to note that the server reports the final latency as about 45s, which is consistent with its having sent the 100-continue about 45 seconds earlier with a latency of only 373ms. That seems to suggest that the request was delayed over 7 seconds in reaching Muskie. Since we have no errors reported from more recent clients, maybe this delay is happening on the client side?

davepacheco commented 6 years ago

Client-side queueing reminded me of various maxSockets issues on Node v0.10, which led me to MANTA-2981. Quoting from my analysis there:

The root cause here looks to be that the underlying restify client is using only 5 sockets, but we're making requests with concurrency=25. These requests often take tens of seconds, so later ones get queued behind requests that are going to take many seconds, and they exceed the built-in timeout of 4 seconds for acquiring a socket.

The problem is that node-restify configures its maxSockets based on the Node HTTP agent's maxSockets -- and it does so at the top level, when restify's lib/http.js is first required. ... In Node 0.10, the default maxSockets is 5, so restify grabs that value for httpMaxSockets. Later, when you create a restify client without specifying an agent, it creates a keep-alive-agent with maxSockets = 5.
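
To make that concrete, here's a minimal, self-contained sketch (hypothetical local server, port, and numbers; not node-manta code) of what a 5-socket cap does to 25 concurrent slow requests on Node 0.10: only 5 connections are active at a time, and the rest wait in the agent's queue, so from the server's perspective they arrive late or not at all.

var http = require('http');

// A deliberately slow local server: it holds every request open for 10s.
var server = http.createServer(function (req, res) {
    setTimeout(function () {
        res.end('ok\n');
    }, 10000);
});

server.listen(8080, function () {
    var agent = new http.Agent();
    agent.maxSockets = 5;   // the Node 0.10 default that restify captures

    // 25 concurrent requests: only 5 connect right away; the other 20 sit
    // in the agent's queue until a socket frees up.  This is the mechanism
    // by which a client-side connect timeout can fire before the server
    // has even seen the request.
    for (var i = 0; i < 25; i++) {
        http.get({ port: 8080, path: '/' + i, agent: agent }, function (res) {
            res.resume();
        });
    }
});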

Now, node-manta's "lib/create_client.js" already does set maxSockets to a large value at the top-level, but well after having required the restifyClients module, which is too late. I moved the block that sets maxSockets to the top of the file, and mlive is running happily again on Node 0.10.

$ ./bin/mlive -s 2 -S 15 
will stomp over directory at "/dap/stor/mlive"
will test for liveness: reads, writes
assuming 2 metadata shards
using base paths: /dap/stor/mlive
time between requests: 50 ms
maximum outstanding requests: 100
environment:
    MANTA_USER = dap
    MANTA_KEY_ID = 56:f3:e1:56:3d:e6:f7:83:a9:ce:19:5d:62:ba:5c:1f
    MANTA_URL = https://172.26.5.11
    MANTA_TLS_INSECURE = 1
skipping directory tree setup
2018-03-13T20:58:31.590Z: reads okay, writes stuck (6/6 ok since start)
2018-03-13T20:58:32.595Z: all okay (23/23 ok since 2018-03-13T20:58:31.590Z)
2018-03-13T20:58:47.627Z: all okay (287/287 ok since 2018-03-13T20:58:32.595Z)
2018-03-13T20:59:02.655Z: all okay (285/285 ok since 2018-03-13T20:58:47.627Z)
2018-03-13T20:59:17.685Z: all okay (287/287 ok since 2018-03-13T20:59:02.655Z)
2018-03-13T20:59:32.728Z: all okay (284/284 ok since 2018-03-13T20:59:17.685Z)
2018-03-13T20:59:47.750Z: all okay (287/287 ok since 2018-03-13T20:59:32.728Z)
2018-03-13T21:00:02.773Z: all okay (285/285 ok since 2018-03-13T20:59:47.750Z)
2018-03-13T21:00:17.816Z: all okay (287/287 ok since 2018-03-13T21:00:02.773Z)
2018-03-13T21:00:32.843Z: all okay (286/286 ok since 2018-03-13T21:00:17.816Z)
2018-03-13T21:00:47.875Z: all okay (288/288 ok since 2018-03-13T21:00:32.843Z)
2018-03-13T21:01:02.901Z: all okay (287/287 ok since 2018-03-13T21:00:47.875Z)
2018-03-13T21:01:17.936Z: all okay (284/284 ok since 2018-03-13T21:01:02.901Z)
2018-03-13T21:01:32.959Z: all okay (288/288 ok since 2018-03-13T21:01:17.936Z)
2018-03-13T21:01:47.999Z: all okay (286/286 ok since 2018-03-13T21:01:32.959Z)
2018-03-13T21:02:03.025Z: all okay (287/287 ok since 2018-03-13T21:01:47.999Z)
2018-03-13T21:02:18.045Z: all okay (286/286 ok since 2018-03-13T21:02:03.025Z)
2018-03-13T21:02:33.072Z: all okay (289/289 ok since 2018-03-13T21:02:18.045Z)

I believe this explains all of the data so far.

I also confirmed the fix by hand; the local change I made to node-manta was:

$ diff node_modules/manta/lib/create_client.js.old node_modules/manta/lib/create_client.js
8a9,13
> 
> 
> var MAX_SOCKETS = parseInt(process.env.MANTA_MAX_SOCKETS || 1000, 10);
> require('http').globalAgent.maxSockets = MAX_SOCKETS;
> require('https').globalAgent.maxSockets = MAX_SOCKETS;
375,378d379
< 
< var MAX_SOCKETS = parseInt(process.env.MANTA_MAX_SOCKETS || 1000, 10);
< require('http').globalAgent.maxSockets = MAX_SOCKETS;
< require('https').globalAgent.maxSockets = MAX_SOCKETS;

At this point, I think it's fair to say that this is a regression in node-manta that was inadvertently introduced in 3.0.0. Fortunately, the fix is pretty straightforward and should be low risk. I will shortly be moving all of this data into a node-manta issue, leaving this issue to cover updating the node-manta dependency to a version with the fix.

jclulow commented 6 years ago

@dap presumably this is the same thing as was fixed with lib/maxsockets.js in Mola?

davepacheco commented 6 years ago

It's the same root issue. What was nastier in this case is that node-manta integrated a fix for the same issue back in 2012 under MANTA-526. Unfortunately, it appears that a combination of restify's design (saving the value of maxSockets at require-time) and a reordering of require statements with 3.0.0 defeated that fix here.
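
For illustration, here's a minimal two-file sketch (hypothetical file names) of how a require-time snapshot defeats a later assignment, which is the pattern described above:

// snapshotter.js -- stands in for restify's lib/http.js, which reads
// http.globalAgent.maxSockets once, when the module is first loaded.
var http = require('http');
module.exports = { httpMaxSockets: http.globalAgent.maxSockets };

// app.js -- stands in for a consumer like node-manta's create_client.js
var snapshotter = require('./snapshotter');      // snapshot taken here (5 on Node 0.10)

require('http').globalAgent.maxSockets = 1000;   // too late: the snapshot is fixed
console.log(snapshotter.httpMaxSockets);         // still prints 5 on Node 0.10

// The assignment has to happen before the first require() of any module
// that takes the snapshot -- which is what moving the MAX_SOCKETS block to
// the top of create_client.js accomplishes.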