dhall-lang / dhall-haskell

Maintainable configuration files
https://dhall-lang.org/
BSD 3-Clause "New" or "Revised" License

Type-checking dhall-kubernetes-charts with an empty cache takes 25 minutes #1511

Open sjakobi opened 5 years ago

sjakobi commented 5 years ago
  1. Check out https://github.com/Gabriel439/dhall-kubernetes-charts/commit/33eb92e4d55224bb103c19b1fb4e56e16060b3d1
  2. $ rm -rf ~/.cache/dhall*
  3. $ time dhall type --file stable/jenkins/index.dhall --quiet
    75.84user 6.99system 24:59.22elapsed 5%CPU (0avgtext+0avgdata 235372maxresident)k
    30712inputs+48368outputs (131major+82643minor)pagefaults 0swaps

Note the huge difference between user time and elapsed time: more than 23 minutes! I'll try to figure out what the process is waiting for there.

sjakobi commented 5 years ago

Here are some syscall statistics:

$ strace -c -w dhall type --file stable/jenkins/index.dhall --quiet
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.44 1277.662792        3936    324641     93352 select
  0.40    5.304978           5   1008713           read
  0.37    4.909465           5   1015626           fstat
  0.37    4.797701         895      5358       498 poll
  0.32    4.180177           8    546307     21477 stat
  0.27    3.550268           7    512881           open
  0.21    2.703619           5    517960           close
  0.18    2.401575           5    514497    511256 ioctl
  0.14    1.896009           6    343727           rt_sigprocmask
  0.11    1.464082          14    106277     94401 rt_sigreturn
  0.07    0.933065           5    171865           clock_gettime
  0.04    0.529798          81      6532           getdents
  0.02    0.303407          10     30999      4950 recvfrom
  0.01    0.137467          12     11340           sendto
  0.01    0.121909           9     13734        15 access
  0.01    0.072200          33      2218           timer_settime
  0.00    0.043991          22      2033           write
  0.00    0.043851           9      5079           socket
  0.00    0.043813          13      3465      1638 connect
  0.00    0.040473          25      1620           getsockopt
  0.00    0.038667          24      1620           sendmmsg
  0.00    0.038470          19      2033           ftruncate
  0.00    0.033649           7      4860           recvmsg
  0.00    0.013780          20       685           madvise
  0.00    0.010869           7      1620           bind
  0.00    0.009235           5      1824           getsockname
  0.00    0.005717          18       321           mmap
  0.00    0.000584           6       100           setsockopt
  0.00    0.000210           8        28           mprotect
  0.00    0.000106         106         1           execve
  0.00    0.000087           4        20           lseek
  0.00    0.000085           5        18           getcwd
  0.00    0.000077          15         5           brk
  0.00    0.000061           4        14           rt_sigaction
  0.00    0.000059          15         4           munmap
  0.00    0.000059          30         2           mkdir
  0.00    0.000013           7         2           chmod
  0.00    0.000011           6         2           futex
  0.00    0.000009           9         1           timer_delete
  0.00    0.000008           4         2           getrlimit
  0.00    0.000006           6         1           sysinfo
  0.00    0.000006           6         1           timer_create
  0.00    0.000005           5         1           set_tid_address
  0.00    0.000004           4         1           arch_prctl
  0.00    0.000004           4         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 1311.292421               5158039    727587 total

That the vast majority of the time is spent in select suggests that the program is mostly blocked waiting for (network) input.

I'd like to figure out how the time spent waiting for requests to complete is distributed, and which requests take the longest time. Are there good tools to help me with that?

Gabriella439 commented 5 years ago

@sjakobi: You can set the https_proxy environment variable to forward all Dhall traffic through a locally running forward proxy that logs network requests

sjakobi commented 5 years ago

> You can set the https_proxy environment variable to forward all Dhall traffic through a locally running forward proxy that logs network requests

Cheers, I can try that! Any suggestions for a forward proxy for Linux?

Gabriella439 commented 5 years ago

@sjakobi: The two most common ones that I know of are squid and varnish but it's also not that hard to write a custom one using Haskell (e.g. using the http-proxy package or something similar)

sjakobi commented 5 years ago

I ended up installing mitmproxy, and using its browser GUI mitmweb.

I then ran dhall with

$ env https_proxy=127.0.0.1:8080 SYSTEM_CERTIFICATE_PATH=/home/simon/.mitmproxy dhall type --file stable/jenkins/index.dhall --quiet

I noticed

  1. Requests to https://prelude.dhall-lang.org all take 800 ms to 1 s, which I find pretty slow. But since there are only ~120 files, we're done in under 2 minutes.

  2. Requests for https://raw.githubusercontent.com/dhall-lang/dhall-kubernetes/* mostly take 250–400 ms, but every 5–20 requests there's one that takes 1 s, and sometimes 3 s. This makes me wonder whether there's some kind of throttling at play.

  3. In a last run with the mitmdump CLI tool, the process ended like this:

$ env https_proxy=127.0.0.1:8080 SYSTEM_CERTIFICATE_PATH=/home/simon/.mitmproxy dhall type --file stable/jenkins/index.dhall --quiet
dhall: 
Error: The remote host took too long to respond

URL: https://raw.githubusercontent.com/dhall-lang/dhall-kubernetes/4ef46543ceff0e10978cca542952c56ee6e47b93/types/io.k8s.api.core.v1.NamespaceStatus.dhall
Tail of mitmdump output:

```
127.0.0.1:34334: GET https://raw.githubusercontent.com/dhall-lang/dhall-kubernetes/4ef46543ceff0e10978cca542952c56ee6e47b93/types/io.k8s.api.core.v1.LocalVolumeSource.dhall << 200 OK 55b
127.0.0.1:34336: clientconnect
127.0.0.1:34168: clientdisconnect
127.0.0.1:34336: GET https://raw.githubusercontent.com/dhall-lang/dhall-kubernetes/4ef46543ceff0e10978cca542952c56ee6e47b93/types/io.k8s.api.core.v1.NFSVolumeSource.dhall << 200 OK 68b
127.0.0.1:34170: clientdisconnect
127.0.0.1:34338: clientconnect
127.0.0.1:34172: clientdisconnect
127.0.0.1:34174: clientdisconnect
127.0.0.1:34176: clientdisconnect
127.0.0.1:34178: clientdisconnect
127.0.0.1:34180: clientdisconnect
127.0.0.1:34182: clientdisconnect
127.0.0.1:34184: clientdisconnect
127.0.0.1:34186: clientdisconnect
127.0.0.1:34188: clientdisconnect
127.0.0.1:34190: clientdisconnect
127.0.0.1:34192: clientdisconnect
127.0.0.1:34194: clientdisconnect
127.0.0.1:34196: clientdisconnect
127.0.0.1:34198: clientdisconnect
127.0.0.1:34200: clientdisconnect
127.0.0.1:34202: clientdisconnect
127.0.0.1:34204: clientdisconnect
127.0.0.1:34206: clientdisconnect
127.0.0.1:34208: clientdisconnect
127.0.0.1:34210: clientdisconnect
127.0.0.1:34212: clientdisconnect
127.0.0.1:34214: clientdisconnect
127.0.0.1:34216: clientdisconnect
127.0.0.1:34218: clientdisconnect
127.0.0.1:34220: clientdisconnect
127.0.0.1:34222: clientdisconnect
127.0.0.1:34224: clientdisconnect
127.0.0.1:34226: clientdisconnect
127.0.0.1:34228: clientdisconnect
127.0.0.1:34230: clientdisconnect
127.0.0.1:34232: clientdisconnect
127.0.0.1:34234: clientdisconnect
127.0.0.1:34236: clientdisconnect
127.0.0.1:34238: clientdisconnect
127.0.0.1:34240: clientdisconnect
127.0.0.1:34242: clientdisconnect
127.0.0.1:34338: GET https://raw.githubusercontent.com/dhall-lang/dhall-kubernetes/4ef46543ceff0e10978cca542952c56ee6e47b93/types/io.k8s.api.core.v1.Namespace.dhall << 200 OK 153b
127.0.0.1:34340: clientconnect
127.0.0.1:34244: clientdisconnect
127.0.0.1:34246: clientdisconnect
127.0.0.1:34248: clientdisconnect
127.0.0.1:34250: clientdisconnect
127.0.0.1:34340: GET https://raw.githubusercontent.com/dhall-lang/dhall-kubernetes/4ef46543ceff0e10978cca542952c56ee6e47b93/types/io.k8s.api.core.v1.NamespaceList.dhall << 200 OK 138b
127.0.0.1:34342: clientconnect
127.0.0.1:34252: clientdisconnect
127.0.0.1:34254: clientdisconnect
127.0.0.1:34256: clientdisconnect
127.0.0.1:34258: clientdisconnect
127.0.0.1:34342: GET https://raw.githubusercontent.com/dhall-lang/dhall-kubernetes/4ef46543ceff0e10978cca542952c56ee6e47b93/types/io.k8s.api.core.v1.NamespaceSpec.dhall << 200 OK 47b
127.0.0.1:34344: clientconnect
127.0.0.1:34260: clientdisconnect
127.0.0.1:34262: clientdisconnect
127.0.0.1:34264: clientdisconnect
127.0.0.1:34266: clientdisconnect
127.0.0.1:34268: clientdisconnect
127.0.0.1:34270: clientdisconnect
127.0.0.1:34272: clientdisconnect
127.0.0.1:34274: clientdisconnect
127.0.0.1:34276: clientdisconnect
127.0.0.1:34278: clientdisconnect
127.0.0.1:34280: clientdisconnect
127.0.0.1:34282: clientdisconnect
127.0.0.1:34284: clientdisconnect
127.0.0.1:34286: clientdisconnect
127.0.0.1:34288: clientdisconnect
127.0.0.1:34290: clientdisconnect
127.0.0.1:34292: clientdisconnect
127.0.0.1:34294: clientdisconnect
127.0.0.1:34296: clientdisconnect
127.0.0.1:34298: clientdisconnect
127.0.0.1:34300: clientdisconnect
127.0.0.1:34302: clientdisconnect
127.0.0.1:34304: clientdisconnect
127.0.0.1:34306: clientdisconnect
127.0.0.1:34308: clientdisconnect
127.0.0.1:34310: clientdisconnect
127.0.0.1:34312: clientdisconnect
127.0.0.1:34314: clientdisconnect
127.0.0.1:34316: clientdisconnect
127.0.0.1:34318: clientdisconnect
127.0.0.1:34320: clientdisconnect
127.0.0.1:34322: clientdisconnect
127.0.0.1:34324: clientdisconnect
127.0.0.1:34326: clientdisconnect
127.0.0.1:34328: clientdisconnect
127.0.0.1:34330: clientdisconnect
127.0.0.1:34332: clientdisconnect
127.0.0.1:34334: clientdisconnect
127.0.0.1:34336: clientdisconnect
127.0.0.1:34338: clientdisconnect
127.0.0.1:34342: clientdisconnect
127.0.0.1:34340: clientdisconnect
127.0.0.1:34344: GET https://raw.githubusercontent.com/dhall-lang/dhall-kubernetes/4ef46543ceff0e10978cca542952c56ee6e47b93/types/io.k8s.api.core.v1.NamespaceStatus.dhall << 200 OK 46b
127.0.0.1:34344: CONNECT raw.githubusercontent.com:443 << Error in HTTP connection: TcpDisconnect("(32, 'EPIPE')",)
127.0.0.1:34344: clientdisconnect
```

Note the large number of clientdisconnects at the end – I wonder whether that's some kind of resource leak that causes the server to block me.


While looking at the strace output I also noticed that we seem to re-read the SSL certificates on each request. It doesn't take long in total, but it might be worth optimizing anyway.

sjakobi commented 5 years ago

> we seem to re-read the SSL certificates on each request

That's probably because we construct a new Manager for every request:

https://github.com/dhall-lang/dhall-haskell/blob/e931451a2b2892fa373c9dfc94b31bcd764131be/dhall/ghc-src/Dhall/Import/HTTP.hs#L261-L263

I seem to remember that we didn't always do that, and I wonder whether it impairs our ability to handle HTTP connections…
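For comparison, here is a minimal, hypothetical sketch of the shared-Manager approach using http-client and http-client-tls. The `fetch` helper and the URL are illustrative, not the actual dhall-haskell code; the point is only that `newManager` runs once, so the TLS context (including the certificate store) is initialized a single time and connections to the same host can be pooled:

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL
import Network.HTTP.Client
    (Manager, httpLbs, newManager, parseRequest, responseBody)
import Network.HTTP.Client.TLS (tlsManagerSettings)

-- Illustrative helper: every request goes through the same Manager
fetch :: Manager -> String -> IO BL.ByteString
fetch manager url = do
    request <- parseRequest url
    responseBody <$> httpLbs request manager

main :: IO ()
main = do
    -- One shared Manager for the whole program; it keeps a connection
    -- pool, so repeated requests to the same host reuse connections
    -- instead of re-initializing TLS from scratch each time.
    manager <- newManager tlsManagerSettings
    body <- fetch manager "https://prelude.dhall-lang.org/Bool/not"
    BL.putStr body
```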

Gabriella439 commented 5 years ago

@sjakobi: Yeah, we should only be creating a Manager once. If we're doing so for every request, that's a bug.

Regarding the prelude.dhall-lang.org slowdown: prelude.dhall-lang.org makes a second request to GitHub for each file it serves, since it doesn't cache those requests. You can find the nginx configuration we currently use here:

https://github.com/dhall-lang/dhall-lang/blob/be63f373f6989dbb922822343550314748358328/nixops/logical.nix#L170-L199

I can enable simple HTTP caching for prelude.dhall-lang.org for all requests for individual version-pinned Prelude files.

sjakobi commented 4 years ago

After #1522:

$ env XDG_CACHE_HOME=.cache time dhall type --file stable/jenkins/index.dhall --quiet
4.15user 1.85system 4:53.92elapsed 2%CPU (0avgtext+0avgdata 233076maxresident)k
176inputs+48368outputs (0major+53820minor)pagefaults 0swaps

5x faster for me. Thanks @Gabriel439! :)

I'll try to figure out what else we could do to speed that up.

Gabriella439 commented 4 years ago

@sjakobi: I still need to enable server-side caching for the Prelude, although I'm not sure how much of an effect that will have

Gabriella439 commented 4 years ago

I took a stab at nginx server-side caching and was not successful, so I'm going to put it down for now unless we have evidence that this would significantly improve cold startup speeds. If somebody else wants to pick this up, I pushed a branch with the nginx configuration that I tried:

https://github.com/dhall-lang/dhall-lang/compare/gabriel/server_side_caching

Profpatsch commented 4 years ago

Are the network requests parallelized? Every nth one taking 1–3s shouldn’t matter if we do 10–50 at once. Maybe there’s a way to take advantage of HTTP/2 here?

Gabriella439 commented 4 years ago

@Profpatsch: They are currently not parallelized, but I believe they can be parallelized with some work
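For illustration, a hedged sketch of what parallel downloads could look like, using the async package's `mapConcurrently` together with a shared Manager. The `fetchAll` helper and the URLs are hypothetical, not the actual dhall-haskell implementation:

```haskell
import Control.Concurrent.Async (mapConcurrently)
import qualified Data.ByteString.Lazy as BL
import Network.HTTP.Client
    (Manager, httpLbs, newManager, parseRequest, responseBody)
import Network.HTTP.Client.TLS (tlsManagerSettings)

-- Illustrative helper: one lightweight thread per URL, so an
-- occasional 1-3 s response no longer stalls the rest of the batch
fetchAll :: Manager -> [String] -> IO [BL.ByteString]
fetchAll manager = mapConcurrently fetchOne
  where
    fetchOne url = do
        request <- parseRequest url
        responseBody <$> httpLbs request manager

main :: IO ()
main = do
    manager <- newManager tlsManagerSettings
    bodies <- fetchAll manager
        [ "https://prelude.dhall-lang.org/Bool/not"
        , "https://prelude.dhall-lang.org/Bool/and"
        ]
    print (map BL.length bodies)
```

In practice one would probably want to bound the concurrency (e.g. with a semaphore) rather than spawn a thread per import, since the flood of clientdisconnects observed above suggests that opening too many sockets at once can itself cause trouble.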

Gabriella439 commented 4 years ago

I forgot to mention here that one idea I have to address this is to create a proxy.dhall-lang.org that interpreters can optionally use as a forward proxy to download pre-resolved (and possibly pre-normalized) Dhall expressions. See: https://discourse.dhall-lang.org/t/rfc-proxy-dhall-lang-org/144

evanrelf commented 3 years ago

Hi there! I'm a long-time lurker but inexperienced user 👋 Just popping in to leave my 2 cents of feedback.


I was trying to evaluate a Dhall snippet from Gabriel's slides, on a fresh install of the latest dhall-haskell (1.37.1) (so nothing in cache), and it made for a confusing new user experience.

Downloading https://prelude.dhall-lang.org/v20.0.0/package.dhall and all its dependencies took several minutes, and dhall provided no feedback on what it was doing, so I was left staring at a blank screen.

The lack of feedback was really confusing. Since dhall with no command-line options behaves kind of like cat, I thought it was waiting for more input, or a ^D or something. After a while I started troubleshooting my Dhall installation ("Wrong version? Maybe I'll try the static binaries instead of using what Nixpkgs gives me?"). Thankfully I eventually let it sit long enough to see it finally evaluate and realize my mistake.


I think it would be really helpful to have some kind of additional feedback about what dhall is doing in cases like this. A progress bar would be cool, but even just a single log line saying it's downloading files would've cleared up my confusion. I also tried to reach for a --verbose flag while troubleshooting, but the dhall executable doesn't provide one. That might be a good incremental step.

EDIT: Probably should've left this comment on https://github.com/dhall-lang/dhall-haskell/issues/2062, but oh well.

Gabriella439 commented 3 years ago

@evanrelf: The main reason I haven't done this yet is because I'd like to first fix the original problem (the imports being slow) as much as possible before adding workarounds like logging or progress bars (although I acknowledge at some point we will need them).

One of the things I've been doing along these lines is setting up a global cache of binary-encoded and fully-resolved downloads for common Dhall packages (especially the Prelude) that Dhall interpreters can use to efficiently retrieve packages in a single HTTP request.

You can track the progress of that work here: https://discourse.dhall-lang.org/t/rfc-proxy-dhall-lang-org/144/10