denoland / deno

A modern runtime for JavaScript and TypeScript.
https://deno.com
MIT License

Startup time for CLI applications #11916

Open jjallaire opened 3 years ago

jjallaire commented 3 years ago

Writing to see if there is something we can do about startup time for Deno CLI applications. We have a moderately sized CLI application (25k lines of TypeScript code) that nevertheless takes 1.2s to start up and print its help (i.e. without executing any of its main code paths). Compare this to deno help, which takes 0.01s on the same system (a 2019 MacBook Pro w/ a 2.4GHz i9).

Our application is Quarto (https://quarto.org) which is a Pandoc-based static site generator focused on scientific and technical writing.

We have a few concerns about startup time given that quarto commands are intended to be run at "interactive speed":

1) I'd imagine that on less capable systems this time could be upwards of 2 or 3 seconds;

2) As our LOC grows (likely by 2 or 3x) it seems like the latency will grow linearly (we've noted that a simple 'hello.js' script can be run by deno at approximately the speed of deno help);

3) Our tool (and I'd surmise many other developer oriented CLIs) are often called from IDEs (e.g. in save hooks) where latency can be a problem.

4) CLI tool invocations are often stacked together (e.g. think of bash scripts that orchestrate git workflows) so the per-invocation overhead can really, really add up.

As an aside, I think deno has huge potential for the creation of dependency-free, developer-oriented CLIs. The only other viable options are Rust (which is not as broadly accessible as JS) and Go (which must be learned afresh by many devs and ultimately won't be able to keep up w/ the JS library ecosystem). I don't know if this is a major priority for the deno team (as server applications are likely to predominate over CLI applications), but in the case that it is, I think this is an issue that would be a show-stopper for many applications.

I can think of a few places to go from here:

1) We are somehow doing something wrong (e.g. not availing ourselves of tree shaking or otherwise bundling in a non-optimal fashion). We currently just use deno bundle --unstable --import-map <xxx> src/quarto.ts (where quarto.ts is our main entry point).

2) A future deno feature will address this. My understanding is that deno overcame its own startup latency issues by using v8 snapshots. It looks like there has been discussion of loading custom snapshots here (https://github.com/denoland/deno/issues/1877) but there are issues with ES module loading that prevent this. There has also been discussion of creating v8 snapshots during compile (https://github.com/denoland/deno/issues/8820). Is there any prospect of either of these features being implemented in the near term?

3) To overcome this we should be building a custom rust binary that uses deno_core and our own snapshot (as discussed here: https://github.com/denoland/deno/issues/4402). If this is the only practical solution available in the near term any pointers you have on examples or documentation related to this would be very, very much appreciated.

We have been incredibly happy with Deno as a development platform and want to be all-in on using it for the lifetime of our project. Given this, we'd be more than happy to fund work on any of the above items if that is helpful.

lucacasonato commented 3 years ago

Have you tried starting with --no-check? This disables type checking, which takes up the majority of startup time.

jjallaire commented 3 years ago

Yes we have tried with --no-check and the timings are the same (these timings are running against a JS file created with deno bundle so I'm guessing that type checking doesn't come into play in any case).

kitsonk commented 3 years ago

I ran deno info

and got the following:

type: TypeScript
dependencies: 603 unique (total 5.38MB)

The bundle output is 3.73MB.

The bundler is not "advanced" enough to do deep tree shaking of namespace imports, unlike some more sophisticated bundlers which will try to determine what parts of a namespace are actually used and elide the rest.

Also, I notice there are four top level awaits in the output bundle. These would effectively be blocking actions during execution and would appear to be counted as start-up time.
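
To make the blocking concrete, here is a minimal sketch (illustrative names, not Quarto's actual code) contrasting top-level await with explicit lazy initialization:

```typescript
// Sketch: why top-level await (TLA) counts against startup time.
// If a module awaits at the top level, every importer's evaluation is
// suspended until that promise settles. An explicit init function
// moves the same cost off the cold-start path.
export let wasmReady = false;

// TLA variant (commented out) -- all importers would wait here:
//   await instantiateWasm();

// Lazy variant: importing this module is free; the cost is paid on
// first use, and only if the feature is actually needed.
export async function initWasm(): Promise<void> {
  if (wasmReady) return; // idempotent: safe to call more than once
  // stand-in for the real (expensive) WASM instantiation
  await new Promise((resolve) => setTimeout(resolve, 0));
  wasmReady = true;
}
```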

My opinion is the following:

kitsonk commented 3 years ago

Opened #11918 for dealing with loading web assembly.

jjallaire commented 3 years ago

It so happens that deno_dom also has a Rust version of the library (currently only provided as a legacy Deno plugin so I'm using it w/ 1.12.2). I switched to the Rust version and 1.2s went to .83s (so roughly 30% of the time was being spent in deno_dom WASM loading). I'll try eliminating the redundant std modules and report back on what that does, but it sounds like if the WASM being loaded outside of deno_dom is on the order of another 1MB that would explain nearly all of the startup time.

I'm gleaning from this discussion that v8 snapshots might not in fact be helpful here? Is that correct, or could it be a combination of improving WASM loading and v8 snapshotting that brings the startup time closer to negligible?

kitsonk commented 3 years ago

> I'm gleaning from this discussion that v8 snapshots might not in fact be helpful here?

Hard to say for sure. We have yet to get ESM working with snapshots, and once you layer on the complexity of TLA, there are technical challenges to supporting them for general purpose user code. My supposition is that TLA is the critical path here, because it effectively blocks module evaluation until the top-level promises resolve and so directly impacts the user-perceived startup time. Snapshots wouldn't make that go away (and TLA might even be a barrier to using snapshots at all).

jjallaire commented 3 years ago

If we could eliminate our usages of TLA (haven't looked closely at the feasibility of this yet) would that unblock better parallelism for the WASM loading? (i.e. would it allow user code to execute immediately and WASM to be loaded only on demand?). Apologies if the answer to this is already above; I am not conversant enough w/ the architecture to follow all the implications.

kitsonk commented 3 years ago

@jjallaire there are 4 instances of top-level-await in your bundle, none of them coming directly from your own code. One of them is bootstrapping std/log. It is unlikely to be super critical, but we should consider eliminating that and simply providing a user API to set up the logger, instead of "forcing" it on the consumer.

The other 3 are all loading WASM, of which std/hash is duplicated across two different versions of std. That is really the only one your project can deal with directly at the moment. The biggest culprit is how deno_dom loads its WASM, and all of these should be addressed by #11918. What deno_dom is doing isn't "wrong": we don't have a good way of creating a bundleable, portable way to load WASM at the moment without using top-level-await and base64-encoding the module, which is very slow and blocks module evaluation.
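
As an illustration of the pattern being described (a sketch only; deno_dom's actual loader differs in detail), a "portable" WASM module gets base64-inlined into the JS bundle and then decoded and compiled at module evaluation time:

```typescript
// The bundled form inlines the binary as a base64 string. Decoding it
// and compiling the WASM behind a top-level await is what blocks
// module evaluation. The payload below is just the 8-byte WASM header
// (an empty module), for illustration.
const wasmBase64 = "AGFzbQEAAAA=";

export function decodeBase64(b64: string): Uint8Array {
  const bin = atob(b64);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  return bytes;
}

export const wasmBytes = decodeBase64(wasmBase64);

// In the pattern under discussion this would then be compiled with a
// top-level await, blocking every importer of the bundle:
//   const mod = await WebAssembly.compile(wasmBytes);
```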

jjallaire commented 3 years ago

Here is our import map (https://github.com/quarto-dev/quarto-cli/blob/main/src/import_map.json):

{
  "imports": {
    "async/": "https://deno.land/std@0.97.0/async/",
    "fmt/": "https://deno.land/std@0.97.0/fmt/",
    "flags/": "https://deno.land/std@0.97.0/flags/",
    "log/": "https://deno.land/std@0.97.0/log/",
    "path/": "https://deno.land/std@0.97.0/path/",
    "fs/": "https://deno.land/std@0.97.0/fs/",
    "hash/": "https://deno.land/std@0.97.0/hash/",
    "io/": "https://deno.land/std@0.97.0/io/",
    "encoding/": "https://deno.land/std@0.97.0/encoding/",
    "uuid/": "https://deno.land/std@0.97.0/uuid/",
    "testing/": "https://deno.land/std@0.97.0/testing/",
    "http/": "https://deno.land/std@0.97.0/http/",
    "signal/": "https://deno.land/std@0.97.0/signal/",
    "ws/": "https://deno.land/std@0.97.0/ws/",
    "textproto/": "https://deno.land/std@0.97.0/textproto/",
    "datetime/": "https://deno.land/std@0.97.0/datetime/",
    "xmlp/": "https://deno.land/x/xmlp@v0.2.8/",
    "cliffy/": "https://deno.land/x/cliffy@v0.19.3/",
    "lodash/": "https://deno.land/x/deno_lodash@v0.1.0/",
    "fuse/": "https://deno.land/x/fuse@v6.4.1/",
    "deno_dom/": "https://deno.land/x/deno_dom@v0.1.13-alpha/",
    "port/": "https://deno.land/x/port@1.0.0/",
    "puppeteer/": "https://deno.land/x/puppeteer@9.0.1/",
    "media_types/": "https://deno.land/x/media_types@v2.10.1/",
    "observablehq/parser": "https://cdn.skypack.dev/@observablehq/parser@4.4.4",
    "acorn/walk": "https://cdn.skypack.dev/acorn-walk@8.1.1",
    "acorn/acorn": "https://cdn.skypack.dev/acorn@8.4.1"
  }
}

As you can see, we are attempting to pin to 0.97.0 of the standard library across the board. However, it appears that several of our dependencies are doing their own pinning, which results in the duplication. Is there any way around this, or is this sort of duplication inevitable?

In terms of hash, it looks like it has 150kb of WASM, so if we somehow get rid of both of those loads we get another ~ 12% off of the startup time -- that would put 42% of the original 1.2s of startup time down to WASM loading.

So it seems like there would still be 700ms of startup time (on a very fast laptop) after we eliminate the WASM problem. It still feels like there is a significant problem to solve here -- we deal with a bunch of interpreted languages in our project (mostly Python and R), and in spite of those runtimes being extremely slow, their interpreters have close to zero startup cost (so chaining a bunch of calls to them is mostly free).

If Deno CLI applications have a non-trivial startup cost that is linear with bundle size (that's of course speculation at this point), I fear it will rule out using Deno for CLI tools (except for fairly small ones).

Hopefully this could be resolved by using a v8 snapshot of the bundled JS file -- if you think there is some chance of that working, we could try using deno_runtime (or deno_core?) in a Rust binary with said snapshot of our bundled source. LMK if you think this might be worth pursuing (you noted above that TLA might be incompatible with snapshots, in which case I guess we'd need to purge all uses of TLA before attempting the snapshot approach?).

jjallaire commented 3 years ago

Attempting to determine the source of the additional hash import, I can see in the generated bundle a reference to https://deno.land/std@0.91.0/hash/_wasm/wasm.js, however I haven't been able to ascertain what is bringing it in. So far I have just inspected the source code of our imports -- is there a more systematic way to do this?

Update: I discovered the source of v0.91.0/hash by experimentation. Still interested in whether there is a more automated approach and whether or not multiple versions of standard libraries are inevitable if libraries pin their std dependencies.
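
One way to automate this (a sketch; the exact `deno info --json` output shape may vary by Deno version) is to walk the JSON module graph and list every module whose dependencies resolve to the target URL:

```typescript
// Walk the `modules` array from `deno info --json` output and report
// which modules import a given specifier. The sample graph in the
// usage below is made up for illustration.
interface DepEntry {
  code?: { specifier?: string };
}
interface ModuleEntry {
  specifier: string;
  dependencies?: DepEntry[];
}

export function findImporters(
  modules: ModuleEntry[],
  target: string,
): string[] {
  return modules
    .filter((m) =>
      (m.dependencies ?? []).some((d) => d.code?.specifier?.includes(target))
    )
    .map((m) => m.specifier);
}
```

In practice you would parse the real graph, e.g. `findImporters(JSON.parse(infoJson).modules, "std@0.91.0/hash")`.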

jjallaire commented 3 years ago

Further update, the following three remediations reduce startup time from 1.2s to 480ms:

1) Use the native Rust implementation of deno_dom rather than the WASM implementation
2) Eliminate use of hash in application source code (replaced with blueimp-md5)
3) Eliminate the puppeteer dependency (which was only needed for tests and was the source of the other hash import and ~ 900kb of bundled code)

Bundle size is now 2.2mb and startup time is 480ms.

This is obviously a huge improvement, but possibly still of significant concern for CLIs that need to run at interactive speed, especially if startup time grows linearly with bundle size.

bartlomieju commented 3 years ago

@jjallaire you can use deno info <url> to get a tree of all the dependencies (or use https://deno-visualizer.danopia.net/ for browser view).

jjallaire commented 3 years ago

@balupton Wow fantastic! Thank you :-)

balupton commented 3 years ago

> @balupton Wow fantastic! Thank you :-)

Not sure why I was @ replied...

kitsonk commented 3 years ago

Regarding the overriding of other imports, yes, it can be done easily. If you are convinced globally that the version of the import can be pinned to another version, you would do something like this in the import-map:

{
  "imports": {
    "https://deno.land/std@0.91.0/": "https://deno.land/std@0.97.0/"
  }
}

If you find that for some reason, another dependency needed a specific version of std, then you can use the "scopes" property to target specific modules.
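
For example (the `x/some_dep` URL here is hypothetical), a scoped mapping lets one dependency keep its pinned std version while everything else is redirected:

```json
{
  "imports": {
    "https://deno.land/std@0.91.0/": "https://deno.land/std@0.97.0/"
  },
  "scopes": {
    "https://deno.land/x/some_dep@1.0.0/": {
      "https://deno.land/std@0.91.0/": "https://deno.land/std@0.91.0/"
    }
  }
}
```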

As @bartlomieju mentioned, the way I knew all the dependencies I listed in my first comment was by using deno info with your import map and the --json output. Specifically:

deno info --import-map https://raw.githubusercontent.com/quarto-dev/quarto-cli/main/src/import_map.json --json https://raw.githubusercontent.com/quarto-dev/quarto-cli/main/src/quarto.ts

jjallaire commented 3 years ago

Note that deno_dom has now implemented an option for explicit initialization. See https://github.com/b-fuze/deno-dom#webassembly-startup-penalty.

@kitsonk Thanks for the pointer on overriding imports. We will try that next. We are currently at 330ms to run 2.8mb of bundled JS (750k of which is the deno_dom WASM) that executes no user code (instrumented to Deno.exit as the first line of code). Is this surprising to you? Might the next step be investigating v8 snapshots, or could there be something else worthy of investigation before we go there?

kitsonk commented 3 years ago

@jjallaire is that with no_init on deno_dom? Again, I don't think snapshots are the critical path. It is WASM loading, especially base64-encoded WASM. Similarly sized workloads without top-level-await and without WASM loading start up a lot quicker, even without customised v8 snapshots.

jjallaire commented 3 years ago

Yes that is with no_init on deno_dom. There is no longer any WASM loading in the bundle (the only TLA is logging) as we also got rid of usages of std hash.

I just did a quick experiment to change our std pinnings from 0.97.0 to 0.105.0 -- this change resulted in 200K of additional JS bundle size (not surprising given it is 8 versions newer) but, more interestingly, it took startup time from 330ms to 480ms!

Since that is a far greater proportional increase in execution time than in code size, it seems like something in the execution of the JS bundle is occupying that time. Is there a straightforward way to profile this?

jjallaire commented 3 years ago

Note that the WASM theory has definitely proved correct, as we have gone from 1230ms to 330ms by eliminating WASM loading (+ eliminating 900k of puppeteer that was only needed for tests). The remaining 330ms does seem puzzling though, as I wouldn't expect parsing & executing a 2.8mb JS bundle to take nearly that long.

jjallaire commented 3 years ago

More findings:

So it seems like code size has nothing to do with the problem and there is some non-trivial execution time being spent when loading the JS bundle (which is ~ 50% worse in v0.105.0 vs. 0.97.0)

jjallaire commented 3 years ago

You should be able to repro this w/ the following:

git clone https://github.com/quarto-dev/quarto-cli.git
cd quarto-cli
./configure-macos.sh # or ./configure-linux.sh
cd package/src
./quarto-bld prepare-dist   
time ../dist/bin/quarto

Invoking with no args is effectively a no-op (I've confirmed that once user code gets control total execution time is 6ms).

Bundled source code is at ../dist/bin/quarto.js

jjallaire commented 3 years ago

Attempting to profile: first I tried the Chrome profiler, which didn't work, then the --prof V8 flag that I found references to in older versions of the Deno documentation. That produced a .log file, which I then processed with the node --prof-process command (which might not even be valid). I thought another approach would be to build v8 from source to get the d8 shell, but wasn't quite sure how to do that w/ deno's build toolchain (the previous docs referenced using ninja).

Anyway, both the raw log file and the processed version are attached. Not sure whether they include anything useful. isolate-0x7fbd70008000-79520-v8.log processed.txt

jjallaire commented 3 years ago

There were some clues in the profile data that acorn parser initialization might be in play. I speculatively removed all the code depending on acorn (plus the dependency itself), and that saved another 80ms.

It seems likely that the other parts of the 330ms will be of a similar nature (or things completely unavoidable such as bootstrapMainRuntime deno:runtime/js/99_main.js:531:32).

I'm happy to close this issue now, assuming you don't want to dig further into the profiler data. It does seem like if we are really spending 330ms reading the JS file and doing various expensive initializations therein (e.g. acorn data structures), then v8 snapshotting would indeed be of some help -- does that make sense?
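
One remediation pattern for initialization-heavy dependencies like acorn (a sketch, not Quarto's actual change) is to memoize the load behind a function, so the cost is deferred from bundle evaluation to first use; in real code the `load` argument would be something like `() => import("https://cdn.skypack.dev/acorn@8.4.1")`:

```typescript
// Defer an expensive dependency off the cold-start path: nothing is
// loaded at module-evaluation time; the first call to getParser()
// triggers the load, and later calls reuse the same promise.
type Parser = { parse: (src: string) => unknown };

let parserPromise: Promise<Parser> | undefined;

export function getParser(load: () => Promise<Parser>): Promise<Parser> {
  parserPromise ??= load(); // memoize: the loader runs at most once
  return parserPromise;
}
```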

bartlomieju commented 2 years ago

@jjallaire is this problem still occurring with the latest v1.17.1?

jjallaire commented 2 years ago

Yes, we do still see the same overhead for loading the JS in v1.17.1. We have a ~ 3mb JS bundle and it takes about 420ms to execute a "do nothing" command where the time that our typescript has control is < 10ms.

littledivy commented 1 year ago

Is this still an issue? FYI I'm tracking startup time optimizations in #15945

jjallaire commented 1 year ago

The improvements in JS parsing in Deno 1.25 took us from ~ 900ms of startup time to ~ 150ms of startup time. This is acceptable but we'd certainly love to see an improvement on the 150ms that are remaining!

bartlomieju commented 1 year ago

@jjallaire could you give any update on this issue? We've done multiple improvements to the startup time in the past two months so I'm wondering if this issue is still relevant.

jjallaire commented 1 year ago

@bartlomieju Thanks for following up! We are just completing our v1.3 release which is pinned to v1.28.2 from November. As we begin v1.4 development we'll go to the latest Deno and report back here on improvements we observe in pure overhead during startup. cc @cscheid

jjallaire commented 1 year ago

Hi @bartlomieju we are now on Deno v1.33.1 so I have some metrics to share. With Deno v1.28.2 our "cold start" time for a do-nothing invocation was ~ 145ms. With Deno v1.33.1 it is ~ 110ms which is a very nice improvement indeed!

JohannesRudolph commented 1 year ago

@bartlomieju unfortunately I'm observing some regression in cold startup performance in builds created with deno compile for our OSS tool collie (see releases)

Here are some stats, using the fastest result of three consecutive invocations of time collie:

deno 1.21, collie 0.12 binary - real    0m0.172s
deno 1.25, collie 0.15.3 binary - real    0m0.216s
deno 1.34, collie 0.15.3 binary - real    0m0.249s

Note that between collie 0.12 and 0.15.3 we also changed a bunch of code internal to the tool and upgraded cliffy, the CLI framework that we use. So I can't quite rule out that that's partly to blame for the first drop, but between deno 1.25 and 1.34 we are using identical code.
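
The best-of-three measurement can be scripted along these lines (a sketch: assumes GNU date with %N nanosecond support; CMD stands in for the binary being measured):

```shell
#!/usr/bin/env bash
# Run a command three times and report the fastest wall-clock time in ms.
CMD=${CMD:-true}   # substitute the CLI under test, e.g. CMD=collie
best=
for _ in 1 2 3; do
  start=$(date +%s%N)
  $CMD >/dev/null 2>&1
  elapsed=$(( ($(date +%s%N) - start) / 1000000 ))
  if [ -z "$best" ] || [ "$elapsed" -lt "$best" ]; then
    best=$elapsed
  fi
done
echo "best: ${best}ms"
```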