Lazy commands - Githubissues

When writing large CLI application, we find ourselves in a pickle. Let's say we have commands similar to:

import something from './lib/something';
import somethingElse from './lib/somethingElse';

export class MyCommand extends Command {
  async execute() {
    something();
    somethingElse();
  }
}

The something and somethingElse functions aren't needed until MyCommand is executed, but since they are in a top-level import the generated code will still import them before even evaluating the command file. At the scale of a large application, those imports start to slow down the startup by a significant factor. We can mitigate it a little by doing something like this:

export class MyCommand extends Command {
  async execute() {
    const [{default: something}, {default: somethingElse}] = await Promise.all([
      import('something'),
      import('somethingElse'),
    ]);

    something();
    somethingElse();
  }
}

But that's really verbose, and that's not even what people doing things like this do (they instead just call import multiple times in a row, like top-level imports, except that it prevents the runtime from fetching / parsing the modules in parallel, making sync something that could be parallelized).

A second problem is that even if the imports are moved into execute, just running files has a cost. They need to be read, parsed, evaluated, and all that when they don't actually contribute to anything at all for the purpose of the command parsing. This problem is exacerbated when using transpilers, as the cost can easily reach hundreds of ms for larger CLIs.

The first point can be solved by the Deferring Module Evaluation proposal, but it's currently still at stage 2 (cc @nicolo-ribaudo in case you're interested by this thread / practical use case), and even with that we'd still have the problem of the files being executed at all (probably not as much a problem if you don't use a transpiler).

Ideally, I'd like to find a way to solve both points.

One strategy would be to tweak the core so that the state machine starts small, and progressively expand as we find new tokens we don't know how to support.

Let's say we have commands set version <arg>, set version from sources, and install. Let's imagine that, instead of the fully CLI-aware state machine we currently provide to runMachine, we instead provide an empty state machine. The runMachine function would accept a "failsafe callback"; this callback would take the current state machine, a stream of token, and return one of three values: ABORT, FEED, or another state machine.

We'll now parse the following CLI input:

["set", "version", "from", "sources"]

Things would go like this:

The token set is consumed. No possible states are found.
The failsafe callback would be called with ["set"] as token stream.
It would look on the filesystem for <root>/commands/set/index.js.
The folder (<root>/commands/set) would exist, but not the index.js file. FEED would be returned.
The failsafe callback would be called again, this time with ["set", "version"] as token stream.
It would look on the filesystem for <root>/commands/set/version/index.js.
The folder would exist, and so would the index.js. A new state machine would be returned, and runMachine would merge it with the existing one.
The token set is re-consumed. A matching state is found.
The token version is consumed. A matching state is found.
The token from is consumed. A matching state is found (the <arg> from set version <arg>) but since it's an argument the failsafe activate nonetheless, this time with ["set", "version", "from"].
It would look on the filesystem for <root>/commands/set/version/from/index.js.
The folder (<root>/commands/set) would exist, but not the index.js file. FEED would be returned.
- Note: If the folder didn't exist, ABORT would have been returned, and runMachine would have simply accepted from as being <arg> without more objection.
The failsafe callback would be called again, this time with ["set", "version", "from", "sources"] as token stream.
It would look on the filesystem for <root>/commands/set/version/from/sources/index.js.
The folder would exist, and so would the index.js. A new state machine would be returned, and runMachine would merge it with the existing one.
The token from is re-consumed. Two matching states are found.
The token version is consumed. A matching state is found, and the <arg> alternative is abandoned.

This approach allowed us to avoid having to make other calls than 4 filesystem calls. It however has a couple of thorny aspects:

For this to work, we need a way to lazily evaluate the command files (the index.js files). This isn't a problem in CJS-land, we have require(). However in ESM, we only have import(), which is asynchronous. That requires to turn runMachine into an async function (breaking change).
Options can be specified before a path. For example, if set version from sources has a --path option, then the user may call it via --path=foo set version from sources. It means that if an option is there, they need to be skipped for the purpose of the failsafe function.
Worse, it's also possible to write --path foo set version from sources. Since we don't have the state machine, we don't know that the foo token is the value associated to the --path option (for all we know, there could be a foo set version from sources command with a --path boolean option). Even worse, since options may have any numbers of arguments (tuples), it could be a from sources with a --path foo set version option!
I wonder how it would interact with #89 (command completion, cc @paul-soporan). In the worst case we can disable the laziness for the purpose of the command completion, but it'd be interesting to find a way to merge them together at some point.

To solve that, if we detect options tokens first, we need to follow an annoying dance. If we assume --path foo set version from sources, then the engine will need to call the failsafe callback on each of ["foo"] / ["set"] / ["version"] / ["from"] / ["sources"], and extend the token stream for each alternative as long as the callback returns either of FEED or a state machine (only ABORT should stop the alternative from being explored). Fortunately, ABORT will be the main result, so in practice only a single alternative will be crawled.

In practice, doing this will require:

Make runMachine asynchronous.
Add a enum FailsafeResult { Feed, Abort } enum.
Add the failsafe callback as part of the runMachine option bag ((tokens: string[]) => FailsafeResult | StateMachine).
Implement a mergeStateMachines function (with tests). Perhaps makeAnyOfMachine is actually enough?
Implement the behaviour described above.
Add tests (first for simple cases, then more complex ones).
Write a Node.js utility (in a separate file) that takes a filesystem path and return a failsafe callback that checks the filesystem. The important part is to avoid adding a filesystem dependency to the core implementation, so that we can also use that on other types of CLIs (for example Yarn, which is bundled as a single file).

arcanis / clipanion

Lazy commands #151