Modules syntax config: "dir" being optional, but "main" relative to it is a footgun

cloudflare / wrangler-legacy

🤠 Home to Wrangler v1 (deprecated)

https://workers.cloudflare.com

Apache License 2.0

3.2k stars 337 forks source link

Modules syntax config: "dir" being optional, but "main" relative to it is a footgun #1951

Closed geelen closed 2 years ago

geelen commented 3 years ago

In the documentation:

In particular:

main (required): The relative path of the main module from dir, including the ./ prefix.

This tripped me up, and I think it'll trip up others. E.g. in the example immediately above:

[build]
command = "npm install && npm run build"

[build.upload]
format = "modules"
main = "./worker.mjs"

It's really not clear that this will actually be uploading a file called ./dist/worker.mjs. So, if someone copies a template, they're likely to make the same mistake I did, which is assume the example is building into the same directory as wrangler.toml, and then write ./dist/worker.mjs, which actually means ./dist/dist/worker.mjs, which won't work.

Worse still, the error message is misleading, see #1952

Suggested fix

Easiest option, make "dir" required, and make main not start with ./ (so it doesn't look relative to the wrangler.toml file):

[build.upload]
format = "modules"
dir = "./dist"
main = "worker.mjs"

Now at least there's some indication that main might be relative to dir, though I'd argue this is still confusing.

To suggest better solutions I think I'd need a bit more context on why some of these settings need to be there, however. I don't understand why "main" would be relative to "dir" at all, or why "dir" is even necessary? Surely this can be simplified? If there's a bunch of other files in "dir" that need to be included, can't they be specified in something like an include = [...] syntax? Or import './x.html' syntax in the entry point?

If we get rid of dir entirely, and just make main relative to wrangler.toml, we'd get this:

[build.upload]
format = "modules"
main = "./dist/index.mjs"

This is basically what we have in the example right now, but without the invisible footgun of a missing dir argument.

nilslice commented 3 years ago

Fully agreed, @geelen. I figure there is another reason to have the dir configurable vs. only using a path to main, but if not we should change this. Maybe @xortive could clarify?

xortive commented 3 years ago

dir and main are required because wrangler doesn't actually traverse the module graph from the entrypoint to figure out what to upload. Writing a parser for ES6 import/export syntax and re-implementing the node.js module resolution algorithm would have been a decent sized project on top of modules + custom builds support that I'd have expected to have subtle bugs until it got more mature -- if wrangler was written in JS where I could have used existing code, I'd have gone this route instead.

The dir + main solution was much simpler to implement and reason about, as it directly maps to what the runtime expects: a list of modules with names, and the name of the main module. The runtime has no concept of paths, but module names tend to be derived from paths.

@geelen I agree this is a bit of a footgun, I think making dir not optional would be great, but not sure how to introduce that without breaking folks. Until we have a way to parse the module graph, we're stuck with requiring dir in some form, whether as an implicit default or always required.

geelen commented 3 years ago

dir and main are required because wrangler doesn't actually traverse the module graph from the entrypoint to figure out what to upload.

Until we have a way to parse the module graph, we're stuck with requiring dir in some form, whether as an implicit default or always required.

I'm not sure I follow why this would be required? I'm talking about changing how the main path is interpreted: ./dist/worker.mjs being relative to wrangler.toml rather than ./worker.mjs being relative to the implicit dir variable. It shouldn't require parsing the module graph, at least to my understanding?

it directly maps to what the runtime expects: a list of modules with names, and the name of the main module. The runtime has no concept of paths, but module names tend to be derived from paths.

So, if I'm reading this correctly... dir = "dist" effectively becomes a glob of include = ["dist/**/*"], where each "file" becomes a module, indexed with its relative path to dist, i.e. dist/index.mjs becomes keyed with ./index.mjs? Without the dir argument, wrangler doesn't know which files to upload, and without the main argument matching exactly the relative-paths-from-dir format, the runtime doesn't know which "module" is the entry point?

As an aside, because main isn't ever actually interpreted by Wrangler as a path (it's just a string key to an implicit glob), it's never actually checked whether it exists before the upload takes place? Hence why #1952 reports an error from the API, not from the CLI before the upload.

That all makes sense to me now, but I'd class this as a leaking the runtime's model into the CLI tooling, rather than working from the user's perspective. If my understanding is correct (we need a list of files to upload and a named entry point), then I think the easiest change is the following:

[build.upload]
format = "modules"
main = "./dist/index.mjs"
include = ["./dist/**/*"]

The only optional entry is include, which defaults to [].

Semantics would be as follows:

main is relative to wrangler.toml
We resolve main and verify the file exists there, and always upload it as the entry point of the module
include follows normal glob syntax, i.e. {} for multiple matches, ! for exclusions etc
Files matched by include are deduped with whatever is in main, to handle the case when they're in the same directory
Files matched by include are uploaded with a key of their relative path to main. So, in the case where index.mjs references a file file.html in the same directory, import "./file.html" will work as expected.

I think this matches a use case where a user would start with a single file module, then look to "include" extra assets in the upload as a separate step. That separate step may not require the same build process as what generates the main module at all, for example:

[build.upload]
format = "modules"
main = "./dist/worker.mjs"
include = ["./public/**/*"]

In this case, the worker would need to reference files using a relative path (import "../public/index.html") but I think that's pretty understandable for a user (at least, it makes more sense than paths being relative to wrangler.toml, imo)

I think making dir not optional would be great, but not sure how to introduce that without breaking folks

Given this only affects module syntax workers, which are in prerelease, does that not give us a bit of scope to make a breaking change? I don't know if we have any infrastructure to support release notes on a wrangler upgrade, or a codemod of their config file, that sort of thing.

I believe wrangler.toml was going to get a date stamp for runtime versioning in future, we could introduce that here: anything without a datestamp but that is using module uploads gets an error that dir/main has changed semantics, and get the user to add the version in manually once they've fixed their config?

At the very least, we can throw an error if we see someone pass dir with a message to the release notes, and since we'll be verifying the path of main the user will get an early error instructing them to change ./worker.mjs to ./dist/worker.mjs. Not great, but at least it's explicit. Again, the error message can reference the release notes.

Thoughts?

xortive commented 3 years ago

I'm not sure I follow why this would be required? I'm talking about changing how the main path is interpreted: ./dist/worker.mjs being relative to wrangler.toml rather than ./worker.mjs being relative to the implicit dir variable. It shouldn't require parsing the module graph, at least to my understanding?

Right now wrangler's behavior is to upload everything in dir, using the relative slash path (so ./dist/index.mjs and .\dist\index.mjs become ./index.mjs so projects work across unix/windows) of every file in dir as the module name. This implies all import statements must be relative to dir, as there's only a single path that will work for importing a module.

Once we get into the business of allowing folks to use relative paths from a given module instead of from dir, wrangler must actually parse modules to read their import statements and include those modules in the upload, and since the runtime has no concept of paths in module names, if two files at different levels (therefore with a different relative path) import the same module, we have to replace the module name in the import statements with an absolute path before uploading (or some other scheme) for the import to work, or make the runtime treat module names as paths.

In this case, the worker would need to reference files using a relative path (import "../public/index.html") but I think that's pretty understandable for a user (at least, it makes more sense than paths being relative to wrangler.toml, imo)

I would 100% prefer this behavior, but implementing a way to parse es6 modules to read and modify their import statements from scratch is a tall order, and I don't think increasing wrangler's split in code between Rust and JS (in order to leverage parsers written in JS) is a good idea as wranglerjs has been a recurring source of pain due to the difficulty of installing it at runtime consistently.

Given this only affects module syntax workers, which are in prerelease, does that not give us a bit of scope to make a breaking change? I don't know if we have any infrastructure to support release notes on a wrangler upgrade, or a codemod of their config file, that sort of thing.

I believe wrangler.toml was going to get a date stamp for runtime versioning in future, we could introduce that here: anything without a datestamp but that is using module uploads gets an error that dir/main has changed semantics, and get the user to add the version in manually once they've fixed their config?

I don't believe module format workers are considered prerelease anymore but @Electroid could confirm. We could use the new runtime versioning scheme to change the behavior in the runtime to interpret module names as paths and to allow relative imports, which is likely easier than re-writing them in wrangler.

geelen commented 3 years ago

Hmm, I'm not sure I see the problem. I'm not suggesting going all the way to parsing import statements, just using a different base for the paths-as-names approach.

if two files at different levels (therefore with a different relative path) import the same module

I think I can just see this happening in a rare edge case:

Have a worker bundle A.mjs compiled into dist/A.mjs
Have more JS files, B.mjs and C.mjs outside the dist directory, that import each other, but aren't bundled

// A.mjs
import B from '../public/B.mjs'

// B.js
import C from './C.mjs'

In the second case, the runtime won't be able to find ./C.mjs since it's in the bundle with a key relative to A. Is that the edge case you're talking about? Seems to me the same thing is already possible if you had a subdirectory inside dist with two files referencing each other, or if any of the imports left off the .mjs extension (fine in a bundler, but not an exact string path match).

But in any case, a JS bundler will sort that out, even spanning directories. Are we expecting to be uploading module format workers in a deconstructured manner like this? esbuild is lightning fast and doesn't require a config file, as long as the assets are all JS. Are there non-JS assets we're expecting that will also import other files?

I would 100% prefer this behavior, but implementing a way to parse es6 modules to read and modify their import statements from scratch is a tall order

I don't think these are linked. Am I missing something?

change the behavior in the runtime to interpret module names as paths and to allow relative imports, which is likely easier than re-writing them in wrangler.

This might be a good idea for other reasons, though again I'm not proposing that we need to rewrite anything in Wrangler. Keep paths-as-names for now, just change the config and replace dir with include.

xortive commented 3 years ago

In this case, the worker would need to reference files using a relative path (import "../public/index.html") but I think that's pretty understandable for a user (at least, it makes more sense than paths being relative to wrangler.toml, imo)

I'm referring specifically to this line as impossible to implement without parsing modules -- In order to avoid parsing import statements, all paths need to be relative to the same point, either wrangler.toml (which is the only way that makes sense using the include glob config) or dir (how it is now). If you allow different paths to refer to the same module, we need to parse paths in import statements and convert them to use the canonical path that's actually used as the module name.

The rest of your proposal is possible (using globs instead of dir), but I don't think allowing folks to import from multiple directories and using wrangler.toml as the base for relative paths as module names is worth the effort over what we have now -- the usability problem here is dir being implicit, and the poor error message when the configured main module isn't present in the uploaded set of modules, both of which we could fix without breaking users (by adding dir to the default generated wrangler.toml for new projects, and improving the error message either in the runtime validator, or in wrangler itself).

Solving the paths-as-names abstraction being leaky would be nice, but bundlers handle that for us (as you pointed out) for the vast majority of use cases.

But in any case, a JS bundler will sort that out, even spanning directories. Are we expecting to be uploading module format workers in a deconstructured manner like this? esbuild is lightning fast and doesn't require a config file, as long as the assets are all JS. Are there non-JS assets we're expecting that will also import other files?

This is another reason we chose the current design -- the vast majority of modules will end up in the main bundle, since bundling is required for any non-trivial use case, and bundlers handle module resolution for us. In the majority of cases, only wasm modules, and the mjs shim used for allowing commonjs modules to be used will ever be uploaded separately from the bundle.

geelen commented 3 years ago

I'm referring specifically to this line as impossible to implement without parsing modules

Why do you think it's impossible? If the ./dist/worker.mjs is the entry point, and that's used as the reference point in calculating paths, then include: ["./public/**/*"] would add each file found, but with a path-as-name relative to the entry point (i.e. "../public/index.html").

In the standard case, if the entry point and all related files are together in dist, this ends up producing the exact same result as what we have right now. It just generalises better. And, for a single file upload, you only ever specify a single file: the exact path from wrangler.toml to the file to be uploaded.

xortive commented 3 years ago

with a path-as-name relative to the entry point (i.e. "../public/index.html").

AHH I see what you mean now, "../public/index.html" is actually the path-as-name -- not sure why that didn't click for me before.

And, for a single file upload, you only ever specify a single file: the exact path from wrangler.toml to the file to be uploaded.

I do quite like this simplicity for the common case, I just worry about the feasibility of introducing it now is all. This does seem like an improvement to me, and if we can make it happen without too much toil it'd be nice.

Do we really need a separate include from the globs used for determining module types, you think?

geelen commented 3 years ago

not sure why that didn't click for me before.

No worries! Glad I wasn't missing something major that would have scuttled the idea.

Do we really need a separate include from the globs used for determining module types, you think?

Oh the rules section? Actually, reusing that might make a lot more sense. But again, those paths are currently dir-relative and I'm proposing making them wrangler.toml-relative. But I suppose if we can find a way to make the change, we could change both...

Quick aside: can the entry point ever be anything other than an ESM file? And I'm not sure what fallthrough is used for?

First step I'd say is figuring out whether changes like this are possible to roll out without major disruption. Let's assume we're just going to do the simplest thing (i.e. just make dir required), see how much work that would be, and figure out if it's much worse to make further changes at the same time?

It still raises a couple of questions:

Do we warn on every wrangler command if there's a [build.upload] section but no dir argument, but still allow 'dist' through as a default value?
Or, should we crash with a descriptive message, saying that they should add dir = 'dist'?
Can we start with a warning for one point release of Wrangler, then start crashing?
Should we take a codemod style approach and auto-insert dir = 'dist' into wrangler.toml for the user? How long do we leave this code in for?

Personally, I think it'd be great to be able to auto-update wrangler.toml for the user, though probably not without an initial prompt. I'd also love to only do so if we detect the wrangler.toml as being "old". To that end, I think we could do the following:

wrangler_version = "1.18.0" # auto-inserted on project creation
type = "javascript"
name = "my project"
# ...

Since changes in config would only be rolled out with a new Wrangler version, if we track the version of a wrangler.toml file in this way, we can start to do deprecation warnings. For any file without a version, we can assume it was on 1.17.0.

With versioned config we don't need to upgrade the user immediately, and our notification messages get better:

> wrangler publish

⚠️ Note: You are running Wrangler v1.18.0 but wrangler.toml has version 1.17.0
  📝 Release notes: https://github.com/cloudflare/wrangler/blob/master/CHANGELOG.md#v1180
  👉 To auto-upgrade, run: wrangler upgrade

�  Running yarn typecheck && yarn build

...

This has the benefit of not touching a user's file without their explicit action (running wrangler upgrade or whatever we call it, maybe wrangler config update?). It also never prevents a user from deploying an existing project, for example, on a CI machine that's always installing the latest version of wrangler during a build.

At a certain point we can drop support for older versions with a message like this:

> wrangler publish

❌ Error: Wrangler v2+ only supports wrangler.toml of versions >= 1.18.0
  👉 Read the upgrade guide: https://developers.cloudflare.com/workers/cli-wrangler/upgrade-guide
  👉 To install Wrangler v1 instead, see: https://developers.cloudflare.com/workers/cli-wrangler/install-v1

If we put something like this in place I think we'll get a pretty smooth experience. It basically makes anything code-moddable feel "backwards-compatible"—since for older wrangler.toml files we know to run any transform on the config after we load it. But, when running wrangler config update we load, run the transform, then write back to the toml file, along with the updated version marker. No idea how to write that in Rust, but I'm game to try to learn :)

xortive commented 3 years ago

the wrangler upgrade explicit action to modify the toml is nice, I definitely prefer that (keeping old semantics working while notifying interactive terminals you can upgrade) so we don't break CI.

Note modifying wrangler.toml in particular is tricky without messing with formatting -- we have toml-edit in wrangler now, but it only supports toml 0.4 syntax while the main parser toml-rs supports the full toml 1.0 syntax (namely, dotted keys). There's a good chance some folks are using toml 1.0 syntax, and we'd fail to edit their config as a result.

geelen commented 3 years ago

Ha! Faced the exact same problem with FABs, which is why I went with JSON5 since I found that jju handles editing the file but preserving formatting and comments.

I see toml-rs as only supporting version 0.5, is there much difference between 0.4 and 0.5? Will toml-rs not preserve formatting at all?

xortive commented 3 years ago

Ah sorry I misremembered, 0.5 added dotted keys, while 1.0.0 I think is 0.5 plus only minor changes. Dotted keys are pretty nice, I’d expect some folks to be using them and some of our old examples used them before I noticed the discrepancy.

toml-rs doesn’t preserve formatting or comments at all.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity in the last 180 days. It will be closed if no further activity occurs in the next week. Please feel free to comment if you'd like it to remain open, and thank you for your contributions.

geelen commented 2 years ago

I suppose this is less relevant with the Wrangler2 beta, which actually crawls the module graph to figure out what to upload: https://github.com/cloudflare/wrangler2

silence48 commented 2 years ago

just commenting to say i wasted 2 hours on this issue.

threepointone commented 2 years ago

Just a note on how we handle this in wrangler2 - we're deprecating build.upload.main in favour of a top level main field (and there isn't a dir field). We also log a warning with the exact line to copy paste into wrangler.toml (see https://github.com/cloudflare/wrangler2/pull/528) The module system now applies to both modules and service worker format workers (we've also deprecated build.upload.format since we can infer it automatically)

We're working on docs and migration guides before we roll this out to GA.

geelen commented 2 years ago

ace rimmer, what a guy!