Proposal: parse and stringify packages

43081j commented 3 months ago

The yaml package is currently 600KB+

Not exactly half and half, but that is split between the parser and the stringifier.

Quite a lot of us don't need the stringifier and only use this package to parse files (reading config etc).

So I wonder if there's opportunity here for us to publish some scoped packages alongside the main package, something like @whatever-yaml/parse and @whatever-yaml/stringify.

This would cut the size of a huge amount of packages. As part of the ecosystem cleanup, we'd be happy to help out with contributions etc too, to get this over the line if you think its a sensible idea

if you're not open to this or have doubts, do let me know and we can close if it turns out to be a bad idea

on a side note, a huge portion of the package is the browser module too. i wonder if these days we can somehow merge these back into one, rather than having to ship double.

eemeli commented 3 months ago

Splitting the parser and stringifier apart is ... hard. Not only is the code deeply intermingled, but there are also corner cases where the parser is required for stringification. Consider this brief bit of valid YAML, for instance:

&a [ *a ]: b

That's a mapping with a single value pair, where the key is an array with a single value, the array itself. This sort of self-referentiality is perfectly fine in YAML and generally in JS as well, but in this instance it's in a mapping key, which by default is represented by an Object in JS, and there we need to stringify it:

YAML.parse('&a [ *a ]: b')
→ { '[ *a ]': 'b' }

Of course that stringification is avoidable by targeting a Map instead:

YAML.parse('&a [ *a ]: b', { mapAsMap: true })
→ Map(1) { <ref *1> [ [Circular *1] ] => 'b' }

But the point is, stringification is sometimes required when parsing YAML into JS, and it could not be replaced by JSON stringification, as it does not support self-referentiality.

on a side note, a huge portion of the package is the browser module too. i wonder if these days we can somehow merge these back into one, rather than having to ship double.

I would be very interested in being able to only ship a single build. They're currently separate because they have different polyfill needs, and it's plausible for a browser build to be better able to treeshake some parts of a ESM rather than CJS build. The node build will almost certainly need to stay as CJS until --experimental-require-module is unflagged and available in all supported Node.js versions -- which will probably take a few years.

If there is a way to drop the separate browser build, especially in a non-breaking way, I'd be very interested.

43081j commented 3 months ago

are there other areas you're aware of where stringify is used during parsing?

maybe its possible we'd have a subset of stringification used for generating these keys since its unlikely it'd need the full set of functionality. or is that not true and you could have whatever value you want in those later-stringified keys?

as for the browser bundle, it does make things awkward when trying to ship a dual package for sure. whatever solution we go with, we'll end up shipping two copies of the package in one (cjs, esm) until we can go full esm one day.

i suppose the only way you'd get around that is by publishing a separate esm package or making the entrypoint async (so you can use a dynamic import in a CJS wrapper instead of doubling up the code)

eemeli commented 3 months ago

are there other areas you're aware of where stringify is used during parsing?

Not sure; there could be something in error reporting. But keep in mind also that the code is structured so that parsing and serialization both go through an AST/data model representation of YAML, which supports operations in both directions. For example, take a look through some of the files under src/nodes/ or src/schema/ to get an idea how intermingled they are.

Separating them out sounds like it'd be a very deep refactor of the codebase, resulting in code that would require looking in two separate places when working with any one part of the data model.

maybe its possible we'd have a subset of stringification used for generating these keys since its unlikely it'd need the full set of functionality. or is that not true and you could have whatever value you want in those later-stringified keys?

YAML keys can hold any values, the previous was just a minimal example.

i suppose the only way you'd get around that is by publishing a separate esm package or making the entrypoint async (so you can use a dynamic import in a CJS wrapper instead of doubling up the code)

I am not interested in requiring an async import of the package. You may of course await import('yaml') if you so wish, but require('yaml') must continue to work in CJS.

43081j commented 3 months ago

so in your example, the key could be any valid YAML. meaning we need the full stringifier just in case, since we might need to stringify anything into a key

seems there's not a lot we can do if that's the case

similar on the dual package front. duplicating the code is a waste but if you want esm/cjs in one package, it is all you can do given the constraints

i'll have a deeper dig around the source when i get time to see if there's still some savings to be had, but these two are the main ones and we can't do much about either

conartist6 commented 3 months ago

In terms of packaging why not just build to CJS and ship that with an ESM entry point that just re-exports from CJS?

eemeli commented 3 months ago

@conartist6 That's doable, if you can show that users of the browser build would not be too negatively affected by the packaging change, and that users of the CJS build would not be too negatively affected by the need to transpile the code more than it is currently.

Back when I originally set up the ESM and CJS builds, IE11 was still a thing that needed to be accounted for, and the browser build needed a lot of transpilation and polyfilling, even if the packaging was more modern. But the yaml@2 browserslist query defaults, not ie 11 doesn't even include any of the Chakra-based Edge versions anymore, so maybe the difference is much less?

To be explicit about it, I don't really have the bandwidth atm to study this, but I'd welcome someone else looking into the issue.

conartist6 commented 3 months ago

Totally understand you don't have the bandwidth. This issue is of interest to a pretty significant and active community in e18e, so if we knew what good changes might look like I'm pretty sure we could get the work done.

On the topic of what the changes could look like: It sounds to me from what you're saying like a new major version that forces mapAsMap: true would be sufficient to unlock these gains, would it not? It's generally the kind of thing I like to see in a major bump: you give something and you get something that makes it worth your while to do the upgrade.

eemeli commented 3 months ago

It sounds to me from what you're saying like a new major version that forces mapAsMap: true would be sufficient to unlock these gains, would it not?

That would resolve some of the issues, but using Map rather than Object is not necessarily more ergonomic for most use cases, as most YAML uses scalar keys that can be much more easily stringified. It would also still leave the code structure questions unaddressed, as that would need to be split as well, and that's a very deep refactor.

eemeli / yaml

Proposal: parse and stringify packages #569