dart-lang / language

Design of the Dart language

Pick a supported serialization format for host<->macro communication #3872

Open davidmorgan opened 4 months ago

davidmorgan commented 4 months ago

The dart_model exploration currently uses JSON but there is no particular reason to think that's a good choice :)

Even if we don't follow the dart_model direction, it seems likely we want to formalize the serialization format so that we can have stable serialization. Some well-supported system that allows specification of a schema is probably needed so we can actually have guarantees of compatibility.

It looks like macros often do only trivial computation and so they are going to spend most of their run time deserializing then serializing. The best way to optimize this is to do less work, to send less data; but even then it might be beneficial to also have the fastest possible serialization and deserialization.

Protocol buffers are an obvious possibility; FlatBuffers might be more interesting since they are explicitly designed for cases where serialization/deserialization is an important overhead. If we follow the dart_model direction then the capability to quickly compute and apply deltas will also be important.

Sending this straight to the backlog since the precise format doesn't matter too much yet.

@jakemac53 @scheglov FYI :)

rrousselGit commented 4 months ago

Would this possibly be related to macros sending information to the analyzer, which the analyzer can then use to emit custom diagnostics?

davidmorgan commented 4 months ago

Macros are fundamentally built on serialization: when you call async methods on the macro API, the macro communicates with the host via serialized data. The hosts we have today are the analyzer and the CFE.
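To make that concrete, here is a rough sketch of the shape of that round trip. It is not the real macro API or wire format; the class, method names and JSON transport below are invented purely for illustration:

```dart
import 'dart:async';
import 'dart:convert';

// Hypothetical sketch only (not the actual macro protocol): what looks like an
// ordinary async query on the macro side is really a request serialized to the
// host and a response deserialized back.
class HostConnection {
  HostConnection(this.send);

  /// How a serialized request reaches the host (socket, stdin/stdout, ...).
  final void Function(String json) send;

  final _pending = <int, Completer<Map<String, Object?>>>{};
  int _nextId = 0;

  /// Called by the transport when a JSON message arrives from the host.
  void onMessage(String json) {
    final response = jsonDecode(json) as Map<String, Object?>;
    _pending.remove(response['id'])!.complete(response);
  }

  /// Serialize a request, wait for the host, deserialize the response.
  Future<Map<String, Object?>> query(String method, Map<String, Object?> params) {
    final id = _nextId++;
    final completer = Completer<Map<String, Object?>>();
    _pending[id] = completer;
    send(jsonEncode({'id': id, 'method': method, 'params': params}));
    return completer.future;
  }
}
```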

davidmorgan commented 4 months ago

I looked into this a little today. I don't like the FlatBuffers schema format; one possibility that comes to mind is to use a JSON schema coupled with some more efficient way of serializing JSON. Possibly we can do something like FlatBuffers, where data is accumulated into a buffer that is immediately ready to send.

jakemac53 commented 4 months ago

I don't know anything about FlatBuffers, do you have a TL;DR? How does it compare with the ad hoc format we have today?

jakemac53 commented 4 months ago

Also, the idea of using FFI to enable multithreading + shared object memory with the host has been brought up previously. I have no idea what that concretely means or what it would look like though. @mraleph I heard maybe you would be a good reference here?

jakemac53 commented 4 months ago

It sounds to me from some very quick reading that using FFI to write to a shared buffer (where?), and FlatBuffer to actually control those buffer formats, with Dart wrappers reading from those buffers directly, could be interesting.

davidmorgan commented 4 months ago

FlatBuffers have a schema that looks a bit like a proto schema. I played with it a bit, then ran into the fact that vectors of unions are not yet supported in the Dart implementation, so you have to work around that by wrapping union types in tables and then putting those in the vectors. I think it would end up quite ugly.

To write FlatBuffers you accumulate objects into a buffer; when you "deserialize" that buffer you create objects over it, but don't actually copy the data out.

My suspicion is FlatBuffers can be used for inspiration but we probably don't want to use them exactly. In the same library are FlexBuffers, which are a schemaless way of accumulating data in a buffer and reading it back; similarly, I think this is useful to look at but doesn't match exactly what we need.

davidmorgan commented 4 months ago

Yesterday I dug more into exactly what a JSON Schema is and does.

My takeaways:

I think from a usability point of view JSON is great: for example for golden tests it's very convenient if you can embed a block of JSON and be done. Macro authors will thank us.

Then there is the question of performance :)

For this I have an idea, which is that we define the protocol with a JSON schema, and structure libraries over it as if it's JSON underneath, but also provide an implementation that is backed by a binary buffer. The host can accumulate data into the buffer, then send it with no "serialize" step; the client can process it with no "deserialize" step.
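As a minimal sketch of what "no deserialize step" could mean on the reading side (the byte layout and class name here are invented for illustration, not the actual prototype): the receiver treats the incoming bytes as the data structure and only decodes individual values when they are looked up.

```dart
import 'dart:convert';
import 'dart:typed_data';

// Illustration only: the "map" handed to the macro is a view over the bytes it
// received; nothing is copied out until a value is actually looked up. The
// layout (length-prefixed UTF-8 key/value pairs) is invented for this sketch.
class BufferBackedMap {
  BufferBackedMap(this._bytes);

  final ByteData _bytes;

  String? operator [](String key) {
    var offset = 0;
    while (offset < _bytes.lengthInBytes) {
      final keyLength = _bytes.getUint32(offset, Endian.little);
      final k = _readString(offset + 4, keyLength);
      final valueOffset = offset + 4 + keyLength;
      final valueLength = _bytes.getUint32(valueOffset, Endian.little);
      if (k == key) return _readString(valueOffset + 4, valueLength);
      offset = valueOffset + 4 + valueLength;
    }
    return null;
  }

  String _readString(int offset, int length) => utf8.decode(
      _bytes.buffer.asUint8List(_bytes.offsetInBytes + offset, length));
}
```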

I wrote a quick proof of concept and some benchmarks; the headline results:

| Subject | Scenario | Data size (bytes) | Time per (ms) |
| --- | --- | --- | --- |
| JSON | createSerializeWrite | 2051219 | 46 |
| JSON | readDeserializeProcess | 2051219 | 70 |
| JSON | total | | 116 |
| JsonBuffer | createSerializeWrite | 1813834 | 27 |
| JsonBuffer | readDeserializeProcess | 1813834 | 10 |
| JsonBuffer | total | | 37 |

They show about a 3x speedup, which is in line with the benchmarks I did on the current macro protocol comparing JSON and binary.

On a large project the difference can be big enough to matter: on the order of 100s of milliseconds or maybe up to seconds if we are processing the whole project, so cutting 70% of that time is going to be noticed if the user is waiting on it. Of course if we are sending small deltas in response to a small number of queries the difference is much less important.

The downside to anything like FlatBuffers/JsonBuffer is that it adds constraints on how you create the data. Pretty much you have to create one thing at a time so it can be appended to the buffer as it's created ... but when you are creating a tree structure you naturally want to descend into the tree and create leaf nodes first.

In JsonBuffer I "solved" this by requiring you to specify keys up front, then space in the buffer can be reserved for the keys+pointers to values, allowing values to be written to the buffer as they arrive. You can of course just create all your data as maps then copy it into a JsonBuffer--but avoiding this "copy in" is exactly the point :) we want the data to be created "in place".

Rolling your own binary format is a somewhat scary/risky thing to do, but I think this approach may work: we limit the complexity by saying it's only equivalent to JSON and will not evolve over time; we use a JSON schema, which is well-tested and well-supported, to define correctness; and we also support actual JSON for convenience, for logging, for testing, and for people who want to use it to interoperate.

Because I think the performance question is likely solvable, I think a good next step will be for me to come up with a concrete example of a JSON schema, so we can see just how nice / horrible it is :)

Feedback and suggestions very much welcome/appreciated :) thanks!

davidmorgan commented 4 months ago

So with about a day of hacking ... and with some of that spent stuck on a validator bug ... and tweaking the model code at the same time ... and having learned about JSON schemas from scratch this week ... I was able to add a schema describing the current package:dart_model:

schema.json

There is an awkwardness that I had not anticipated: because JSON schemas are permissive by default, a single schema is not super useful on its own. So I think what you want is two schemas: a strict one that describes what the protocol is now, and a loose one that describes how it's allowed to evolve.

It seems reasonably straightforward to transform a loose schema into a strict one by saying "don't allow anything not explicitly specified" everywhere, and that's what I did to make progress.
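As a sketch of that transformation (assuming the schema has been decoded into plain maps; the function name is invented), the strict variant just closes every object schema to the properties it explicitly lists:

```dart
// Sketch only: mechanically turn a permissive ("loose") JSON schema into a
// strict one by disallowing anything that is not explicitly specified.
// Assumes the schema has already been decoded with jsonDecode.
Map<String, Object?> strictify(Map<String, Object?> schema) {
  final result = <String, Object?>{};
  schema.forEach((key, value) {
    if (value is Map<String, Object?>) {
      result[key] = strictify(value);
    } else if (value is List) {
      result[key] = [
        for (final item in value)
          if (item is Map<String, Object?>) strictify(item) else item
      ];
    } else {
      result[key] = value;
    }
  });
  // Close object schemas: only the listed properties are allowed.
  if (result['type'] == 'object' && !result.containsKey('additionalProperties')) {
    result['additionalProperties'] = false;
  }
  return result;
}
```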

This all seems fine.

I plan to look next at the classes we wrap the JSON with, probably we want to end up with some kind of generation from the schema, and it will need to tie somehow into the binary version mentioned in my previous post.

Hopefully I can wrap this up in a day or two then write up a concrete proposal, if there are any concerns at this point please let me know; also Tuesday's meeting will be a good opportunity to discuss.

mosuem commented 3 months ago

Feedback and suggestions very much welcome/appreciated :) thanks!

Just out of curiosity - why not go with protobuf?

davidmorgan commented 3 months ago

@mosuem good question, thanks :)

For users, JSON looks a bit better: it works with more tools (e.g. you can immediately get schema validation in VSCode) and gives a human-readable format that people recognize, and can use in tests, logs, etc.

And for the client libraries, we would prefer not to copy on serialization/deserialization, like the main proto implementations do. We expect that macros will spend almost all their runtime serializing/deserializing, because the actual logic is usually very simple. So, we want to minimize that overhead.

But I wouldn't say it's totally clear-cut that JSON is a better choice than protos; happy to hear arguments for protos (or any other format that gives schema + serialization).

jakemac53 commented 3 months ago

Fwiw I would love it if we didn't own the specifics around the binary encoding because that in itself becomes a thing we have to version.

davidmorgan commented 3 months ago

I think it's okay that we own the binary format: it would not just be a serialization format, it would also be how the client uses the data, since the client doesn't copy it out. So it's very closely tied to our specific use case.

In the worst case, which it's important to plan for when it comes to serialization ;) ... the host should be able to say "let's use plain text JSON". That way, if we discover a problem with the binary format, we can release analyzer/CFE versions that force plain text, and clients built with the bad binary format will start working.
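A hypothetical sketch of that escape hatch: the host picks the wire format during the initial handshake, so a host release can unilaterally force everyone back to plain-text JSON. The enum and negotiation shape are invented for illustration:

```dart
// Illustration only: nothing here is the real protocol. The client advertises
// what it can speak and the host makes the final choice, so a host that has
// discovered a problem with the binary format can simply stop allowing it.
enum WireFormat { json, binary }

const clientSupports = {WireFormat.json, WireFormat.binary};

WireFormat negotiate(Set<WireFormat> hostAllows) {
  final shared = clientSupports.intersection(hostAllows);
  // Prefer binary for speed, but plain-text JSON is always the safe fallback.
  return shared.contains(WireFormat.binary) ? WireFormat.binary : WireFormat.json;
}
```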

jakemac53 commented 3 months ago

I think it's okay that we own the binary format: it would not just be a serialization format, it would also be how the client uses the data, since the client doesn't copy it out. So it's very closely tied to our specific use case.

Something like FlatBuffers would allow for no copies though, right? I do really worry about owning this format - it seems not much better than the current situation, really. I think it would be a requirement to version the binary format in this world, and to support multiple versions.

davidmorgan commented 3 months ago

I think we have to version any serialization format we use, to guarantee that we don't end up stuck.

The expectation though is that the binary format would just be equivalent to arbitrary JSON, so it would not need to change when the JSON schema changes. So the goal would be to never release a new version. From my experimentation that doesn't look too hard, but of course we should be reasonably sure about that :)

FlatBuffers don't look like a great fit: they are more oriented towards tables with a fixed list of fields, rather than maps; and we need union types, which do exist but are awkward ... https://stackoverflow.com/questions/54445249/flatbuffers-how-to-allow-multiple-types-for-a-single-field ... The FlexBuffers format is closer to what we want, but not quite a fit; and because it's really not a lot of code, it looks feasible to just roll our own along those same lines.

jakemac53 commented 3 months ago

My main concern is just around any sort of optimizations we want to make; if we roll our own, it seems likely we will want to change it to enable performance optimizations later on. We can certainly try to learn from existing formats and get it right the first time, but it adds an extra area of potential breakage.

Plus it is just extra code to maintain, test, etc.

davidmorgan commented 3 months ago

Agreed, it'd be nice if we can find one that's a fit.

Looking closer at the FlexBuffers implementation, it really is very close to what we need; with passing in lazily evaluated maps it might already be a fit on writing ... just a bit awkward on reading. Possibly an adapter could work to make it fit into the current model. But then we might hit a similar desire to make changes to optimize :) so we'd still want to version it.

jakemac53 commented 3 months ago

Possibly an adapter could work to make it fit into the current model.

Does it need to fit into the current model, if it just replaces that model? I don't consider a textual JSON format a requirement.

davidmorgan commented 3 months ago

I meant, fit into how we want macro authors to actually interact with the data, via extension types.

At the moment I'm happy with the tradeoffs around JSON + custom binary format. Protos are another possibility, but we'd have to check performance; we should probably do that comparison anyway as a sanity check.

There's Cap'n Proto, which sounds like a fit, but there's no Dart implementation.

Other suggestions welcome :)

lrhn commented 3 months ago

It sounds like you're describing a general entity/relation model specification, with entities, properties and relations, and then the actual physical serialization format of that model is an implementation detail.

Using JSON as the serialization format is possible, because JSON is general enough to represent anything, and tree shaped object relations can be represented directly. Object relations that are DAGs or even cyclic are possible too using entities with IDs and relations represented indirectly by the ID.

I'd be worried if we use JSON-like map/list structures as the actual model representation in our code; that's too untyped. The serialization format is just that: something that both ends can agree on to represent the actual model, and what can be communicated between heaps. (If we can use the VM's object serialization directly, that's also an option, as long as both ends import the model classes.)

Macros are fundamentally built on serialization: when you call async methods on the macro API, the macro communicates with the host via serialized data.

There is definitely cross-heap communication. If we make all the involved objects immutable, we could potentially have a shared heap for those, shared between the heaps of the macro implementation and the macro host. Probably not worth the effort, but theoretically possible, so serialization is a choice, not an inevitability.

davidmorgan commented 3 months ago

The plan is to have generated extension types that match the definitions in the JSON schema

https://github.com/dart-lang/macros/blob/main/pkgs/dart_model/lib/src/dart_model.g.dart

with whatever convenience methods we want to add on top of that.
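For anyone not clicking through, a trimmed-down sketch of what such a generated extension type looks like; the type and field names below are illustrative rather than copied from dart_model.g.dart:

```dart
// Sketch of the generated-extension-type approach: the representation type is
// the JSON-compatible map itself, and typed getters plus convenience methods
// are layered on top with zero wrapping cost. Names are illustrative only.
extension type Interface.fromJson(Map<String, Object?> node) {
  Interface({required String name, List<String> members = const []})
      : this.fromJson({'name': name, 'members': members});

  String get name => node['name'] as String;
  List<String> get members => (node['members'] as List).cast<String>();

  // Convenience methods can be added next to the underlying representation.
  bool hasMember(String name) => members.contains(name);
}
```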

lrhn commented 3 months ago

Then I'd define the extension types first, and the JSON schema to match the API, not the other way around. (But I'd also prefer to not expose the parsed JSON to users at all. It's a horrible format, and I'd love to not be locked into the transport layer serialization format by making it part of the public API.)

davidmorgan commented 3 months ago

Well ... whether it's exposed to users or not depends rather on whether you consider the capability to cast to Map<String, Object?> as exposing the format ;)

I do think macro authors will be happy to see JSON for logging and for golden files, but otherwise I don't expect them to interact with it--the code they write will not look like it's operating on JSON.

lrhn commented 3 months ago

depends rather on whether you consider the capability to cast to Map<String, Object?> as exposing the format

I do, unless we explicitly document that you should not depend on the underlying representation object's type. But I'd rather just have real classes than extension types. It gives much more control.

jakemac53 commented 3 months ago

Looking over @johnniwinther's metadata annotation API, it seems to me that extension types would not work well for this case; in particular, the code method might lose information if called with a static type that is less specific than the actual wire type.

There might be other similar situations like that (TypeAnnotation comes to mind), which justify using actual wrapper classes (could still just be wrappers around byte views though).

@davidmorgan have you thought about dart_model with respect to metadata annotations and code in general?

davidmorgan commented 3 months ago

The extension types expose exactly what's in the schema; we can add whatever code we want into the extension types, so we can add whatever convenience methods we want next to the underlying representation.

It is a non-goal to hide the underlying representation. The problem with hiding the underlying representation is that it's a trap for maintainability. Any smarts you add via code gets compiled into the client, and you don't control which version the client is compiled against, so you can't trust it / evolve it.

This is precisely the "no wrapper libraries" guidance, which is a non-negotiable rule when building any Google service. What we're building isn't a Google service, so we don't have to follow it, but I have yet to hear any reason why it's different--why we think best practices don't apply here. We could maybe discuss in the language meeting :)

I do, unless we explicitly document that you should not depend on the underlying representation object's type.

When you have serialization, people can do anything. They don't even have to code in the same language. It doesn't really make sense to talk about restrictions on code. What you have is guarantees about stability of the wire format. That's all you can actually guarantee, and it's a very fine guarantee--it works incredibly well.

If people do silly things in the code--casting to types we haven't documented, depending on /src/**--then the usual thing happens: we're allowed to break them with a new version. But their code will continue to work if compiled against the old version and then run against a new version of the host, as long as they don't rely on new functionality. And that's pretty great.

lrhn commented 3 months ago

We're not providing a Google service. We're not providing a service at all.

We are providing a way to interoperate with the macro host of the current SDK tool that is executing the macro code, running macro code that is compiled against the same SDK version.

This is more like cross isolate communication inside the same VM, than it is a public service that existing compiled programs are run against.

What I would do is to let dart:macros be the library that these macros are written against, and make sure that that library doesn't introduce breaking changes in non-major SDK version increments. (It may have to be language versioned, if we do breaking changes in major version increments.)

You're writing against a Dart API, the feature is a library, just like everything else. The underlying communication with the macro host can, for all we know, be done using the VM's serialization service that sends actual Dart objects between isolates. Or JSON serialization. The protocol for communicating between the isolates doesn't matter, that is an implementation detail that the user should never care about.

We're not going to be running multiple client versions against the same service, they're all compiled against the same dart:macros, which is the one that works with the tools of that SDK version.

Anything more than this is unnecessary complication.

davidmorgan commented 3 months ago

We're not going to be running multiple client versions against the same service, they're all compiled against the same dart:macros, which is the one that works with the tools of that SDK version.

That's simply not true, however much we would like it to be true. We have already broken google3 and Flutter via the macro experiment as a result.

Introducing a new tight version constraint in the build is a breaking change that we could decide to make and force through, sure. Stepping back a bit, though, code generators and the SDK are too tightly coupled. They are a maintenance nightmare as a result. We need macros to be less tightly coupled.

Approaches to solve this type of problem in code instead of in data have been tried, and have failed--I've been on such projects personally, and I now count macros as a member of that list.

I think it might be that you are concerned that the resulting API from the protocol approach is unpleasant to use, or leads to awkward code. On the contrary, from the experiments I think it is great. The code itself is, in my opinion, marginally better than you are likely to get without an underlying protocol, because it's more uniform and based on concepts that are forced to be as simple as possible. With the quality of life improvements you get with a protocol, the overall package is considerably better than you get without a protocol.

I wonder if you've seen the repl demo running? Maybe 50% of the time spent working on a generator is asking "which part of this huge complex API do I need to ask to get the information I need". With the repl you can just change what you're interested in in the source, and have it immediately shown as a delta in the human readable wire format. Then you can directly translate that into code. This is a fantastic developer experience.

The other thing that takes a lot of developer time when maintaining a generator is end to end testing. The capability to talk about code and changes to code using data should be incredibly helpful.

I've been thinking about this problem for more than a decade, and I've always hoped Dart would use this approach to really open up and solve the rather complex problem space. I'm enthusiastic to have a chance to help make it happen :) and hope we can address all--valid!--concerns before too long and in any case well before launch.

Thanks.

lrhn commented 3 months ago

That's simply not true, however much we would like it to be true. We have already broken google3 and Flutter via the macro experiment as a result.

Are we combining code compiled by different SDK versions into the same program? (I would not expect modular compilation artifacts created by different SDKs to be guaranteed to be compatible, so they should be recompiled on each SDK upgrade. Otherwise they might be referring to things we've removed, or assume incorrect types of something that has changed.)

Was that breaking due to the macros API itself changing in a breaking way? Because that is probably happening while we are developing that library, but it shouldn't happen after launch.

I pretty much insist that it doesn't happen. If code works with SDK 3.7.0, then the same code should still work with SDK 3.8.0. That's the minimum required compatibility requirement that is acceptable. Any exception should go through the breaking change process, even if it's only breaking macros.

I am not introducing any tight version constraint. There is exactly one version constraint, on the SDK, which is a lower bound. If a macro works with Dart SDK 3.7, it should work with any Dart SDK with version 3.X.0, X >= 7, so that any code which uses that version of the macro would keep working on SDK 3.X.0. That's what the SDK constraint of sdk: ^3.7.0 represents, and it should be enough.

I am introducing a strong backwards compatibility promise on the API. But by only doing it at the API level, we can change the underlying implementation if a change to the language makes that necessary.

The moment we try to introduce a separate version for something, especially something which is strongly tied to the compiler, we can easily descend into version hell. That's why I prefer to just not go there. Trying to solve version hell with a versioned wire format is solving a self-inflicted problem.

It does require the APIs to be very well thought out to be able to incorporate future language changes in a backwards compatible way. Maybe it requires more forethought than possible, and that is the best argument against this approach. It's definitely not just "make a model of the language today", it requires clear abstraction lines, ways to not expose information that the client cannot handle, etc.

I'm not convinced that that problem is significantly different whether versioning a Dart API or a JSON wire format. They're making the same data available and should allow the same queries to be asked, but by exposing the wire format we lose the ability to enforce constraints on the data, and we'll have to validate instead. And we leak the internal communication protocol.

I'm not against making the internal protocol available for tooling, maybe through dart:developer; we should have good tools for people building macros. But I don't want to let them into the machine room by exposing implementation details.

And I'm not convinced that versioning is a problem best solved by introducing more versions to worry about. Keeping it at one is optimal. If possible. I'm not convinced it isn't. Using a more complicated, but more general, tool to solve a complicated problem is reasonable. But I don't have your experience, so I can't see why this problem has to be more complicated. (And my design principles are usually "simplify, simplify and simplify", so anything that looks superfluous to the problem at hand is worrisome.)

I wonder if you've seen the repl demo running?

I have not. I should really start building macros, not just writing about them. :)

davidmorgan commented 3 months ago

That's simply not true, however much we would like it to be true. We have already broken google3 and Flutter via the macro experiment as a result.

Are we combining code compiled by different SDK versions into the same program? (I would not expect modular compilation artifacts created by different SDKs to be guaranteed to be compatible, so they should be recompiled on each SDK upgrade. Otherwise they might be referring to things we've removed, or assume incorrect types of something that has changed.)

So I think there are three types of concern here.

In google3 we found a place where, indeed, there is version skew between libraries and the CFE binary. The details don't matter too much, but this is something that easily happens, for example in bootstrapping, or if you need multiple SDK versions for some reason already. (Flutter and not Flutter, ...). It's helpful to allow a very small version difference; a new requirement for an exact match is extra work.

Then there is the analyzer versioning problem: the analyzer library now pins to the macro library which pins to the SDK version, but various things want to also have a say about the analyzer version, most obviously generators. https://github.com/dart-lang/sdk/issues/55870 is the issue for the extra work caused by the mismatch between Flutter's approach to versioning (pin everything!) and the new requirement on the analyzer.

And finally there is a more general concern that because macros are binaries it will be useful to have them keep working; whatever we or others end up doing with them, having them continue to do what they used to do when there is a new SDK release will be a nice feature to have.

Was that breaking due to the macros API itself changing in a breaking way? Because that is probably happening while we are developing that library, but it shouldn't happen after launch.

One issue is that macros currently don't use a stable serialization format, so they can't cope with version skew even for changes that are code compatible.

And the other, yes, is breaking API changes: not to the user-visible API but to the analyzer/CFE-visible internal API.

Either an incompatible API change or a compatible API change that causes a serialization change requires a release that might not version solve (yet; requiring publishing or an update to a pin).

I pretty much insist that it doesn't happen. If code works with SDK 3.7.0, then the same code should still work with SDK 3.8.0. That's the minimum required compatibility requirement that is acceptable. Any exception should go through the breaking change process, even if it's only breaking macros.

I am not introducing any tight version constraint. There is exactly one version constraint, on the SDK, which is a lower bound. If a macro works with Dart SDK 3.7, it should work with any Dart SDK with version 3.X.0, X >= 7, so that any code which uses that version of the macro would keep working on SDK 3.X.0. That's what the SDK constraint of sdk: ^3.7.0 represents, and it should be enough.

I am introducing a strong backwards compatibility promise on the API. But by only doing it at the API level, we can change the underlying implementation if a change to the language makes that necessary.

The moment we try to introduce a separate version for something, especially something which is strongly tied to the compiler, we can easily descend into version hell. That's why I prefer to just not go there. Trying to solve version hell with a versioned wire format is solving a self-inflicted problem.

It does require the APIs to be very well thought out to be able to incorporate future language changes in a backwards compatible way. Maybe it requires more forethought than possible, and that is the best argument against this approach. It's definitely not just "make a model of the language today", it requires clear abstraction lines, ways to not expose information that the client cannot handle, etc.

This is the tough part :)

Both analyzer and CFE have experience modelling the Dart language, and @scheglov and @johnniwinther both seem to be strongly of the opinion that any modelling you choose will be broken at some point :) and my response is that we should simply build with that as part of the plan.

This also frees us from the temptation to aim for both complete and correct in v1.

I'm not convinced that that problem is significantly different whether versioning a Dart API or a JSON wire format. They're making the same data available and should allow the same queries to be asked, but by exposing the wire format we lose the ability to enforce constraints on the data, and we'll have to validate instead. And we leak the internal communication protocol.

I'm not against making the internal protocol available for tooling, maybe through dart:developer; we should have good tools for people building macros. But I don't want to let them into the machine room by exposing implementation details.

Here is where I think our experience leads us in different directions.

For me the strength that data brings is that it's a lot easier to talk about data than it is to talk about code. Code is more powerful, and as a result it can cheat--in ways that work today but will cause problems tomorrow.

A concrete example is that the current protocol uses identifiers that map to object identity on the host; if you round trip an object with id = 1 then it gets deserialized to the original instance, which can be any subtype and carry any additional data. This pulls meaning out of the wire format and fundamentally limits what you can do with the data.
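A tiny sketch of the pattern being described, with invented names, to show why it bothers me: the wire data carries only an id, and all of the meaning stays in the host process it points back into:

```dart
// Sketch of the identity-based pattern described above (names invented).
// The wire format only carries {'id': 1}; everything meaningful lives in the
// host-side registry, so the data cannot be interpreted, diffed or replayed
// without that exact host process.
class HostRegistry {
  final _objects = <int, Object>{};
  int _nextId = 0;

  Map<String, Object?> serialize(Object o) {
    final id = _nextId++;
    _objects[id] = o;
    return {'id': id};
  }

  // "Deserializing" just looks the original instance back up, with whatever
  // subtype and extra state it happens to have.
  Object deserialize(Map<String, Object?> wire) => _objects[wire['id'] as int]!;
}
```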

You can think of ways of limiting what the code is allowed to do, but I think in the end this ends up being equivalent to defining the data format instead of the code.

Possibly also based on our different experience :) I don't see anything negative about choosing to focus on a data format over a code API. Just generally, I'm happier when a company provides a serialization protocol for something I want to use rather than a library ... because the serialization protocol is usually more accessible (e.g. works with more languages) and better maintained (e.g. works for longer).

And I'm not convinced that versioning is a problem best solved by introducing more versions to worry about. Keeping it at one is optimal. If possible. I'm not convinced it isn't. Using a more complicated, but more general, tool to solve a complicated problem is reasonable. But I don't have your experience, so I can't see why this problem has to be more complicated. (And my design principles are usually "simplify, simplify and simplify", so anything that looks superfluous to the problem at hand is worrisome.)

So I think we've spent a lot of time asking whether we can launch, on the first try, a way to describe the Dart language that won't need to change incompatibly for any future change to the language.

And I think that has been expensive :) and has led to us saying things like "well ... we just won't make breaking changes" and nobody being really happy with that.

And fundamentally I think it is an awkward design. Macros talk about Dart code. Talking about Dart code is something that isn't tied to the language version or SDK version; there can be a hundred different macro APIs, from fifty different authors, in any combination of implementation language and data formats. Restricting to exactly one is rather like having exactly one editor that we maintain, tell everyone to use, and don't let anyone switch away from.

Hopefully we are the best people to build an API over Dart code :) since we anyway own the main host implementations. But, I don't think we can necessarily get it right first time, and I don't think we want to have "don't force an incompatible change for macros" as a design constraint on future language changes, either.

I wonder if you've seen the repl demo running?

I have not. I should really start building macros, not just writing about them. :)

More fun things to come in the near future, with any luck; I'll be on vacation for a week, though :)

Thanks.