aaubry / YamlDotNet

YamlDotNet is a .NET library for YAML
MIT License
2.48k stars 466 forks source link

Adding async methods to serialization API #908

Open DanielBeckwith opened 4 months ago

DanielBeckwith commented 4 months ago

This draft is to bring up the question of adding Async versions of the Serialize and Deserialize APIs, and all of the complexity that that will entail. This was brought up originally in issue #430. So far my branch only contains more or less what the closed pull request #457 had.

Before I go any further I want to make sure that I address a series of architectural questions and concerns (which I'll put in the following comment) that might block the implementation due to how deeply the code will have to be modified.

DanielBeckwith commented 4 months ago

There are several questions that I ran into right away.

First is that while adding the appropriate SerializeAsync and DeserializeAsync methods to the Serializer and Deserializer respectively is an expected change, their current setup will require *Async versions of methods in many interfaces. This includes (but isn't limited to) IEmitter, IObjectGraphTraversalStrategy, IObjectGraphVisitor, and IValueSerializer, which is a breaking change. Given that these interfaces (and others like them) are routinely used by library users to personalize the de/serialization process to meet their needs, this might become quite a burden and prevent them from updating to the newest version. An alternate solution could be providing a default implementation of the *Async method that just calls the synchronous version, but I think that could lead to some hard-to-catch foot-guns. Any guidance or advice on this would be helpful and welcome.

Second, following the original comments in pull request #457 before it was closed (and my own inspection of the code), it seems unavoidable that at least the Emitter will have to change. The changes in that class will revolve around the StateMachine method. That method handles the logic of serialization and writing the results (through the Write and WriteBreak methods). To make that work with the asynchronous API, I see two paths: refactoring or duplicating the code with a StateMachineAsync method. For the refactoring, my preliminary thought was to change StateMachine (and the functions that it calls) into a generator function (returning an IEnumerable<string>) that yields the results, and moving the Write (or WriteAsync) calls into a loop in the respective Emit method (effectively moving the responsibility for writing the results from StateMachine to Emit and EmitAsync). That seems like a rather large, invassive, and unclean change, though, with unknown performance implications to boot. The other option, duplicating the code into a StateMachineAsync version, carries all the normal maintainability issues that duplicate code has. Is there a preference for either option? Or a better one that I didn't think of?

Third, and most tricky as far as I can tell, is the ValueSerializer class. From what it looks like, the SerializeValue method and Emitter infrastructure (including logic that uses the Emitter like the Object Graph Traversal Strategy) needs to have an additional Async version, which seems like a pretty deep and annoying break to the API. However, I'm not well-versed enough in the library to know what's necessary or if I'm overthinking this.

Lastly, is it preferred to have the asynchronous methods return Task<T> or ValueTask<T>? Task is the default choice for asynchronous methods and (as far as I can tell) is supported by every build target. ValueTask is used to (hopefully) prevent unnecessary allocations (like in the built-in System.Text.Json API), but is (as far as I can tell) not supported by every build target for this library, and comes with some more complicated usage constraints that might lead to bad behavior. I went with Task in my branch and lean towards it myself, but I figured I'd ask early in case there was a preference for going the most performant path over other concerns.

While I haven't added any tests of substance in the branch yet, I plan on finishing those once the bigger issues are taken care of.

Of course I welcome any comments, concerns, questions, etc, including those that I haven't touched on.

DanielBeckwith commented 4 months ago

The commit I just made updates the Emitter to use IEnumerable<string> generators internally instead of directly writing to the output so that the async path can reuse the logic (though I believe this results in a little bit of a performance hit). It passes the tests, but the changes hurt readability. I welcome suggestions for a more elegant solution.

EdwardCooke commented 4 months ago

One idea to help reduce the breaking change on the supporting interfaces like the value serializer. Create 2 interfaces. The first would be ivalueserializerbase and would have nothing in it. The old interface would inherit it. The new interface would also inherit it. The new interface would have the async methods. In the builders you change the method signatures and everywhere else to take the base interface. Where the current interface is called you would do a check. If the type is the old one, call the synchronous method, else call async. Hopefully that makes sense. It’s how I’ve done changes like this in the past and it worked well.

DanielBeckwith commented 4 months ago

I've been going around in circles with implementing this, primarily because there's no safe way to call async methods in synchronous methods without risking a deadlock (if there is and I've missed it, I would love to be wrong). As a consequence, it's not safe to allow the user to register, for example, an IAsyncEventEmitter in the SerializerBuilder, because the user could then use the synchronous serialization path and get a surprise deadlock (which could even potentially be non-deterministic depending on the synchronization context). Because of that, I think I'm going to have to split the API into a synchronous side (which is just the API as it is now) and an asynchronous side (which can handle synchronous and asynchronous versions of most, but not all, interfaces) from the builders all the way down, expanding existing implementations in the library as necessary and possible (for example, the existing modifications that I did for Emitter, with the only addition being implementing a new IAsyncEmitter interface). For example, an AsyncSerializerBuilder that accepts both IEventEmitters and IAsyncEventEmitters (and other customizable parts that async makes sense with), and Builds an IAsyncSerializer.

zivillian commented 3 months ago

Given that these interfaces (and others like them) are routinely used by library users to personalize the de/serialization process to meet their needs, this might become quite a burden and prevent them from updating to the newest version. An alternate solution could be providing a default implementation of the *Async method that just calls the synchronous version, but I think that could lead to some hard-to-catch foot-guns.

I would prefer extending the existing interfaces and create abstract base classes, which just call the synchronous versions. These can then be used by library users to easily upgrade without being required to implement all async methods. The base classes should not be used inside the YamlDotNet library and may also be marked as obsolete to clearly state that this is not the suggested way, but only a workaround to postpone the required changes.

EdwardCooke commented 3 months ago

Have you looked in to how system.text.json handles this?

DanielBeckwith commented 3 months ago

@zivillian Thank you for the input and the suggestions. I got partially finished with another commit that addresses some of that, but it's been a while since I've touched it, so it'll be a while before I can complete and push it.

@EdwardCooke I looked into the source code of System.Text.Json (and plan to study it more) to see how they do it, but not much of it is transferable due to the vastly different infrastructure/internals.

The main infrastructure issue is that the Deserializer is a direct data pipeline from the Stream, to the components (NodeDeserializers, Converters, etc), to the completed object (and the reverse pathway for the Serializer). Because of that, if even one of the configurable components in the chain uses the synchronous path (e.g. if I included a default implementation of a DeserializeAsync method in the INodeDeserializer interface that is implemented as return Task.FromResult(this.Deserialize(...));, which ultimately uses the IParser class to interface with the Stream), the entire call chain will stop being asynchronous and will be locked into the synchronous path. That essentially makes the async path useless for anyone who customizes even one component without implementing the *Async version of the method. We'd have to reengineer how most of the components work in order to bypass this issue.

System.Text.Json appears to bypass this issue by effectively (and efficiently?) buffering the current stage with Utf8JsonReader (for deserialization) and Utf8JsonWriter (for serialization), but I haven't had time to figure out if that's possible to emulate in this library. If it is, it'll probably be by modifying the implementations of IParser and IEmitter (respectively) that are provided to the components so that the sync/async reading/writing is handled outside, in the method that provides them to the components.