Closed droundy closed 3 months ago
Hi!
Thanks for opening this issue.
The need for a ~'static
Writer is something I have considered, and imagined a design for. However I figured it would be built on top of the current API, (and probably outside of serde_avro_fast
until that shows to be a somewhat common need).
Here are the key elements why I like the current design for the low-level API:
&mut SerializerConfig
makes buffer reuse work very naturally and easily even across errors. It makes the most natural writing also the most performant. If the buffers were owned by the serializer that would be significantly more boilerplate-ish, in particular for single-datum functions, and in particular if handling errors (see the usage docs before 25384fabaa3cdf3e9dfe2614199a9464e8e4eda7).I agree that for the writer this is less relevant: instantiating a writer is not meant to happen too often, so buffer reuse across writer instanciations is less critical. (That being said I do like that it's available.)
Now since the Writer
is built on top of the same SerializerState
that takes in &mut SerializerConfig
until very deep in the serialization module (which I don't think we want to change because it looks like that would significantly worsen the low-level API), and the object_container_file_encoding Writer contains the SerializerState
, for the writer to not depend on SerializerConfig
, it would have to hold the SerializerConfig
in addition to the &mut SerializerConfig
, so the design would be to have it be self-referential.
Now if we do this, because self-referential is otherwise very non-trivial to get sound, this means we probably would want to use the ouroboros
crate for this.
So doing it in serde_avro_fast
itself would have two downsides:
ouroboros
crate to our dependencies, which is by nature not going to feel like the cleanest, most stable or most reliable of dependencies (especially considering that there is still active work in the compiler around self-references support), and I also fear that it might be heavy (with the proc macro, the fact that it would need to always allocate even if using the outside config...)Writer::with_inner_config()
, which I fear may become more natural to use than its slighly better counterpart that takes a ref to the config (although with the main one being called new
and this one being called with_inner_config
it would probably be ok).Feature-flagging it looks like it solves both these issues, but maybe at the cost of maintaining a part of the writer code twice with significant differences conditionally on whether the feature is enabled (due to the necessity of doing with_writer_mut(|writer| ...)
with ouroboros
, and also that would leave us with a dangling feature flag if we move out from ouroboros
.
If a user needs to do this, it looks like they could implement it themselves using ouroboros
:
https://github.com/Ten0/serde_avro_fast/blob/19ff886f2668fbed8e2bdf0bed45000e19235741/serde_avro_fast/tests/self_referential_writer.rs#L13-L19
(this seems reasonably non-boilerplate-ish as long as their writer has a stub value)
So overall it looks like the feature would make sense, it's probably ok to have, but I'm unclear on whether this should be a feature flag or not, and whether this is a common enough use-case that we want to do the implementation as part of serde_avro_fast
instead of letting the user manage the self-referential part themselves.
What do you think?
Thanks for pointing me at ouroboros. I've been skeptical of that crate (and the other self-referential structure crates, but maybe that's a way for me to move forward with switching to serde_avro_fast
, which I'd like to do.
In terms of future directions, I'd tend to suggest (for next breaking change) to somehow separate the buffer from the config, and pass the buffer as a separate argument rather than storing a reference internally. It sounds like a somewhat painful change, but in general holding a &mut
within a struct
is quite problematic, so if I were you I'd go far out of my way to avoid that.
Using ouroboros
internally does not sound to me like a great plan, even if it is easy. Surely it wouldn't be that hard to at least make a non-exported API that takes a separate &Schema
and &mut Buffers
as inputs, so the essential code could be shared between both paths. And then maybe an ordinary Config
that only holds configuration and not actual buffers, and could even be Copy
.
I'm struggling to get a clear idea of the design you're suggesting.
I'd tend to suggest (for next breaking change) to somehow separate the buffer from the config, and pass the buffer as a separate argument rather than storing a reference internally [because] in general holding a
&mut
within a struct is quite problematic
Do you mean owned in the low level API (in SerializerState
), as was done before 25384fabaa3cdf3e9dfe2614199a9464e8e4eda7? If so, doesn't it look like that would significantly worsen the API around buffer reuse that was updated in that commit?
If the buffers were owned by the serializer that would be significantly more boilerplate-ish
takes a separate &Schema and &mut Buffers as inputs
Or do you mean mutably borrowed in the low level API (in SerializerState
)? If so, I don't understand how that solves the issue we currently have with &mut SerializerConfig
. It looks like we'd just have the same issue with &mut Buffers
as we currently have with &mut SerializerConfig
, only the user would have both SerializerConfig
and Buffers
to manage as part of the public API.
Specifically, how one writes this line: https://github.com/Ten0/serde_avro_fast/blob/f13550699d434ca21251591fc452d8e8ea5adab9/serde_avro_fast/src/object_container_file_encoding/writer/mod.rs#L491 if the buffers (or config) are meant to be stored owned in the Writer
but SerializerConfig
holds &mut Buffers
/&mut SerializerConfig
is the issue in both cases.
Alternatives I see are:
SerializerState
at every call to Serialize (sad because that means either we got an additional deref level at every write (if we use &mut Vec<u8>
as W
) or we need to std::mem::swap()
the full Vec<u8>
every time, and swap it back at the end, probably using some PutBack
kind of struct with a destructor so that it's put back no matter the way we exit => heavier and uglySerializerState
that says whether it owns the buffers or &mut the buffers -> I'm not sure which is best between the branching at every usage + extra struct size or having the enum variant for the owned version contain a Box
here, either way feels a bit dirty in the SerializerState
, but maybe that's the most practical approach.Yeah the more I think about it the more I think the best option for this would be to have SerializerState
contain
enum SerializerConfigRef<'c, 's> {
Borrowed(&'c mut SerializerConfig<'s>),
Owned(SerializerConfig<'s>),
}
(possibly boxed in the owned case)
My use pattern involves creating a new
Writer
every minute (for a new file), and each writer remains open for two minutes so we can write to the "old" minute if some delayed messages come in.I'm finding the lifetimes for this very challenging to manage with the pattern of requiring a
&mut SerializerConfig
. I can appreciate the value of reusing buffers when serializing a single datum, but creating the buffers once per minute would be no problem ifWriter
were to own its buffers rather than borrowing them.I don't have a great idea for an elegant API change unless you were willing to go all the way and make
Writer
always own its buffers, which would be a breaking change.