dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.28k stars 4.74k forks source link

IBufferWriter<T>.GetSpan, Advance, then write to original span (before flush)? #28279

Closed AArnott closed 4 years ago

AArnott commented 5 years ago

Is it considered fair play to use this pattern:

void WriteArrayWithLengthHeader<T>(IBufferWriter<T> writer, IEnumerable<T> sequence)
{
  // reserve space for the header
  int lengthSpan = writer.GetSpan(4);
  writer.Advance(4);

  int count = 0;
  foreach (T element in sequence)
  {
    count++;
    Write(writer, element);
  }

  BitConverter.GetBytes(count).CopyTo(lengthSpan);
}

Notice how I request a span and then Advance past it, so that more content can be written after it, but then go back and write to the original span. Depending on the IBufferWriter<T> implementation, this might work. But I wonder what the interface design would dictate about whether implementations are required to support such a pattern, and thus whether I can depend on all implementations supporting it, or if writing to a span after Advancing is considered not supported since the memory might get consumed or recycled immediately after advancing?

Note that if this IBufferWriter<T> comes from a PipeWriter<T>, no one has called FlushAsync in this scenario (yet).

I'm looking for an answer to the question, and also suggesting that we keep the issue open till the docs (which are basically empty at the moment) are updated to include the answer.

benaadams commented 5 years ago

Yes, Advance doesn't commit the bytes. Note: WriteAsync does call FlushAsync and commit the bytes

/cc @davidfowl @pakrym

pakrym commented 5 years ago

This pattern is not recommended, it happens to work with current implementation but I'm not sure if we want customers relying on it.

Unfortunately this means we don't have a good solution for preallocating write space for now.

AArnott commented 5 years ago

Unfortunately this means we don't have a good solution for preallocating write space for now.

Ya, this seems like a very reasonable requirement, particularly when these APIs were developed to minimize memory copying, since the only alternative to this approach when you can't predict the length of the data is to write out the data to a temporary buffer, then assess its length, write out the length, then copy the temp buffer over to the actual memory.

it happens to work with current implementation but I'm not sure if we want customers relying on it.

This is concerning, since I suspect it's a pretty common requirement and it works -- so if you were to change something such that this breaks, it seems likely you'd break folks. I think we should either break or embrace this pattern in fairly short order, before many more customers hit this.

You say the "current implementation" but there are multiple implementations already -- it is just an interface, after all. I feel like the interface itself should provide a way to do this so it's useful without knowing that the concrete time is (or isn't) PipeWriter.

Personally, I think the way I describe in the issue description is a very natural way to use it, and would love to see this simply adopted as the supported way.

ahsonkhan commented 5 years ago

Unless we break the interface by adding some new API like Flush, I don't think the scenario can be supported.

Once you call Advance, you have signaled to the implementation that you have written some amount to the output (which can then be flushed, or the cursor can safely move). If we don't want Advance to signal that, we need some other API to signal that we have actually written some amount.

Here are some (less than ideal but functional) options for your use case while keeping the interface as is:

the only alternative to this approach when you can't predict the length of the data is to write out the data to a temporary buffer, then assess its length, write out the length, then copy the temp buffer over to the actual memory.

Not always (depending on your scenario). For example, in the Utf8JsonWriter, we have two cases: 1) Single transformation of data before writing (eg transcoding from utf-16 to utf-8): Try to write the transformed data (partially) in a loop to the output span that you received from the IBufferWriter and if you run out of room, Advance and ask for more space. 2) Multiple transformation of data before writing (eg escaping the string, and then transcoding from utf-16 to utf-8): We end up needing a temporary buffer in this case anyway and the IBufferWriter doesn't add any overhead. Do the n-1 transformations and follow option 1 for the last transformation.

For WriteArrayWithLengthHeader, we have a similar scenario in Utf8JsonWriter (i.e. WriteArray) but we have overloads of the API for specific Ts which have known max sizes (like numeric types, Datetime, Guid, etc.) so we generally know the maximum possible size (element count * max element size).

AArnott commented 5 years ago

Once you call Advance, you have signaled to the implementation that you have written some amount to the output (which can then be flushed, or the cursor can safely move). If we don't want Advance to signal that, we need some other API to signal that we have actually written some amount.

And yet the PipeWriter has a separate FlushAsync method which has this meaning. Why are we applying that meaning to both IBufferWriter<T>.Advance and PipeWriter<T>.FlushAsync?

This API hasn't been documented yet, I think. I wonder if it's not too late to apply some changes to the originally intended meaning in order to salvage this scenario and/or avoid folks depending on something they shouldn't.

ahsonkhan commented 5 years ago

And yet the PipeWriter has a separate FlushAsync method which has this meaning. Why are we applying that meaning to both IBufferWriter<T>.Advance and PipeWriter<T>.FlushAsync?

I can't really speak to the design of PipeWriter (in relation to IBufferWriter) intelligently, but it seems odd to force a change to an interface based on the details of a single implementation of it.

This API hasn't been documented yet, I think. I wonder if it's not too late to apply some changes to the originally intended meaning in order to salvage this scenario and/or avoid folks depending on something they shouldn't.

If we change the semantics of Advance, then all implementations of IBufferWriter would need to implement an equivalent of Flush. So, a documentation change is likely not sufficient.

cc @davidfowl, @KrzysztofCwalina

benaadams commented 5 years ago

I was going on this part of the comment

Note that if this IBufferWriter<T> comes from a PipeWriter<T>, no one has called FlushAsync in this scenario (yet).

AArnott commented 5 years ago

If we change the semantics of Advance, then all implementations of IBufferWriter would need to implement an equivalent of Flush. So, a documentation change is likely not sufficient.

Only those implementations that actually would do anything for it. My own implementation of this interface wouldn't need a Flush, for instance, because everything always stays in memory.

AArnott commented 5 years ago

Can we please document the intended behavior of IBufferWriter<T>.GetSpan when called multiple times in a row? For instance, if I ask for 4 bytes, then ask for 20 bytes, what happens to the 4 byte buffer? Is it potentially discarded, or assumed to still be a source for data in a subsequent call of Advance?

ahsonkhan commented 5 years ago

Only those implementations that actually would do anything for it. My own implementation of this interface wouldn't need a Flush, for instance, because everything always stays in memory.

In that case, what should be the semantics of Advance, if it is not indicating some movement of an underlying cursor?

Otherwise, I agree that the contract needs to be documented (including the case you mentioned above).

AArnott commented 5 years ago

In that case, what should be the semantics of Advance, if it is not indicating some movement of an underlying cursor?

To reserve that many bytes in a buffer previously returned from GetMemory/GetSpan, so that a subsequent call to get a write buffer will not return that same buffer again.

omariom commented 5 years ago

As I understand it, the question is what operations invalidate previously returned spans. In both, PipeWriter and IBufferWriter.

The answer is a part of the contract and must be reflected in the documentation.

AArnott commented 5 years ago

From observation, the only IBufferWriter<T> operations that invalidates returned spans in the PipeWriter implementation is requesting a subsequent span with a larger size than the size of the previously returned one. Advancing doesn't invalidate. Requesting a span of 10 (and getting 2K in return) followed by another request for anything <= 2K will return the same span. Requesting >2K will get a new span, after which the original span is ignored (although it's included as a 0-length contributor in the ReadOnlySequence<T> made available to the PipeReader later on.)

The answer is a part of the contract and must be reflected in the documentation.

I wholeheartedly agree. And I find the current behavior in PipeWriter to be rather non-intuitive, at least without documentation.

davidfowl commented 5 years ago

@AArnott FWIW I agree and we had long debates about this. The last debate was that it wouldn't be possible to write an IBufferWriter that transforms the data in anyway this becomes a problem (like a compressing IBufferWriter that compresses data in Advance). I'm not sold this is a valuable enough scenario to stop the length prefix one (I like this change alot and we've discussed it before for this very reason) but we need to decide if it's important or not.

AArnott commented 5 years ago

Thanks, @davidfowl. I just tried (and utterly failed) to apply IBufferWriter<T> to MessagePack (along with ReadOnlySequence<T> support). The API and internals of the library are now nicer, but the perf tanked badly, even with a trivial IBufferWriter<T> implementation that did nothing but return a sliced array from GetSpan and increment an integer in Advance. I think it was all the interface dispatches where MessagePack previously just passed the buffer and index around to every method. It's a highly tuned library, and my desire to just use IBufferWriter<T> "everywhere" was a mistake. I was using the interface for every single little write operation. I'm now planning to revert many of my changes and instead pass around a buffer, an index, and an IBufferWriter<T> that we'll now use only when we run out of buffer.

What has that to do with this topic? Well, the idea of making Advance do anything other than a very quick operation sounds like a bad idea, since consumers may call Advance very frequently with 1-4 bytes of data.

davidfowl commented 5 years ago

It's a highly tuned library, and my desire to just use IBufferWriter<T> "everywhere" was a mistake. I was using the interface for every single little write operation. I'm now planning to revert many of my changes and instead pass around a buffer, an index, and an IBufferWriter<T> that we'll now use only when we run out of buffer.

Yes, we had the same experience with Kestrel and ended up writing this type https://github.com/aspnet/AspNetCore/blob/54021a8bc9282357d843281802edfb315997a10a/src/Servers/Kestrel/Core/src/Internal/BufferWriter.cs#L8 which seems like it might be useful here.

cc @KrzysztofCwalina @ahsonkhan

What has that to do with this topic? Well, the idea of making Advance do anything other than a very quick operation sounds like a bad idea, since consumers may call Advance very frequently with 1-4 bytes of data.

I think you might be right.

pakrym commented 5 years ago

since consumers may call Advance very frequently with 1-4 bytes of data.

They may but they shouldn't if they care about performance. It's the same as calling Stream.Write with 1-4 bytes, the overhead of a single operation is just to high.

@davidfowl failed to mention another more important problem with allowing IBW consumers to keep references to memory past advance.

Consider the following example:

var writer = new SomeLenghtPrefixSerializer(response.BodyPipe);
foreach (var o in objects)
{
   writer.WriteObject(o);
   // Is this safe to do or is writer still holding onto memory?
   await response.BodyPipe.FlushAsync();
}

It's impossible to know if SomeLenghtPrefixSerializer had stored some memory away to write a length prefix later on and if FlushAsync would be safe or would cause memory to be returned to pool while still being referenced and end up corrupted in the future.

AArnott commented 5 years ago

@pakrym Your code sample leaves much to be imagined as far as the types used here. But folks can't call FlushAsync if they only have an IBufferWriter<T>. If I were using the length prefixing technique, I would not pass my PipeWriter to someone that might call FlushAsync on it, since I know the PipeWriter type's documented behavior for FlushAsync makes the buffer available to be read. But I would consider it safe to pass it to a method that takes an IBufferWriter<T>, since that method presumably won't downcast just to call FlushAsync on it.

davidfowl commented 5 years ago

I'd argue that we care about what Advance does not FlushAsync, those are unrelated things. The above is misuse (albeit hard to catch). We're talking about synchronous writes, not about what happens outside of those method calls. If you get rid of the memory the serializer is using during serialization then its already broken and that's not new, nor is it worse than what exists today. Consider this:

var writer = new MyWriter(response.BodyPipe);
writer.WriteInt(1);
await writer.FlushAsync();
writer.WriteInt(3);

In the above, if the underlying writer called Flush between writes everything would be broken. Presumbaly, MyWriter called GetMemory and stored it in a field, and when the caller called FlushAsync that memory went away without them knowing.

pakrym commented 5 years ago

But that's a very easy misuse to run into. If we say that memory is valid past advance no-one would make sure not to store it outside of a single method call.

AArnott commented 5 years ago

I don't think I agree. When I apply the length prefix trick, I understand it's a trick and I localize it in a single method. I can't imagine when/how I would split it apart across methods and not feel like my situation was very fragile.

pakrym commented 5 years ago

When I apply the length prefix trick, I understand it's a trick

You understand how everything works in this case and that's why would avoid using pattern that breaks but customers might not especially if we give them not so clear guidance.

I can't imagine when/how I would split it apart across methods and not feel like my situation was very fragile.

Lets look at Utf8JsonWriter usage pattern:


var writer = new Utf8JsonWriter(pipe.Writer);
writer.WriteStartObject();
writer.WriteProperty();
writer.WriteValue();
writer.WriteEndObject();

Now imagine that this writer needed length prefixing before every object, it's very very easy to write implementation that would capture memory for too long:

var writer = new MyNewWriter(pipe.Writer);
writer.WriteStartObject(); // preserver memory
writer.WriteProperty();
writer.WriteValue();
writer.WriteEndObject(); // write lenght prefix to that memory
davidfowl commented 5 years ago

One of the fundamental differences about how this interface differs from the something like Stream is that it gives you access to the buffers that it owns. There are very nice properties about this as it means the caller doesn't have to manage their own buffers (which was the intent). At the same time, you end up with a lifetime problems similar to Span and Memory. Consider:

void WriteSomething(IBufferWriter<byte> writer)
{
    var span = writer.GetSpan(); // Returns a 4K buffer
    span[0] = 1;
    writer.Advance(1);

    // Is span[0] now off limits because we called Advance(1)?
}

I think we need the buffers to live for the life of the call, which means Advance should not be used for anything destructive or trans-formative. Outside of those calls anything goes though and it's up to the caller to understand the lifetime semantics.

Generally this interface can have lifetime problems if memory out lives a single owner. If you have different parties interleaved that don't agree on the semantics. We just need to pick the semantics we think works best for the scenarios we care about.

AArnott commented 5 years ago

@pakrym One significant difference between these examples is who owns the writer. In my scenario, I'm not exposing a higher level Writer type that others can pass an IBufferWriter<T> into, and then make multiple calls into the higher level writer, and possibly also calling FlushAsync on the underlying writer between those calls. My scenario is where I pass an IBufferWriter<T> into a method as an argument, and it controls the writer until it returns.

It is within the context of a single method call that receives the writer where I think this trick is perfectly valid and useful for performance. I wouldn't feel good about using the trick across multiple top level calls as you have demonstrated.

And I'm comfortable with this distinction. The documentation can be made clear about the behavior of GetSpan and Advance such that folks know how they work. And then the higher level Writer class author can either avoid the trick if they don't consider that they exclusively own the IBufferWriter<T> instance, or they can very clearly document their behavior to warn their own users from flushing in-between top-level calls.

pakrym commented 5 years ago

One significant difference between these examples is who owns the writer.

I agree that is the difference but decision here affects both scenarios and that's why I bring it up.

And I'm comfortable with this distinction.

I'm honestly not, it's somewhat similar to Span, we could've added it without compiler support and documented that it shouldn't be stored outside the stack but we didn't. Unfortunately, there is no way to enforce correct behaviour for IBW at compile time so I would like to at least give out guidance that is simple to understand and follow.

AArnott commented 5 years ago

Here's the simple guidance I would suggest for this interface, all in API docs:

GetSpan/GetMemory:

Returns a buffer that may be written to. Successive calls (without calling Advance after each) may return the same buffer. Call Advance to indicate how many elements have been written to the buffer, and to ensure that a subsequent call to this method will not return the same buffer. The buffer may be written to even after calling Advance, but the buffer must be written to before giving up exclusive ownership of the IBufferWriter<T> implementation, since some implementations may offer additional methods that invalidate the buffer.

Advance:

Used to claim some number of elements in a buffer previously returned by GetSpan/GetMemory so that that space is not returned by future calls to those methods. Typically called after the buffer has been written to.

Any feedback on the above docs, I'll be happy to incorporate by editing this same message so we have all the thoughts concentrated in one spot.

Of note is that my ask that folks can write after calling Advance isn't what causes the problem that requires exclusive ownership of the writer. Consider that if I simply call GetMemory and hold that buffer without calling Advance, then if I'm not the exclusive owner of the writer, I still have a problem. It's critical to have exclusive ownership of the writer already (given the documented behavior we all agree to) at least until all bytes have been written to the buffer and Advance is called. I'm merely asking for that condition to work where those two requirements can occur in either order.

AArnott commented 5 years ago

I'm closing (without a fix) based on our discussion.