dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: Allow `CborReader` to read from a stream #99993

Open knutwannheden opened 7 months ago

knutwannheden commented 7 months ago

Background and motivation

When reading very large documents or documents that are read via a slow network connection, it seems very limiting that the whole document must be read into contiguous memory before it can be parsed. That can consume a lot of memory and doesn't allow an application to implement streaming in a reasonable way.
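For context, the limitation described above can be sketched roughly as follows. This is a minimal illustration, assuming the only way in today is the `CborReader` constructor that takes a `ReadOnlyMemory<byte>`; the helper class and method names are hypothetical and not part of `System.Formats.Cbor`.

```csharp
using System;
using System.Formats.Cbor;
using System.IO;

// Hypothetical helper for illustration only; not part of System.Formats.Cbor.
static class CborStreamWorkaround
{
    // Buffers the entire stream into one contiguous array before parsing.
    // No token can be read until the last byte has arrived.
    public static string ReadTextFromStream(Stream source)
    {
        using var buffer = new MemoryStream();
        source.CopyTo(buffer);                          // blocks until end of stream
        var reader = new CborReader(buffer.ToArray());  // byte[] converts to ReadOnlyMemory<byte>
        return reader.ReadTextString();                 // assumes the document is a single text string
    }
}
```

For a document of several megabytes, the `MemoryStream` and the array returned by `ToArray()` both hold a full copy of the payload, which is the memory cost the proposal is trying to avoid.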

API Proposal

No concrete proposal. It should just allow data to be read from a stream.

API Usage

None.

Alternative Designs

No response

Risks

No response

knutwannheden commented 7 months ago

Getting very close to ID 100,000 here 😄

Tagging @bartonjs here, whom I saw commenting on the only other CBOR issue I could find.

bartonjs commented 7 months ago

Reading from a stream would suggest that we'd also want Async versions of all of the reader methods, because there's no guarantee that the next element can finish without blocking; so it's a somewhat expensive proposal.

How big are the CBOR documents you're working with? When we were spinning up the project to make the reader, everything was just a few kilobytes.

knutwannheden commented 7 months ago

Could be megabytes. I was however also thinking that streaming would be nice because it would let a reader start processing the tokens even before all data has been transferred by the client.

kasperk81 commented 7 months ago

This Node.js package supports streaming: https://github.com/kriszyp/cbor-x?tab=readme-ov-file#streams. If the format spec doesn't require all of the data to be read as one block for encoding or decoding, could the .NET implementation be changed to process data progressively too?

AlgorithmsAreCool commented 7 months ago

@knutwannheden Is this proposal just for stream reading helpers that accept a stream and return a CborReader? Or is it for an incremental reader api that would allow interpreting partially downloaded data?

knutwannheden commented 7 months ago

@AlgorithmsAreCool If I understand your question correctly, it is the latter. So the CborReader would read bytes from the stream (as necessary) whenever a method is called on the reader to return the next token.

AlgorithmsAreCool commented 7 months ago

In that case, I would also be interested in a CborReader that had the incremental API much like JsonDocument that allowed us to read from very large CBOR documents and possibly CBOR Sequences. CBOR is basically binary JSON and we already accommodate massive JSON documents, so I think this is natural.

But it is a big feature and new API surface.

knutwannheden commented 6 months ago

Further, allowing the CborWriter to write directly to a Stream would also feel like a sensible addition. For my use case I have no need for async methods, as I don't have any desire to "async all the way up" my code base.
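The writing side has the same shape today. As a rough sketch, assuming the existing `CborWriter.Encode()` method is the only way to get the bytes out, a caller has to materialize the whole document and then copy it to the stream. The helper name below is hypothetical.

```csharp
using System.Formats.Cbor;
using System.IO;

// Hypothetical helper for illustration only; not part of System.Formats.Cbor.
static class CborWriterWorkaround
{
    // Encodes the complete document into a byte array, then writes it out.
    // The writer must already hold a complete root-level value.
    public static void WriteTo(CborWriter writer, Stream destination)
    {
        byte[] encoded = writer.Encode();
        destination.Write(encoded, 0, encoded.Length);
    }
}
```

Nothing reaches the stream until the document is finished, so a large document is held in the writer's internal buffer and again in the encoded array.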

IS4Code commented 3 months ago

I was quite confused seeing that there are no methods for operating with Stream instances, both for reading and writing. Are we supposed to process data in one big array like it's the C age again?

The current "buffer" approach is fine, but there must be a way to update the buffer without resetting the whole state. This could be quite a novel approach to semi-async data processing (synchronous parsing, asynchronous advancing), but without it, CborReader and CborWriter are pretty much unusable.