FasterXML / jackson-dataformats-text

Uber-project for (some) standard Jackson textual format backends: csv, properties, yaml (xml to be added in future)
Apache License 2.0

Async (non-blocking) support for CSV #287

Open · robinroos opened this issue 3 years ago

robinroos commented 3 years ago

This question relates to usage of Jackson DataFormats CSV within Spring WebFlux.

I have a REST controller (Spring WebFlux) which returns a Flux from a Stream of a simple bean (MapInfo). The endpoint serves application/json (only for demo) and application/x-ndjson (for large datasets). I'd like to add support for text/csv also.

    @GetMapping(produces = { "application/json", "application/x-ndjson", "text/csv" })
    public Flux<MapInfo> mapInfo() {
        return Flux.fromStream(mapInfoStream());
    }

Invoking this for text/csv results in HTTP 500 with message: "No Encoder for [com.mizuho.fo.dataservices.hc.controller.MapInfoController$MapInfo] with preset Content-Type 'null'"

I have not yet added any Jackson Dataformat libraries so this is expected.

Question: Does Jackson Dataformat CSV have the necessary non-blocking support in order for it to be used to convert the payload to CSV?

Of course I could write imperative code to format the output as CSV myself, returning a Flux of String for instance, but my hope is to produce CSV output without dropping below the abstraction level already in place (returning a Flux from a Stream of the POJO).
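
For what it's worth, that fallback might look something like the following (an untested sketch; assumes jackson-dataformat-csv is on the classpath, and the header row is left out for brevity):

    import com.fasterxml.jackson.core.JsonProcessingException;
    import com.fasterxml.jackson.databind.ObjectWriter;
    import com.fasterxml.jackson.dataformat.csv.CsvMapper;
    import reactor.core.Exceptions;
    import reactor.core.publisher.Flux;

    private final CsvMapper csvMapper = new CsvMapper();
    // Column order is derived from the MapInfo bean definition
    private final ObjectWriter csvWriter = csvMapper.writer(csvMapper.schemaFor(MapInfo.class));

    @GetMapping(produces = "text/csv")
    public Flux<String> mapInfoCsv() {
        return Flux.fromStream(mapInfoStream())
                .map(info -> {
                    try {
                        // one CSV row per element, line separator included
                        return csvWriter.writeValueAsString(info);
                    } catch (JsonProcessingException e) {
                        throw Exceptions.propagate(e);
                    }
                });
    }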

robinroos commented 3 years ago

Full source.

MapInfoController.txt

cowtowncoder commented 3 years ago

Currently only the JSON and Smile format backends support async parsing. Some others (like CBOR) might be relatively straightforward to support. Fundamentally there is nothing preventing CSV from being supported, but someone would have to spend quite a bit of time implementing it all -- and it probably could not reuse much code from the JSON or Smile codecs, since the decoding is rather different. So as of now there is no support at the Jackson streaming level for async/non-blocking parsing of CSV content.
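
For reference, the non-blocking API that does exist on the JSON side looks roughly like this (a minimal sketch against jackson-core 2.9+; the inline chunk just stands in for bytes arriving from the network):

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;
    import com.fasterxml.jackson.core.async.ByteArrayFeeder;
    import java.nio.charset.StandardCharsets;

    // inside a method that throws IOException
    JsonFactory factory = new JsonFactory();
    JsonParser parser = factory.createNonBlockingByteArrayParser();
    ByteArrayFeeder feeder = (ByteArrayFeeder) parser.getNonBlockingInputFeeder();

    // Feed whatever bytes happen to have arrived; repeat as chunks come in
    byte[] chunk = "{\"id\":1}".getBytes(StandardCharsets.UTF_8);
    feeder.feedInput(chunk, 0, chunk.length);

    JsonToken t;
    while ((t = parser.nextToken()) != JsonToken.NOT_AVAILABLE && t != null) {
        // process token; NOT_AVAILABLE means "feed more input and retry"
    }
    feeder.endOfInput(); // once the source is exhausted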

However: I think your question relates more to the Spring side of things, so the Spring WebFlux folks/user community can probably say more about the intervening functionality and requirements.

martin-traverse commented 1 year ago

Hello, just checking whether there is any update on this? I would very much like to use async CSV parsing if/when it becomes available.

We have a data component that converts to/from a number of supported formats using Apache Arrow as the common intermediate. We can receive incoming JSON with true streaming, but for incoming CSV we have to insert a buffering stage. For large datasets we're converting CSV -> JSON in the client; if we had CSV streaming we could send large CSV files straight up, which would be a lot faster.

cowtowncoder commented 1 year ago

No update; I don't really have time to work on this, although if someone were to tackle it, I'd do my best to help. I agree, async decoding would be pretty valuable. Unfortunately the existing async decoders for JSON and Smile (or Aalto XML) aren't of much help here, as the state machines are quite different.

One thing to note, though, is that the module already supports plain incremental streaming. The amount of buffering used by default is not much more than for JSON parsing (there is no requirement to decode a full line, if I recall correctly); basically it only needs one full cell. But that isn't async parsing yet, of course.
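
For example, that incremental read looks something like this (a minimal sketch; inputStream is whatever byte source you have, with a header row as the first line):

    import com.fasterxml.jackson.databind.MappingIterator;
    import com.fasterxml.jackson.dataformat.csv.CsvMapper;
    import com.fasterxml.jackson.dataformat.csv.CsvSchema;
    import java.util.Map;

    CsvMapper mapper = new CsvMapper();
    CsvSchema schema = CsvSchema.emptySchema().withHeader(); // first row = column names
    try (MappingIterator<Map<String, String>> rows = mapper
            .readerFor(Map.class)
            .with(schema)
            .readValues(inputStream)) { // rows are decoded lazily, one at a time
        while (rows.hasNext()) {
            Map<String, String> row = rows.next();
            // process one row; buffering is roughly one cell/row, not the whole file
        }
    }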

martin-traverse commented 1 year ago

Hello, thanks for getting back to me. I can't promise if/when I'll get any time (such is life!) but let me see if I've got the shape of the problem:

Is that the shape of it, or am I way off the mark? I appreciate there's a lot more detail! I was slightly confused by the "simple streaming" bit (I saw this on the README page as well) - does it refer to the implementation, i.e. that the decoder doesn't buffer the whole content of the stream but consumes tokens one at a time?

I haven't looked at all at the object mapper level, I don't use it myself.

yawkat commented 1 year ago

@martin-traverse you can take a look at how the non-blocking json parsers in jackson-core are implemented.

martin-traverse commented 1 year ago

Thanks @yawkat, just had a quick look. It seems like that required a largely separate implementation of both the UTF-8 decoding and the JSON parsing / state logic. Could the decoding bit perhaps be factored out so it can be shared between parsers?

I'll try to find some time to sketch out a PR - can't promise though, the next few weeks are pretty busy....

cowtowncoder commented 1 year ago

Right, I think an async decoder for CSV would probably be quite a bit simpler, though some aspects (the UTF-8 decoding) are similar. I doubt refactoring is possible, however, partly because combining UTF-8 character decoding and tokenization is (for JSON and Smile at least) an important reason for the good performance.

The part that differs is the state machine, which is needed to keep exact state for the cases you mention (end of content within a token, or even within one UTF-8 character).

And yes, trying to support encodings other than UTF-8 would be tricky with the approach I used for the JSON and Smile codecs (and Aalto XML as well).

I guess an additional complexity for CSV would be its configurable escaping settings.
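
Very roughly, the kind of state such a tokenizer would have to carry across input boundaries might look like this (purely a hypothetical sketch, not existing code):

    // Hypothetical state enum for an incremental CSV tokenizer; the parser
    // would have to be resumable from any of these when input runs out.
    enum CsvState {
        START_OF_ROW,        // before the first cell of a row
        IN_UNQUOTED_CELL,    // accumulating plain characters
        IN_QUOTED_CELL,      // inside "...", separators are literal
        QUOTE_IN_QUOTED,     // saw a quote inside a quoted cell: closing or doubled?
        AFTER_ESCAPE,        // saw the configured escape char, next char is literal
        SPLIT_UTF8_CHAR      // input ended in the middle of a multi-byte character
    }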

Alternatively, a completely different approach would be one where decoding of the character encoding is separated from tokenization. This would probably be slightly simpler; the first part in particular could probably process a whole buffer at a time (with only the decoding layer needing to handle incremental input). There'd be some more work in syncing those two layers, but it could probably lead to somewhat more reusable code -- I wouldn't retrofit it onto JSON/Smile for various reasons, but it could be used for other textual formats for sure.
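
As a sketch of that layered alternative: the JDK's incremental CharsetDecoder already handles the split-multi-byte-character case at the byte layer, so a tokenizer sitting above it would only ever see whole chars (the feed() method here is hypothetical):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.StandardCharsets;

    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    CharBuffer chars = CharBuffer.allocate(8 * 1024);

    // Called once per arriving chunk; 'bytes' may end mid-character.
    void feed(ByteBuffer bytes, boolean endOfInput) {
        // decode() stops before an incomplete trailing sequence and leaves
        // those bytes in 'bytes'; UNDERFLOW just means "feed more later".
        CoderResult result = decoder.decode(bytes, chars, endOfInput);
        chars.flip();
        // hand the complete chars to the separate CSV tokenizer layer here
        chars.clear();
        bytes.compact(); // carry any partial UTF-8 sequence into the next feed
    }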