FasterXML / jackson-dataformats-text

Uber-project for (some) standard Jackson textual format backends: csv, properties, yaml (xml to be added in future)
Apache License 2.0

Async (non-blocking) support for CSV #287

Open · robinroos opened this issue 3 years ago

robinroos commented 3 years ago

This question relates to usage of Jackson DataFormats CSV within Spring WebFlux.

I have a REST controller (Spring WebFlux) which returns a Flux from a Stream of a simple bean (MapInfo). The endpoint serves application/json (only for demo) and application/x-ndjson (for large datasets). I'd like to add support for text/csv also.

    @GetMapping(produces = { "application/json", "application/x-ndjson", "text/csv" })
    public Flux<MapInfo> mapInfo() {
        return Flux.fromStream(mapInfoStream());
    }

Invoking this for text/csv results in HTTP 500 with message: "No Encoder for [com.mizuho.fo.dataservices.hc.controller.MapInfoController$MapInfo] with preset Content-Type 'null'"

I have not yet added any Jackson Dataformat libraries so this is expected.

Question: Does Jackson Dataformat CSV have the necessary non-blocking support in order for it to be used to convert the payload to CSV?

Of course I could write imperative code to format the output as CSV myself, returning a Flux of String for instance, but my hope is to produce CSV output without dropping below the abstraction level already in place (returning a Flux from a Stream of the POJO).
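
For what it's worth, that fallback might look something like the following (an untested sketch; assumes jackson-dataformat-csv is on the classpath, and the header row is left out for brevity):

    import com.fasterxml.jackson.core.JsonProcessingException;
    import com.fasterxml.jackson.databind.ObjectWriter;
    import com.fasterxml.jackson.dataformat.csv.CsvMapper;
    import reactor.core.Exceptions;
    import reactor.core.publisher.Flux;

    private final CsvMapper csvMapper = new CsvMapper();
    // Column order is derived from the MapInfo bean definition
    private final ObjectWriter csvWriter = csvMapper.writer(csvMapper.schemaFor(MapInfo.class));

    @GetMapping(produces = "text/csv")
    public Flux<String> mapInfoCsv() {
        return Flux.fromStream(mapInfoStream())
                .map(info -> {
                    try {
                        // one CSV row per element, line separator included
                        return csvWriter.writeValueAsString(info);
                    } catch (JsonProcessingException e) {
                        throw Exceptions.propagate(e);
                    }
                });
    }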

robinroos commented 3 years ago

Full source.

MapInfoController.txt

cowtowncoder commented 3 years ago

Currently only the JSON and Smile format backends support async parsing. Some others (like CBOR) might be relatively straightforward to support. Fundamentally there is nothing preventing CSV from being supported, but someone would have to spend quite a bit of time implementing it all -- and it probably could not reuse much code from the JSON or Smile codecs, since the decoding is rather different. So as of now there is no support at the Jackson streaming level for async/non-blocking parsing of CSV content.
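
For reference, the non-blocking API that does exist on the JSON side looks roughly like this (a minimal sketch against jackson-core 2.9+; the inline chunk just stands in for bytes arriving from the network):

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;
    import com.fasterxml.jackson.core.async.ByteArrayFeeder;
    import java.nio.charset.StandardCharsets;

    // inside a method that throws IOException
    JsonFactory factory = new JsonFactory();
    JsonParser parser = factory.createNonBlockingByteArrayParser();
    ByteArrayFeeder feeder = (ByteArrayFeeder) parser.getNonBlockingInputFeeder();

    // Feed whatever bytes happen to have arrived; repeat as chunks come in
    byte[] chunk = "{\"id\":1}".getBytes(StandardCharsets.UTF_8);
    feeder.feedInput(chunk, 0, chunk.length);

    JsonToken t;
    while ((t = parser.nextToken()) != JsonToken.NOT_AVAILABLE && t != null) {
        // process token; NOT_AVAILABLE means "feed more input and retry"
    }
    feeder.endOfInput(); // once the source is exhausted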

However: I think your question relates more to the Spring side of things, so the Spring WebFlux folks/user community can probably say more about the intervening functionality and requirements.

martin-traverse commented 1 year ago

Hello, just checking whether there is any update on this? I would very much like to use async CSV parsing if/when it becomes available.

We have a data component that converts to/from a number of supported formats using Apache Arrow as the common intermediate. We can receive incoming JSON with true streaming, but for incoming CSV we have to insert a buffering stage. For large datasets we're converting CSV -> JSON in the client; if we had CSV streaming we could send large CSV files straight up, which would be a lot faster.

cowtowncoder commented 1 year ago

No update; I don't really have time to work on this, although if someone were to tackle it, I'd do my best to help. I agree, async decoding would be pretty valuable. Unfortunately the existing async decoders for JSON and Smile (or Aalto XML) aren't of much help here, as the state machines are quite different.

One thing to note, though, is that the module already supports plain incremental streaming. The amount of buffering used by default is not much more than for JSON parsing (there is no requirement to decode a full line, if I recall correctly); basically it only needs one full cell. But that isn't async parsing yet, of course.
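
For example, that incremental read looks something like this (a minimal sketch; inputStream is whatever byte source you have, with a header row as the first line):

    import com.fasterxml.jackson.databind.MappingIterator;
    import com.fasterxml.jackson.dataformat.csv.CsvMapper;
    import com.fasterxml.jackson.dataformat.csv.CsvSchema;
    import java.util.Map;

    CsvMapper mapper = new CsvMapper();
    CsvSchema schema = CsvSchema.emptySchema().withHeader(); // first row = column names
    try (MappingIterator<Map<String, String>> rows = mapper
            .readerFor(Map.class)
            .with(schema)
            .readValues(inputStream)) { // rows are decoded lazily, one at a time
        while (rows.hasNext()) {
            Map<String, String> row = rows.next();
            // process one row; buffering is roughly one cell/row, not the whole file
        }
    }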

martin-traverse commented 1 year ago

Hello, thanks for getting back to me. I can't promise if/when I'll get any time (such is life!) but let me see if I've got the shape of the problem:

Is that the shape of it, or am I way off the mark? I appreciate there's a lot more detail! I was slightly confused by the "simple streaming" bit (I saw this on the README page as well) - does it refer to the implementation, i.e. that the decoder doesn't buffer the whole content of the stream but consumes tokens one at a time?

I haven't looked at all at the object mapper level, I don't use it myself.

yawkat commented 1 year ago

@martin-traverse you can take a look at how the non-blocking json parsers in jackson-core are implemented.

martin-traverse commented 1 year ago

Thanks @yawkat, just had a quick look. It seems like that required a largely separate implementation of both the UTF-8 decoding and the JSON parsing / state logic. Could the decoding bit perhaps be factored out so it can be shared between parsers?

I'll try to find some time to sketch out a PR - can't promise though, the next few weeks are pretty busy....

cowtowncoder commented 1 year ago

Right, I think an async decoder for CSV would probably be quite a bit simpler, though some aspects (the UTF-8 decoding) are similar. I doubt refactoring is possible, however, partly because combining UTF-8 character decoding and tokenization is (for JSON and Smile at least) an important reason for the good performance.

The part that differs is the state machine, which is needed to keep exact state for the cases you mention (end of content within a token, or even within one UTF-8 character).

And yes, trying to support encodings other than UTF-8 would be tricky with the approach I used for the JSON and Smile codecs (and Aalto XML as well).

I guess an additional complexity for CSV would be its configurable escaping settings.
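
Very roughly, the kind of state such a tokenizer would have to carry across input boundaries might look like this (purely a hypothetical sketch, not existing code):

    // Hypothetical state enum for an incremental CSV tokenizer; the parser
    // would have to be resumable from any of these when input runs out.
    enum CsvState {
        START_OF_ROW,        // before the first cell of a row
        IN_UNQUOTED_CELL,    // accumulating plain characters
        IN_QUOTED_CELL,      // inside "...", separators are literal
        QUOTE_IN_QUOTED,     // saw a quote inside a quoted cell: closing or doubled?
        AFTER_ESCAPE,        // saw the configured escape char, next char is literal
        SPLIT_UTF8_CHAR      // input ended in the middle of a multi-byte character
    }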

Alternatively, a completely different approach would be one where decoding of the character encoding is separated from tokenization. This would probably be slightly simpler; the first part in particular could probably process a whole buffer at a time (with only the decoding layer needing to handle incremental input). There'd be some more work in syncing those two layers, but it could probably lead to somewhat more reusable code -- I wouldn't retrofit it onto JSON/Smile for various reasons, but it could be used for other textual formats for sure.
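
As a sketch of that layered alternative: the JDK's incremental CharsetDecoder already handles the split-multi-byte-character case at the byte layer, so a tokenizer sitting above it would only ever see whole chars (the feed() method here is hypothetical):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.StandardCharsets;

    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    CharBuffer chars = CharBuffer.allocate(8 * 1024);

    // Called once per arriving chunk; 'bytes' may end mid-character.
    void feed(ByteBuffer bytes, boolean endOfInput) {
        // decode() stops before an incomplete trailing sequence and leaves
        // those bytes in 'bytes'; UNDERFLOW just means "feed more later".
        CoderResult result = decoder.decode(bytes, chars, endOfInput);
        chars.flip();
        // hand the complete chars to the separate CSV tokenizer layer here
        chars.clear();
        bytes.compact(); // carry any partial UTF-8 sequence into the next feed
    }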