FasterXML / jackson-dataformats-text

Uber-project for (some) standard Jackson textual format backends: csv, properties, yaml (xml to be added in future)
Apache License 2.0

Limit the lengths of parsed CSV cells #70

Open aengelberg opened 6 years ago

aengelberg commented 6 years ago

I regularly work with dirty CSVs that have overly long cells between separators for some reason:

a,b,c
d,e,fffffffffff...

or have a misplaced double quote that mistakenly implies an especially large cell:

a,b,c
d,e,"f
g,h,i
j,k,l
... (the rest of the file is one cell?)

Both of these cases trigger an OutOfMemoryError if I use the Jackson parser. I would like to set a hard limit of, say, 1MB per cell, so that Jackson will halt before trying (and failing) to buffer large amounts of text into memory.

aengelberg commented 6 years ago

Are there any known workarounds that would let me effectively achieve this behavior with the current Jackson API? For example, plug in some kind of faux "parser" that throws away data if it goes over a certain threshold?

cowtowncoder commented 6 years ago

Unfortunately I don't think this would be an easy thing to do right now.

In theory such things are doable: for example XML parsers often support this. Woodstox, for example:

https://medium.com/@cowtowncoder/configuring-woodstox-xml-parser-woodstox-specific-properties-1ce5030a5173

has a large set of maximum size limits.

What is generally needed is support from the low-level parser (JsonParser and its subtypes) so that it can enforce limits, usually when reading a new buffer-full of data (so every 4k bytes or characters). But that has to be done for each format backend separately. Past the parser level, it is a little too late to enforce limits.

Another possibility, which could be a bit more general, would be to allow settings in the buffering class (TextBuffer, I think); this would be more generic, if coarser.

But I fear that tackling this problem would be best done with Jackson 3.0 (under development), the reason being that it allows much better configurability of format backends.

So... I don't really have a good solution at this point. What could perhaps work, from your end, is writing a custom Reader subtype that wraps the read() method. If it could interact with higher-level code, it could throw an exception if a length maximum was violated. I'm not completely sure how it should interact (perhaps the caller would need to effectively reset state between rows or tokens), but that would be one approach.
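
A rough sketch of that Reader-wrapping idea, using only java.io, might look like the following. The class name CellLimitingReader and its maxCellLength parameter are hypothetical, not part of Jackson. It also makes simplifying assumptions: the separator is a comma, the quote character is a double quote, and quote state is tracked by naively toggling on every quote character. It does not interact with the parser at all; it only counts characters between separators that fall outside quotes.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper (not a Jackson class): fails fast when the text
// between two unquoted separators grows beyond a configured limit.
class CellLimitingReader extends FilterReader {
    private final int maxCellLength;
    private int currentLength = 0;    // characters seen in the current cell
    private boolean inQuotes = false; // naive quote tracking; "" toggles twice

    CellLimitingReader(Reader in, int maxCellLength) {
        super(in);
        this.maxCellLength = maxCellLength;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c >= 0) {
            track((char) c);
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of input, in which case the loop does not run
        for (int i = 0; i < n; i++) {
            track(buf[off + i]);
        }
        return n;
    }

    private void track(char c) throws IOException {
        if (c == '"') {
            inQuotes = !inQuotes;
        }
        // An unquoted comma or line break ends the current cell
        if (!inQuotes && (c == ',' || c == '\n' || c == '\r')) {
            currentLength = 0;
            return;
        }
        if (++currentLength > maxCellLength) {
            throw new IOException(
                "CSV cell exceeds " + maxCellLength + " characters");
        }
    }
}
```

The wrapped reader could then be passed wherever Jackson accepts a Reader, for example ObjectReader.readValues(Reader). Because the parser reads ahead in buffer-sized chunks, the exception may surface a little before the parser itself reaches the offending cell, and the limit is approximate (quote characters are counted too), so this halts parsing rather than recovering from the bad row.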