datafusion-contrib / datafusion-functions-variant

DataFusion package for the Open Variant type

Motivation, performance and intended use? #11

Open · samuelcolvin opened this issue 2 months ago

samuelcolvin commented 2 months ago

@wjones127 this is very interesting, but having read https://github.com/apache/spark/tree/master/common/variant, I'm struggling to understand which queries would be fast with the Open Variant Type (OVT). That document dives straight into the details of the format without taking time to explain the motivation for it, or where they expect it to be useful, let alone giving any quantitative measure of how much faster it should be.

To make these questions more concrete, here are some direct questions:

- Will the following queries be faster than datafusion-functions-json, and if so by roughly how much?
- Will OVT allow vectorised operations directly on the contained data?
- Will serializing data to OVT be faster than to JSON, and if so by how much? Same for deserializing?
- What's the relative size of OVT vs JSON?
- Using a binary format like this means we have to decode rows to JSON or similar whenever we want to show them to users; any idea what the overhead will be of this? (I guess this is the same as the "deserializing" question.)

In particular, is there any way to exclude some rows from a filter by knowing they don't contain a specific key?

I guess benchmarks vs datafusion-functions-json would be extremely interesting.

cc @davidhewitt @adriangb


Some background: based on my experience building jiter, I have the kernel of an idea for a binary format to encode JSON data, but I want to avoid building it if OVT is going to really work, especially since OVT is (to one degree or another) a standard endorsed by Apache / Spark.

davidhewitt commented 2 months ago

It looks like these benchmarks could already be built using the open-variant crate in this repo. From what I can see, it should be more efficient to store and query than text JSON because there's no cost for parsing integers or similar.

I don't have as good a sense of serialization speed, but I'd be tempted to guess that avoiding formatting integers etc. is a similar win; see the sketch below.
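As a crude way to sanity-check that intuition, here's a minimal, self-contained Rust micro-benchmark sketch (std only, deliberately not using the open-variant crate, whose API I haven't checked) comparing parsing a decimal integer from text against reading a fixed-width little-endian integer, which is roughly the work a binary format skips:

```rust
use std::time::Instant;

fn main() {
    let n = 10_000_000u64;

    // Parse the same decimal string repeatedly, as a text-JSON reader must.
    let text = "1234567890";
    let start = Instant::now();
    let mut acc = 0i64;
    for _ in 0..n {
        acc = acc.wrapping_add(std::hint::black_box(text).parse::<i64>().unwrap());
    }
    println!("text parse:  {:?} (acc={acc})", start.elapsed());

    // Read the same value as fixed-width little-endian bytes.
    let bytes = 1234567890i64.to_le_bytes();
    let start = Instant::now();
    let mut acc = 0i64;
    for _ in 0..n {
        acc = acc.wrapping_add(i64::from_le_bytes(std::hint::black_box(bytes)));
    }
    println!("binary read: {:?} (acc={acc})", start.elapsed());
}
```

The binary read should come out substantially cheaper, though a proper benchmark against the actual crate (e.g. with criterion) is what would settle it.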

Relative size: I assume compressed JSON should be close to native OVT.

> using a binary format like this means we have to decode rows to JSON or similar whenever we want to show them to users, any idea what the overhead will be of this? (I guess this is the same as the "deserializing" question)

Yes, but we could push this compute onto user frontends and just transfer open-variant across the wire, if it's a standard. In that sense the transfer may be more efficient and the presentation can be lazy.

wjones127 commented 2 months ago

@samuelcolvin There is some discussion of performance throughout this presentation: https://www.youtube.com/watch?v=jtjOfggD4YY

Overall, I haven't gotten far enough to have done much benchmarking. If you have some time and motivation, you are more than welcome to run some experiments and share your results. I'll try to share my thoughts so far and answer your questions.

> will OVT allow vectorised operations directly on the contained data?

I'm not 100% sure what you mean by vectorized. When I think vectorized, I think of operating on columnar data, and this is decidedly not columnar. But the idea is that we would have fast kernels for extracting data into a columnar format, and from there we would have fast vectorized operations.

So my initial idea is that the expression `ovt_column->'foo' = 42` would be decomposed into `variant_cast(variant_get_field(ovt_column, 'foo'), Int64) = 42`. The `variant_cast()` function here would move the values into an `Int64Array`, and we could then do a fast vectorized equality operation for the comparison. Although it's probably worth benchmarking whether that extraction is worth it over just doing the comparison on the binary value in the existing buffer, in which case we could write a more specialized kernel.
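To make that decomposition concrete, here is a minimal Rust sketch of the two-step shape using arrow-rs arrays. The variant decoding itself is elided behind a hypothetical `decode_variant_i64` helper (a real implementation would go through the open-variant crate); only the extract-then-compare pipeline is the point:

```rust
use arrow::array::{BinaryArray, BooleanArray, Int64Array};

/// Hypothetical stand-in for `variant_get_field` + `variant_cast`:
/// pull field `key` out of each variant-encoded value as an i64,
/// yielding null where the row is null, the key is absent, or the
/// value isn't an integer.
fn extract_field_as_i64(values: &BinaryArray, key: &str) -> Int64Array {
    values
        .iter()
        .map(|row| row.and_then(|bytes| decode_variant_i64(bytes, key)))
        .collect()
}

/// Placeholder: real decoding would walk the variant metadata/value
/// buffers via the open-variant crate.
fn decode_variant_i64(_bytes: &[u8], _key: &str) -> Option<i64> {
    unimplemented!("decode one variant value and look up `key`")
}

/// `ovt_column->'foo' = 42` as two steps: extract to a columnar
/// Int64Array, then run a vectorized comparison over it.
fn eval_eq(values: &BinaryArray, key: &str, rhs: i64) -> BooleanArray {
    let extracted = extract_field_as_i64(values, key);
    // Nulls propagate as nulls, matching SQL semantics.
    extracted.iter().map(|v| v.map(|x| x == rhs)).collect()
}
```

The more specialized kernel mentioned above would fold the comparison into the decode loop and skip materializing the intermediate `Int64Array` entirely.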

> will serializing data to OVT be faster than to JSON, if so by how much? same for deserializing?

I think the question is what format are you serializing/deserializing from? Hard to answer without knowing that.

My intention for use is that unstructured data will be written to OVT. Then OVT will be used for querying.

> what's the relative size of OVT vs JSON

Basically, you get the space savings of moving all keys into a dictionary. Otherwise, I would think it's roughly similar. But I don't have any benchmarks yet.
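As a toy illustration of where that saving comes from (this is not the actual OVT wire layout, just the dictionary idea): repeated object keys are stored once in a metadata dictionary and referenced by small indices from the encoded values:

```rust
fn main() {
    // Ten rows, each repeating the same two keys in text JSON.
    let rows: Vec<String> = (0..10)
        .map(|i| format!(r#"{{"name":"user{i}","id":{i}}}"#))
        .collect();
    let json_bytes: usize = rows.iter().map(|r| r.len()).sum();

    // Dictionary-style encoding: each key stored once in metadata;
    // rows hold key *indices* instead of key strings.
    let dictionary = ["id", "name"];
    let metadata_bytes: usize = dictionary.iter().map(|k| k.len()).sum();
    // Rough per-row cost: 2 one-byte key indices + the two values
    // ("userN" string ~6 bytes, small int ~1 byte), no key text.
    let value_bytes: usize = rows.len() * (2 + 6 + 1);

    println!("text JSON: {json_bytes} bytes");
    println!("dict-encoded (rough): {} bytes", metadata_bytes + value_bytes);
}
```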

> using a binary format like this means we have to decode rows to JSON or similar whenever we want to show them to users, any idea what the overhead will be of this? (I guess this is the same as the "deserializing" question)

I imagine that eventually there'd be some direct way to print. Then the cost of formatting wouldn't be much different from the cost of formatting JSON to a terminal or HTML or whatever.

samuelcolvin commented 1 month ago

Thanks so much @wjones127, this is useful. I've just watched the video you linked to, which was illuminating.