dcmoura / spyql

Query data on the command line with SQL-like SELECTs powered by Python expressions
https://spyql.readthedocs.io
MIT License

JSON benchmark #72

Open dcmoura opened 2 years ago

dcmoura commented 2 years ago

Leave your comments for the JSON benchmark here.

dnaaun commented 2 years ago

I'm curious as to why libraries like ultrajson and orjson weren't explored. They aren't command-line tools, but neither is pandas, right? Is it perhaps because the code required to implement the challenges is large enough that they are considered too inconvenient to use in the same way pandas was used (i.e., python -c "...")?

dcmoura commented 2 years ago

> I'm curious as to why libraries like ultrajson and orjson weren't explored. They aren't command-line tools, but neither is pandas, right? Is it perhaps because the code required to implement the challenges is large enough that they are considered too inconvenient to use in the same way pandas was used (i.e., python -c "...")?

Thank you @davidatbu! The idea was to focus on querying tools. ujson and orjson (as well as the json module from Python's standard library) offer JSON encoding and decoding but not a query language: you need to implement the query logic in Python, resulting in large programs with lots of boilerplate. Still, I agree that pandas is an outlier... it was included due to its popularity for querying datasets.
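For illustration, here is roughly what that boilerplate looks like: a minimal, hypothetical filter over JSON lines using python -c with orjson (the field names and input file name are made up for the example):

```sh
# Hypothetical hand-rolled equivalent of a simple
# "SELECT name WHERE score > 4" over JSON lines.
# Field names and the input file name are made up.
python3 -c '
import sys
import orjson

for line in sys.stdin:
    rec = orjson.loads(line)        # decode one JSON record per line
    if rec.get("score", 0) > 4:     # filter logic, written by hand
        out = orjson.dumps({"name": rec["name"]})
        sys.stdout.write(out.decode() + "\n")
' < sample.json
```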

I should mention that spyql leverages orjson, which has a considerable impact on performance. spyql supports both the json module from the standard library and orjson as the JSON decoder/encoder. Performance-wise, for 1GB of input data, orjson decreases processing time by 20-30%. So orjson is part of the reason why a Python-based tool outperforms tools written in C, Go, etc., and it deserves credit.
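As a rough sketch of the decoder-level difference, one could run a synthetic micro-benchmark like the one below (the payload is made up, and note the 20-30% above refers to spyql's end-to-end pipeline, not decode speed alone):

```sh
# Synthetic micro-benchmark of decode speed only (payload is made up).
python3 -c '
import json, timeit
import orjson

doc = json.dumps({"name": "x", "score": 4.5, "tags": list(range(50))})
print("json  :", timeit.timeit(lambda: json.loads(doc), number=200_000))
print("orjson:", timeit.timeit(lambda: orjson.loads(doc), number=200_000))
'
```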

You can find more info about the performance impact of orjson in https://github.com/dcmoura/spyql/pull/70.

davidatsurge commented 2 years ago

> resulting in large programs with lots of boilerplate

That was what I was trying to say when I said "the code required to implement the challenges is large enough that they are considered too inconvenient to use". This makes sense to me.

Thank you for doing this benchmark! I've been using jq, but will probably switch to spyql now.

There's more to my reply on HN (regarding the title of the shared link), but I don't think that's relevant here, so I've left it out.

cube2222 commented 2 years ago

Hey @dcmoura!

I've actually already made the changes to OctoSQL that I described in the HN thread. Thanks for motivating me :) It was much less work than I anticipated.

You can run the new version with `--output json` and it will provide you with eagerly printed JSON Lines output.
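A minimal sketch of that usage, assuming a sample.json input (the early exit on a closed pipe is also an assumption):

```sh
# Hypothetical usage: stream JSON Lines output and stop after 3 records.
octosql --output json "SELECT * FROM sample.json" | head -n 3
```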

I've run your benchmark notebook and OctoSQL is now just slightly slower than SPyQL (and the RAM is stable as well).

It'd be great if you could update the notebook with the above change. (btw, I had to add `!go clean -modcache` when reinstalling; otherwise Go cached the old module versions and didn't install the new one).

The default (without `--output json`) still has to buffer the output, as that's necessary to print a properly formatted ASCII table.

dcmoura commented 2 years ago

Hey @cube2222 !

> I've actually already made the changes to OctoSQL that I described in the HN thread. Thanks for motivating me :) It was much less work than I anticipated.

Great!! I am happy :-) Nothing like a benchmark to bring some extra motivation ;-)

> You can run the new version with `--output json` and it will provide you with eagerly printed JSON Lines output.

I was missing this feature in octosql :-)

> I've run your benchmark notebook and OctoSQL is now just slightly slower than SPyQL (and the RAM is stable as well).

💪

> It'd be great if you could update the notebook with the above change. (btw, I had to add `!go clean -modcache` when reinstalling; otherwise Go cached the old module versions and didn't install the new one).

Of course! I should be able to do it on Monday.

> The default (without `--output json`) still has to buffer the output, as that's necessary to print a properly formatted ASCII table.

SpyQL does the same with pretty printing.

Thank you!!

Avogar commented 2 years ago

Hi! Your benchmark interested me and I decided to run your tests with the clickhouse-local tool (a program that enables fast processing of local files without having to deploy and configure a ClickHouse server). It looks like it is much faster than the rest of the presented tools. You can look at the results of your benchmark with clickhouse-local here: https://colab.research.google.com/drive/1pGmfpNdGyzE1KEXDnSSiRgxQy9bcinGZ?usp=sharing
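A sketch of what such a run might look like (the query itself is made up; books.json is a dataset from the benchmark):

```sh
# Hypothetical clickhouse-local invocation over a JSON Lines file;
# file() with the JSONEachRow format is a standard ClickHouse pattern.
clickhouse local --query \
  "SELECT count() FROM file('books.json', JSONEachRow)"
```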

ClickHouse github repo (if you are interested): https://github.com/ClickHouse/ClickHouse

It would be great if you could add clickhouse-local to your benchmark.

dcmoura commented 2 years ago

> Hi! Your benchmark interested me and I decided to run your tests with the clickhouse-local tool (a program that enables fast processing of local files without having to deploy and configure a ClickHouse server). It looks like it is much faster than the rest of the presented tools. You can look at the results of your benchmark with clickhouse-local here: https://colab.research.google.com/drive/1pGmfpNdGyzE1KEXDnSSiRgxQy9bcinGZ?usp=sharing

Wow! Seems really fast!!

> ClickHouse GitHub repo (if you are interested): https://github.com/ClickHouse/ClickHouse

Of course I am! Still trying to figure out how I haven't stumbled upon ClickHouse before... I was not aware of the option to run it locally.

> It would be great if you could add clickhouse-local to your benchmark.

Of course! I will include the tools that are missing from the benchmark and repost; it's only fair to do so.

@Avogar I was trying to run clickhouse on Colab and I got the following error, even though I did run your installation script. Could you help?

```
src/tcmalloc.cc:283] Attempt to free invalid pointer 0x7fbe7ea09000
```

I did try running it on my local machine and I am amazed by the speed. I just needed to adjust the `input_format_max_rows_to_read_for_schema_inference` setting, which was set too low.

Thanks!

Avogar commented 2 years ago

> I was trying to run clickhouse on Colab and I got the following error, even though I did run your installation script. Could you help?

Colab does an LD_PRELOAD of tcmalloc, while we use jemalloc. To run ClickHouse in Colab we should run it like `LD_PRELOAD='' clickhouse local ...` or do `!export LD_PRELOAD=''` (like I did in the benchmark cell, before the loop).

> I just needed to adjust the `input_format_max_rows_to_read_for_schema_inference` setting, which was set too low.

Yep, the default value is 100, but the books.json dataset introduces a new JSON field, image, after more than 100 rows, so we should adjust this setting.
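A minimal sketch of raising that limit, assuming the setting can be passed as a command-line option (the value 100000 is arbitrary and the query is made up):

```sh
# Read more rows before inferring the schema so that late-appearing
# fields like "image" are picked up.
clickhouse local \
  --input_format_max_rows_to_read_for_schema_inference=100000 \
  --query "SELECT count() FROM file('books.json', JSONEachRow)"
```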

dcmoura commented 2 years ago

> I was trying to run clickhouse on Colab and I got the following error, even though I did run your installation script. Could you help?

> Colab does an LD_PRELOAD of tcmalloc, while we use jemalloc. To run ClickHouse in Colab we should run it like `LD_PRELOAD='' clickhouse local ...` or do `!export LD_PRELOAD=''` (like I did in the benchmark cell, before the loop).

@Avogar I was unable to get clickhouse to work using your installation method. I had to follow these instructions ("From DEB Packages"): https://clickhouse.com/docs/en/getting-started/install/amp/

Then the fix of setting the env variable worked.

The version installed using this method was:

```
ClickHouse local version 22.3.3.44 (official build).
```

dcmoura commented 2 years ago

Hi again @cube2222. It seems that I cannot use arithmetic expressions in OctoSQL, or am I overlooking something? Thanks!

```sh
$ octosql --output json "SELECT reviewerName as name, reviewTime as date, overall/5.0 as score FROM sample.json"
...
Error: couldn't parse query: invalid argument syntax error at position 61 near '.0'

$ octosql --output json "SELECT reviewerName as name, reviewTime as date, overall/5 as score FROM sample.json"
...
Error: typecheck error: unknown variable: 'overall/5'
```

cube2222 commented 2 years ago

@dcmoura Hey, looks like a parser issue affecting division (you can totally write e.g. overall + 5.0).

I'll look into it. In the meantime, you can use the alternative notation `/(overall, 5.0)`:

```sh
octosql --output json "SELECT reviewerName as name, reviewTime as date, /(overall, 5.0) as score FROM sample.json"
```

Thanks for letting me know!

dcmoura commented 2 years ago

@cube2222 @Avogar As promised, the benchmark was updated. Congratulations on the pole position(s) 😄!

dcmoura commented 2 years ago

I asked for the following comment to be pinned to the top of the HN thread: https://news.ycombinator.com/item?id=31111863

This was suggested by the HN admins as an alternative to reposting.

dcmoura commented 2 years ago

Posted the updated benchmark on Reddit r/programming.

eatonphil commented 2 years ago

Hey, thanks for putting the benchmarks together. One issue I noticed is that you don't have an ORDER BY with the LIMIT, so you aren't actually comparing like with like in this case. SQL (or at least most implementations of it) does not define any meaningful order by default, and I can see that ClickHouse does in fact produce something different from what spyql produces.

This doesn't change the fact that dsq is pretty slow. But I just wanted to point out that, overall, LIMIT without ORDER BY is not a reasonable benchmark, and in general you may want to make sure that each tool produces the same result when you're benchmarking.

If the tools don't all produce the same result you're still measuring something, but I don't think it's a meaningful something from a user's perspective.

cube2222 commented 2 years ago

I think it makes sense to include the use case of taking the first n rows/records of a file. It's fairly intuitive behavior and a common use case (at least in my own usage, the first thing I do with a file I'm working on is head, or an analogue of it). SQL doesn't define this, but implementations are free to provide their own guarantees. And unlike dsq (or ClickHouse, for that matter), SPyQL (afaik) and OctoSQL venture pretty far from the SQL standard in many ways.

That said, as @eatonphil suggests, I would keep the benchmark apples to apples: add a comment for the tools that don't support this behavior and exclude them from that benchmark case (and possibly add another case that does use an ORDER BY).
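A sketch of what that apples-to-apples variant could look like (the column names are made up):

```sh
# Hypothetical deterministic variant: pin row order before limiting,
# at the cost of scanning (and sorting) the whole input.
octosql --output json "SELECT name, score FROM sample.json ORDER BY name LIMIT 10"
```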

dcmoura commented 2 years ago

> One issue I noticed is that you don't have an ORDER BY with the LIMIT.

Yes, and this was a conscious decision. The goal is simply to understand whether the tool is smart enough to avoid scanning the full dataset when only a sample is required. I know that standard SQL makes no guarantees about the order of the output if you do not specify an ORDER BY. Still, most (if not all) database engines I know will not do a full table scan; they will stop as soon as the LIMIT clause is satisfied. The request is: give me any N rows (and stop as soon as you have them). Adding an ORDER BY would defeat the purpose of this test.

Why is this test important? Because many times we just want to work with a sample, take a quick look, or iterate on a query based on a sample. Tools that load all the data into memory fail this test and require some kind of pre-processing (e.g. invoking head on the command line).
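For contrast, a hypothetical pre-sampling pipeline with head might look like this (the file name and query are made up; the spyql syntax is a sketch based on its docs):

```sh
# Pre-sampling workaround for tools that would otherwise scan everything:
# take the first 1000 JSON lines, then query only that sample.
head -n 1000 books.json | spyql "SELECT json->title FROM json TO json"
```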

In conclusion, I don't see a reason to change the benchmark. Some tools will take advantage of parallel processing, cached data, processing as they go, etc., and that might result in different outputs, but that's OK in my view.