Open: dcmoura opened this issue 2 years ago
I'm curious as to why libraries like ultrajson and orjson weren't explored. They aren't command-line tools, but neither is pandas, right? Is it perhaps because the code required to implement the challenges is large enough that they are considered too inconvenient to use in the same way pandas was used (i.e., `python -c "..."`)?
Thank you @davidatbu! The idea was to focus on querying tools. ujson and orjson (as well as the json module from Python's standard library) offer JSON encoding and decoding but not a query language: you need to implement the query logic in Python, resulting in large programs with lots of boilerplate. Still, I agree that Pandas is an outlier... it was included due to its popularity for querying datasets.
I should mention that spyql leverages orjson, which has a considerable impact on performance. spyql supports both the json module from the standard library and orjson as its JSON decoder/encoder. Performance-wise, for 1GB of input data, orjson decreases processing time by 20-30%. So orjson is part of the reason why a Python-based tool outperforms tools written in C, Go, etc., and it deserves credit.
You can find more info about the performance impact of orjson in https://github.com/dcmoura/spyql/pull/70.
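For illustration, here is a rough sketch of the hand-rolled approach: a simple projection over JSON lines written directly against orjson. The field names and the input file are only placeholders, not the actual benchmark queries; the point is the amount of plumbing needed before any query logic even starts.

```bash
python3 -c '
import sys
import orjson

# read JSON lines from stdin, project two fields, write JSON lines to stdout
for line in sys.stdin.buffer:
    record = orjson.loads(line)
    out = {
        "name": record.get("reviewerName"),
        "score": (record.get("overall") or 0) / 5.0,
    }
    sys.stdout.buffer.write(orjson.dumps(out) + b"\n")
' < sample.json
```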
> resulting in large programs with lots of boilerplate
That was what I was trying to say when I said "the code required to implement the challenges is large enough that they are considered too inconvenient to use". This makes sense to me.
Thank you for doing this benchmark! I've been using `jq`, but will probably switch to `spyql` now.
There's more to my reply on HN (regarding the title of the shared link on HN), but I don't think that's relevant here, so I've left it out.
Hey @dcmoura!
I've actually already made the necessary changes in OctoSQL that I described in the HN thread. Thanks for motivating me :) It was much less work than I anticipated.
You can run the new version with `--output json` and it will provide you with an eagerly printed jsonlines output.
I've run your benchmark notebook and OctoSQL is now just slightly slower than SPyQL (and the RAM is stable as well).
It'd be great if you could update the notebook with the above change. (btw. I've had to add `!go clean -modcache` when reinstalling, otherwise Go cached the versions and didn't install the new one).
The default (without `--output json`) still has to buffer the output, as that's necessary to print a properly-formatted ASCII table.
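As a reference for anyone trying this, a hypothetical invocation could look like the sketch below. Only the `--output json` flag comes from this thread; the query and file name are placeholders.

```bash
# Hypothetical usage sketch: query a JSON lines file and stream results eagerly,
# one JSON object per line, instead of buffering them for an ASCII table.
octosql --output json "SELECT reviewerName, overall FROM sample.json"
```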
Hey @cube2222 !
> I've actually already made the necessary changes in OctoSQL that I described in the HN thread. Thanks for motivating me :) It was much less work than I anticipated.
Great!! I am happy :-) Nothing like a benchmark to bring some extra motivation ;-)
> You can run the new version with `--output json` and it will provide you with an eagerly printed jsonlines output.
I was missing this feature in octosql :-)
> I've run your benchmark notebook and OctoSQL is now just slightly slower than SPyQL (and the RAM is stable as well).
💪
> It'd be great if you could update the notebook with the above change. (btw. I've had to add `!go clean -modcache` when reinstalling, otherwise Go cached the versions and didn't install the new one).
Of course! I should be able to do it on Monday.
> The default (without `--output json`) still has to buffer the output, as that's necessary to print a properly-formatted ASCII table.
SpyQL does the same with pretty printing.
Thank you!!
Hi! Your benchmark interested me and I decided to run your tests with the `clickhouse-local` tool (the program enables you to perform fast processing on local files, without having to deploy and configure the ClickHouse server). And it looks like it is much faster than the rest of the presented tools. You can look at the results of your benchmark with clickhouse-local here: https://colab.research.google.com/drive/1pGmfpNdGyzE1KEXDnSSiRgxQy9bcinGZ?usp=sharing
ClickHouse github repo (if you are interested): https://github.com/ClickHouse/ClickHouse
It would be great if you could add clickhouse-local to your benchmark.
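For anyone who wants to try it, a minimal sketch of a clickhouse-local invocation over a JSON lines file might look like this; the file name and query are placeholders, and the exact syntax may vary between ClickHouse versions.

```bash
# Count rows in a JSON lines file without starting a ClickHouse server.
clickhouse local --query "SELECT count() FROM file('sample.json', JSONEachRow)"
```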
> Hi! Your benchmark interested me and I decided to run your tests with the `clickhouse-local` tool (the program enables you to perform fast processing on local files, without having to deploy and configure the ClickHouse server). And it looks like it is much faster than the rest of the presented tools. You can look at the results of your benchmark with clickhouse-local here: https://colab.research.google.com/drive/1pGmfpNdGyzE1KEXDnSSiRgxQy9bcinGZ?usp=sharing
Wow! Seems really fast!!
> ClickHouse github repo (if you are interested): https://github.com/ClickHouse/ClickHouse
Of course I am! Still trying to figure out how I haven't stumbled upon clickhouse before... I was not aware of the option to run it locally.
> It would be great if you could add clickhouse-local to your benchmark.
Of course! I will be including the tools that are missing from the benchmark and reposting; it's only fair to do so.
@Avogar I was trying to run clickhouse on colab and I get the following error. I have actually run your installation script. Could you help?
src/tcmalloc.cc:283] Attempt to free invalid pointer 0x7fbe7ea09000
I did try running it on my local machine and I am amazed by the speed; I just needed to adjust the argument `input_format_max_rows_to_read_for_schema_inference`, which was set too low.
Thanks!
> I was trying to run clickhouse on colab and I get the following error. I have actually run your installation script. Could you help?
Colab is doing LD_PRELOAD of tcmalloc, while we are using jemalloc. To run clickhouse in colab we should run it like `LD_PRELOAD='' clickhouse local ...` or do `!export LD_PRELOAD=''` (like I did in the benchmark cell before the loop).
> I just needed to adjust the argument `input_format_max_rows_to_read_for_schema_inference`, which was set too low.
Yep, the default value is 100, but the books.json dataset has a new JSON field `image` after more than 100 rows, so we should adjust this setting.
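Putting the two fixes together, a Colab-friendly invocation might look like the sketch below; the query and file name are placeholders, and the exact way of passing the setting on the command line may differ between ClickHouse versions.

```bash
# Disable Colab's tcmalloc preload and raise the schema inference row limit so
# that late-appearing fields (like `image` in books.json) are picked up.
LD_PRELOAD='' clickhouse local \
  --input_format_max_rows_to_read_for_schema_inference=1000000 \
  --query "SELECT count() FROM file('books.json', JSONEachRow)"
```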
> I was trying to run clickhouse on colab and I get the following error. I have actually run your installation script. Could you help?
> Colab is doing LD_PRELOAD of tcmalloc, while we are using jemalloc. To run clickhouse in colab we should run it like `LD_PRELOAD='' clickhouse local ...` or do `!export LD_PRELOAD=''` (like I did in the benchmark cell before the loop).
@Avogar I was unable to get clickhouse to work using your installation method. I had to follow these instructions (from DEB packages): https://clickhouse.com/docs/en/getting-started/install/amp/
Then the fix of setting the env variable worked.
The version installed using this method was: `ClickHouse local version 22.3.3.44 (official build)`.
Hi again @cube2222. It seems that I cannot do arithmetic expressions in octosql, or am I overlooking something? Thanks!
$ octosql --output json "SELECT reviewerName as name, reviewTime as date, overall/5.0 as score FROM sample.json"
...
Error: couldn't parse query: invalid argument syntax error at position 61 near '.0'
$ octosql --output json "SELECT reviewerName as name, reviewTime as date, overall/5 as score FROM sample.json"
...
Error: typecheck error: unknown variable: 'overall/5'
@dcmoura Hey, looks like a parser issue affecting division (you can totally write e.g. `overall + 5.0`).
I'll look into it. In the meantime, you can use the alternative notation `/(overall, 5.0)`:
octosql --output json "SELECT reviewerName as name, reviewTime as date, /(overall, 5.0) as score FROM sample.json"
Thanks for letting me know!
@cube2222 @Avogar As promised, the benchmark has been updated. Congratulations on the pole position(s) 😄 !
I asked to pin the following comment to the top of the HN Thread: https://news.ycombinator.com/item?id=31111863
This was suggested by the HN admins instead of reposting.
Posted the updated benchmark on Reddit r/programming
Hey, thanks for putting the benchmarks together. One issue I notice is that you don't have an ORDER BY with the LIMIT. You aren't actually comparing like with like in this case. SQL (or at least most implementations of it) does not define any meaningful order by default, and I can see that ClickHouse does in fact produce something different from what spyql produces.
This doesn't change the fact that dsq is pretty slow. But I just wanted to point out that, overall, LIMIT without ORDER BY is not a reasonable benchmark, and in general you may want to make sure that each tool produces the same result when you're benchmarking.
If all tools don't produce the same result, you're still measuring something, but I don't think it's a reasonable something from a user's perspective.
I think it makes sense to include the use case of taking the first n rows/records/etc. of the file. It's fairly intuitive behavior and a common use case (at least in my own usage, the first thing I do with files I'm working on is `head`, or an analogue of it). SQL doesn't define this, but implementations are free to provide their own guarantees. And unlike dsq (or ClickHouse for that matter), SPyQL (afaik) and OctoSQL venture pretty far from the SQL standard in many ways.
That said, as @eatonphil suggests, I would keep the benchmark apples to apples: add a note for the tools that don't support this behavior and don't include them in that benchmark case (and possibly add another case which does the ORDER BY).
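A sketch of how the two cases could sit side by side, using octosql only as an example runner; the column names are illustrative and this assumes ORDER BY/LIMIT support in whichever tool is being tested.

```bash
# Case A: "give me any N rows" -- engines are free to stop scanning early.
octosql --output json "SELECT reviewerName FROM sample.json LIMIT 10"

# Case B: deterministic output across tools, but forces a full scan plus a sort.
octosql --output json "SELECT reviewerName FROM sample.json ORDER BY reviewTime LIMIT 10"
```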
> One issue I notice is that you don't have an ORDER BY with the LIMIT.
Yes, and this was a conscious decision. The goal is simply to understand whether the tool is smart enough to avoid scanning the full dataset when only a sample is required. I know that there are no guarantees about the order of the output in standard SQL if you do not specify an ORDER BY. Still, most (if not all) database engines I know will not do a full table scan; they stop as soon as the LIMIT clause is satisfied. The request is: give me any N rows (and stop as soon as you have them). Adding an ORDER BY would defeat the purpose of this test.
Why is this test important? Because many times we just want to work with a sample, take a quick look, or simply iterate on a query based on a sample. Tools that load all data into memory fail this test and require some kind of pre-processing (e.g. invoking `head` on the command line).
In conclusion, I don't see a reason to change the benchmark. Some tools will take advantage of parallel processing, cached data, process-as-you-go, etc., and that might result in different outputs, but that's OK in my view.
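To make the "stop as soon as you have them" behaviour concrete, here is a minimal streaming sketch (not any tool's actual implementation): the reader consumes only the first N lines of the input and never touches the rest of the file. The field name and input file are placeholders.

```bash
python3 -c '
import sys
import json
from itertools import islice

# consume only the first 10 JSON lines, then stop reading the input entirely
for line in islice(sys.stdin, 10):
    print(json.loads(line).get("reviewerName"))
' < sample.json
```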
Leave your comments for the JSON benchmark here.