cube2222 / octosql

OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
Mozilla Public License 2.0

Would love some input regarding a benchmark I'm doing #250

Closed harelba closed 3 years ago

harelba commented 3 years ago

Hi!

I'm the creator of q, a tool that has some similar behaviour to part of octosql.

I've been running a benchmark on q's speed, and also added octosql (and textql, actually) to the benchmark, mainly out of curiosity.

From my benchmark results, it seems that octosql has relatively long execution times compared with the other tools.

I would really appreciate it if you could take a look at the benchmark and tell me whether the way I'm running octosql is wrong, or whether there's some configuration I'm missing that might prevent it from reaching its maximum potential speed. If you think something is wrong with the testing method itself, I would be glad to know as well.

Here's an example config file I'm generating during the benchmark:

dataSources:
  - name: bmdata
    type: csv
    config:
      path: "./_benchmark_data/_benchmark_data__lines_1000000_columns_100.csv"
      headerRow: false
      batchSize: 10000

I'm currently rerunning it with a batch size of 100,000 to see whether it makes a significant difference, and will update here when the results are ready.
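For anyone who wants to reproduce the batch-size comparison, here is a minimal sketch of how one config file per batch size could be generated. It is not the benchmark's actual script: the helper names `render_config` and `write_configs` and the output file naming are hypothetical, and only the YAML fields shown in the example config above are assumed.

```python
from pathlib import Path

# Template mirroring the example config above; only path and batchSize vary.
CONFIG_TEMPLATE = """\
dataSources:
  - name: bmdata
    type: csv
    config:
      path: "{path}"
      headerRow: false
      batchSize: {batch_size}
"""


def render_config(data_path, batch_size):
    """Return the config text for one data file and one batch size."""
    return CONFIG_TEMPLATE.format(path=data_path, batch_size=batch_size)


def write_configs(data_path, batch_sizes, out_dir):
    """Write one config file per batch size and return the written paths.

    The file naming scheme here is hypothetical.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for size in batch_sizes:
        target = out / f"octosql_batch_{size}.yaml"
        target.write_text(render_config(data_path, size))
        written.append(target)
    return written
```

Calling `write_configs(data_path, [10000, 100000], "configs")` would then produce one file for each of the two batch sizes mentioned here.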

The command that the benchmark runs is as follows:

octosql -c <config-file> -o batch-csv "select count(*) from bmdata a"
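To make the measurement itself concrete, here is a rough sketch of how a command like this could be timed over several runs. It is an assumption about the general approach, not the benchmark's real harness: `build_command` and `time_command` are hypothetical helpers, and the only octosql flags used are the ones in the command above.

```python
import subprocess
import time


def build_command(config_file, query="select count(*) from bmdata a"):
    """Build the argument list for the octosql invocation shown above."""
    return ["octosql", "-c", config_file, "-o", "batch-csv", query]


def time_command(cmd, repeats=3):
    """Run cmd `repeats` times and return each wall-clock duration in seconds.

    Output is discarded so that printing results does not affect the timing
    more than necessary; a non-zero exit status raises CalledProcessError.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations
```

With octosql installed, `time_command(build_command("config.yaml"))` would return one duration per run; the config file name here is a placeholder. Wall-clock timing of the whole process includes startup cost, which may matter when comparing tools on small inputs.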

The current benchmark results with more details are here.

Btw, I saw that you're right in the middle of an almost-complete rewrite, so I obviously wouldn't mind if you'd prefer that octosql be added to the benchmark only in the future.

Thanks! octosql looks like a great tool.

Harel

cube2222 commented 3 years ago

Hey! Feel free to add OctoSQL to the benchmark! It all seems correct. We're aware that our performance is currently very lacking; that's caused by the storage, which does add other features in exchange. It's true we'll probably be going back on that choice, and whenever performance improves I'll just make a PR to your repo with new benchmark results 🙂

Cheers, Jacob

harelba commented 3 years ago

That would be awesome @cube2222

Thanks a lot for checking!