cube2222 / octosql

OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
Mozilla Public License 2.0
4.75k stars 201 forks source link

Table scan optimization #269

Open Allam76 opened 2 years ago

Allam76 commented 2 years ago

I noticed that there is no table scan order optimization. That is, tables in joins are scanned by increasing size.

An example from the northwind dataset:

Products have 100 entries. Orders have 10000 entries.

Join products on orders should scan products first and then orders with appropriate filter from found products.

Today the first declared table is scanned first. First come, first served :-)

To make this work the optimizer must know the table size. This requires an addition to the plugin API.

I can take a stab at it but first I would need the above API change and some unit tests to hook up to.

Generally, optimization is probably the main pain point in database design. Previous work can probably be leveraged.

cube2222 commented 2 years ago

Hey!

Thanks for raising the issue! It's been on my radar. In practice, it doesn't matter if you're using a stream join (the default join algorithm), cause those store everything in memory right now anyways. If you're using lookup joins then indeed it does matter, and as you say, currently it just uses the join ordering the user does. It gets even more complicated with 3, 4, 5-way join etc. ordering and there's loads of interesting papers about that.

There's a bunch of metadata I'm planning to add (and also add to the plugin API), including the size, but also whether retractions are possible or not (as that's another big source of potential optimizations).

That said, I'm first concentrating on usability and ergonomics (stuff like the JSON handling you mentioned in the other issue).

I can take a stab at it but first I would need the above API change and some unit tests to hook up to.

I'm glad you're willing to help, however, as is mentioned in the README, I won't be able to accept any contributions for another 6-12 months for a bunch of formal reasons I don't want to get into here.

OctoSQL doesn't accept external contributions to its source code right now, but you can raise issues or develop external plugins for database types you'd like OctoSQL to support. Create a Pull Request to add a plugin to the core plugins repository (which is contained in the plugin_repository.json file).

Allam76 commented 2 years ago

Hello.

I'm first concentrating on usability and ergonomics

Focus on the fundamental stuff. Much of the small stuff can quite easily be delegated to contributors. Many tools, like for example alasql, do simple sql table scans. What makes octosql stand out is the extensibility and possibility of growth to something more powerful.

I won't be able to accept any contributions for another 6-12 months

OK that is important to know. Then we cannot expect any API stability before then and hence better to wait. Keep us posted on the development and let us know if there is anything we can do to help.

Really appreciate your work and I cross my fingers you have time to continue! Cheers