duckdb / duckdb_mysql

https://duckdb.org/docs/extensions/mysql
MIT License

Add `mysql_max_threads` extension option #9

Closed · Kayrnt closed 1 year ago

Kayrnt commented 1 year ago

To support multithreading in the extension, it's hard to find a "magic number", so letting the user pick the thread count felt like a first step in the right direction.
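A sketch of how the proposed option would be used, assuming it is exposed as a regular DuckDB setting (the connection string is a placeholder):

```sql
-- Hypothetical usage of the option this PR proposes: cap the number of
-- threads the MySQL extension may use.
ATTACH 'host=localhost user=root port=3306 database=mydb' AS mydb (TYPE mysql);
SET mysql_max_threads = 4;
```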

@Mytherin, this doesn't fix parallelism across multiple attached DBs, so I'm digging into `parallelSource`, but I'm not sure yet whether that's the blocker.

Mytherin commented 1 year ago

Thanks for the PR!

This isn't quite the right approach, however. `MaxThreads` in this case refers exclusively to the threads used within a single scan. That has no effect here, since a single scan always consists of running only a single query. In the Postgres extension we partition single scans using ctids, but MySQL has no similar support. We could try partitioning based on primary keys, but that would be more involved and would likely require a lot more user configuration.
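For context, a rough illustration of the difference (not the extension's actual generated SQL; the ranges and the key column are hypothetical):

```sql
-- Postgres: one logical scan can be split into independent ctid (page)
-- ranges, each fetched by a separate thread.
SELECT * FROM my_table WHERE ctid BETWEEN '(0,0)' AND '(1000,0)';
SELECT * FROM my_table WHERE ctid BETWEEN '(1000,0)' AND '(2000,0)';

-- MySQL has no ctid equivalent; a primary-key split would require knowing
-- the key and its distribution up front.
SELECT * FROM my_table WHERE id >= 1      AND id < 500000;
SELECT * FROM my_table WHERE id >= 500000 AND id < 1000000;
```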

Multiple scans are always executed separately and in parallel and do not need this setting.

Kayrnt commented 1 year ago

Hmm, ok, I thought it was going to allow multiple threads within the extension and therefore maybe enable multiple scans in parallel.

> Multiple scans are always executed separately and in parallel and do not need this setting.

I tried, on one DB, a self-join for testing purposes on a ~1M-row table, and the scan is requested twice and done sequentially. The request looks like:

```sql
SELECT
  t1.field1,
  t1.field2,
  t1.field3
FROM
  mydb.my_table t1
  JOIN mydb.my_table t2
    ON t1.field3 = t2.field3
   AND t1.field2 = t2.field2
   AND t1.field1 = t2.field1
GROUP BY 1, 2, 3
ORDER BY 1
LIMIT 100;
```

I could see that the underlying query:

```sql
SELECT `field1`, `field2`, `field3` FROM `mydb`.`my_table`
```

is executed twice, sequentially, and that is the part taking most of the time (the full query takes 26s, while the DuckDB part is ~2s according to the profiling).

Am I missing something, or is there a limitation preventing the two SELECTs from running concurrently (besides the fact that we could expect to "cache" the result of the first request for the second, though that's likely an edge case in production)?
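For reference, one way to get this kind of per-operator timing breakdown in DuckDB is `EXPLAIN ANALYZE` (column names here match the example above):

```sql
-- EXPLAIN ANALYZE prints the executed plan with per-operator timings,
-- which makes it visible how long each MySQL scan took.
EXPLAIN ANALYZE
SELECT t1.field1, t1.field2, t1.field3
FROM mydb.my_table t1
JOIN mydb.my_table t2 ON t1.field1 = t2.field1
GROUP BY 1, 2, 3;
```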

Mytherin commented 1 year ago

That has to do with dependencies between components of a query. In the case of a self-join there is a dependency within the hash join: probing can only start after the build has completed. If you want two scans to run concurrently, there must be no dependencies between them; for example, if you have multiple joins, all the builds can run in parallel.
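A sketch of the two shapes (table and column names are placeholders): in the first query the two scans of the same table feed the build and probe sides of one hash join, so they are ordered; in the second, the two build sides are independent and their scans can overlap.

```sql
-- Dependent: the probe-side scan of t1 can only start after the
-- build-side scan of t2 has completed.
SELECT *
FROM mydb.my_table t1
JOIN mydb.my_table t2 ON t1.id = t2.id;

-- Independent: the builds over t2 and t3 have no dependency on each
-- other, so their underlying MySQL scans can run concurrently.
SELECT *
FROM mydb.t1
JOIN mydb.t2 ON t1.id = t2.id
JOIN mydb.t3 ON t1.id = t3.id;
```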

Kayrnt commented 1 year ago

So you mean that DuckDB is capable of figuring out it's a self-join, even though the extension is actually reading the table twice and the profiling shows the table at the same level? I'll do more tests then, to check whether it works well with 2 different tables on my end.

Kayrnt commented 1 year ago

I tested by adding a log when a scan completes, and you're indeed right: it's working as intended. Even in the self-join case, I mistakenly thought it was sequential because it was slow, but it's just that it's "slow" to fetch the data twice. Let's close this PR until we can figure out a way to parallelize a scan (though it's not as clear-cut as for Postgres).