duckdb / duckdb_mysql

MIT License
45 stars 10 forks source link

Add `workload` param to MySQL connection string #57

Closed ricardonunez-io closed 4 months ago

ricardonunez-io commented 5 months ago

For MySQL users with single-query row limits (i.e. Vitess and/or Planetscale users), the workload session variable allows them to bypass the 100,000 row limit so that DuckDB can properly load tables into memory and/or on disk without manually having to write batch queries to create a table in DuckDB from a table of 100s of thousands or millions of rows.

Mytherin commented 4 months ago

Thanks for the PR!

Perhaps I'm not following entirely, but this seems to just append the workload parameter to the host parameter. Would this not already be possible by connecting using e.g. ATTACH 'mysql:host=www.myhost.com?workload=.. ...' AS mysqldb. Could you expand a bit on why adding support for this specific setting is required/desirable?

ricardonunez-io commented 4 months ago

Hi Mark! Sure thing, currently, the extension looks for a connection string in the format of 'key=value key=value ...', with each key/value pair separated by spaces.

If you try to append session variables to the end of the hostname/database name instead (e.g. ATTACH 'hostname=localhost?workload=olap ...' or ATTACH 'hostname=localhost database=mysql?workload=olap ...'), the key/value pairs are no longer separated strictly by spaces, but rather one of the key/value pairs is attached to another in query-param format, so the current parsing method returns an error as it was expecting a space before the next key=value not a ?. Specifically, it returns Unrecognized configuration parameter "".

Also, for reference, using mysql_execute to run something like SET workload = olap doesn't work either, because this specific session variable is not available in general MySQL (which is how this extension connects), just in the Vitess flavor.

However, I figured adding another connection parameter to the string would be both more simple (does not require a change in connection string-parsing method which is the same, to my knowledge, as the Postgres extension) and self documenting (Vitess/Planetscale users would have the variable visible in the shell).

Otherwise, if it makes more sense, I can submit a PR to change the string parsing method from expecting a series of both space separated key/value pairs and = separated keys and values to splitting the connection string by spaces first, and then again by the first = operator, making the entire second half of the =-split string the connection variable to pass to mysql_real_connect. This will allow for arbitrary session variables to be added, but will make the parsing method separate than that of the Postgres extension, which might be undesirable for other maintainers, not sure.

Also, just saw the tests fail, I'll add a commit to fix that, that was a previous version.

Thank you! And big fan of DuckDB Labs!

Mytherin commented 4 months ago

Thanks - that makes sense. I think reworking the parsing of configuration parameters so that it supports these types of parameters is cleaner. I've recently actually cleaned up the parsing so that it allows quoted values - see https://github.com/duckdb/duckdb_mysql/pull/48. Perhaps that already allows this to be supported?

ricardonunez-io commented 4 months ago

No problem! Ah I didn't see that PR, that does allow this to be supported, yes, but I just tested it out, and it looks like session variables can't be appended to the database name/host this way, it looks like it might have to be done using another mysql C API. (I had vcpkg problems on my machine, even on a freshly cloned repo, and couldn't build it to test locally)

I'll look into it and submit a more general purpose PR for adding arbitrary session variables to the connection string.

Thank you!