Questions about bootstrapping a db and mysql interface

anentropic commented 1 week ago

Super interested in the project, looks like exactly what I need!

I wanted to test it out locally by bootstrapping the db from a bunch of parquet files I downloaded from an RDS snapshot (I had previously tested out raw duckdb by the same method, it's easy to load these files)

I connected a volume to the /home/admin/data dir and ran the container

Locally on my host machine I was able to see the mysql.db file coming from the volume and load a parquet table into it, via Python duckdb.

If I docker exec into the container I can see the mysql.db file has the same size (0.5 GB) when viewed from inside and outside. From Python side I can connect to the file and query the data I loaded.

When I connect to the container via mysql cli client I do not get expected results.

mysql> show tables;
ERROR 1105 (HY000): no database selected

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
3 rows in set (0.04 sec)

mysql> use mysql;
ERROR 1105 (HY000): unknown error: database not found: mysql

mysql> SELECT DATABASE();
+------------+
| DATABASE() |
+------------+
| mysql      |
+------------+
1 row in set (0.05 sec)

mysql> use `mysql`;
ERROR 1105 (HY000): unknown error: database not found: mysql

mysql> show tables;
+-------------------+
| Tables_in_mysql   |
+-------------------+
| db                |
| help_category     |
| help_keyword      |
| help_relation     |
| help_topic        |
| procs_priv        |
| role_edges        |
| slave_master_info |
| tables_priv       |
| user              |
+-------------------+
10 rows in set (0.01 sec)

So the use db command behaves a bit strange. But also I can't find any of the data I loaded via Python.

In [14]: conn.execute("select current_schema()").fetchall()
Out[14]: [('main',)]

So I've created a table in the main schema as far as duckdb is concerned, but neither of these things seems to exist . I also tried creating a new schema from duckdb, but it is not visible in mysql client either.

Coming at things from the other side, I did a create database from mysql client... but how do I then select that as a catalog/schema to work with from duckdb side?

(I realise I have bypassed the intended way of using the system, but if there's any pointers you can give in this direction it would help me to evaluate it... I just want to bootstrap it via duckdb interface and then query it via mysql)

TianyuZhang1214 commented 1 week ago

Thank you for considering this feature. We believe it is highly valuable and essential. We’ll begin preparing the necessary code and documentation right away, so please stay tuned for updates.

anentropic commented 1 week ago

Awesome, thanks!

I found that using psql client instead allows me to see the table I previously created via Python+duckdb:

mysql=> \d
                      List of relations
       Schema       |        Name         | Type  |  Owner
--------------------+---------------------+-------+----------
 main               | binlog_position     | table | postgres
 performance_schema | global_status       | table | postgres
 main               | persistent_variable | table | postgres
 main               | duck_table          | table | postgres
(4 rows)

and the main schema here matches what I saw on Python+duckdb side.

Furthermore, if I create a new schema and table via psql I can see it as a 'database' in mysql, I can then use myschema and query the table.

However this doesn't translate through to duckdb interface:

In [1]: import duckdb

In [2]: conn = duckdb.connect("docker/duckdb/mysql.db")

In [4]: conn.execute("select current_catalog()").fetchall()
Out[4]: [('mysql',)]

In [6]: conn.execute("select current_schemas(true)").fetchall()
Out[6]: [(['main', 'main', 'main', 'pg_catalog'],)]

In [7]: conn.execute("select current_schema()").fetchall()
Out[7]: [('main',)]

In [9]: conn.execute("select current_schemas(false)").fetchall()
Out[9]: [([],)]

Additionally, in mysql I can use main (even though it is not listed if I query show databases;) and create a table that is visible to psql:

mysql> use main;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> create table dummy2 (id int);
Query OK, 0 rows affected (0.06 sec)

mysql> insert into dummy2 (id) values (123);
Query OK, 1 row affected (0.07 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| myschema           |
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
4 rows in set (0.02 sec)

mysql=> \d
                      List of relations
       Schema       |        Name         | Type  |  Owner
--------------------+---------------------+-------+----------
 main               | binlog_position     | table | postgres
 myschema           | dummy               | table | postgres
 main               | dummy2              | table | postgres
 performance_schema | global_status       | table | postgres
 main               | persistent_variable | table | postgres
(5 rows)

However this still isn't visible to duckdb:

In [29]: conn = duckdb.connect("docker/duckdb/mysql.db")

In [30]: conn.execute("show tables").fetchall()
Out[30]: []

In [31]: conn.execute("select current_schema()").fetchall()
Out[31]: [('main',)]

So currently I can:

add data via Python+duckdb that is visible to Postgres but not MySQL
add data via MySQL that is visible to Postgres but not Python+duckdb
add data via Postgres that is visible to MySQL but not Python+duckdb

TianyuZhang1214 commented 1 week ago

Thank you for your thoughtful reply!

Additionally, in mysql I can use main (even though it is not listed if I query show databases;) and create a table that is visible to psql:

The main database in DuckDB currently has a visibility issue when accessed through a MySQL client. I’m actively working on a fix and will update you here once it’s resolved.

add data via Python+duckdb that is visible to Postgres but not MySQL add data via MySQL that is visible to Postgres but not Python+duckdb add data via Postgres that is visible to MySQL but not Python+duckdb

These test cases are very thorough. We’ll add them to CI testing soon. Your detailed feedback is much appreciated!

anentropic commented 1 week ago

I found a way to load Parquet data that is visible to MySQL:

mount the dir containing Parquet files as a volume to myduckserver container
using psql make a query like CREATE TABLE myschema.duck_table AS SELECT * FROM 'snapshots/*.parquet' using the local path to the mounted volume within the container (i.e. query is telling the server where to load the files from, rather than sending the files from the host machine)
- this works despite Parquet loading is a DuckDB feature not supported by Postgres 👍

Loaded data is not visible to Python+duckdb, as noted previously, but using this method I can bootstrap the myduckserver and then query it via MySQL client as intended.

TianyuZhang1214 commented 1 week ago

@anentropic We’ve addressed some visible issues with the main database and added a README in PR #154 to assist with your requirements. Please give it a try and let us know if it meets your expectations. Thank you!

apecloud / myduckserver

Questions about bootstrapping a db and mysql interface #150