duckdb / duckdb_mysql

Optimize performance for `COPY (SELECT * FROM che_test limit 1) TO 'che_test_1.parquet';` #29

Closed BleethNie closed 7 months ago

BleethNie commented 7 months ago

What happens?

I want to export data from the che_test table to Parquet. The table has 8 million+ rows, but my export statement is COPY (SELECT * FROM che_test LIMIT 1) TO 'che_test_1.parquet', which should run very fast; in practice it takes 16s to execute, so this needs to be optimized.

My production environment holds 30 million+ rows of MySQL data. I want to export one Parquet file per day, filtered by business date, at roughly 2 million rows per day. Exporting that data requires about 10 GB of memory and is very slow. Also, after exporting a table with 20 million+ rows, I found that the total row count does not exactly match the database and the value of the last field is missing, but I can't be sure this is a bug.
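
For reference, a minimal sketch of such a daily export against the attached MySQL database (the business_date column name and the date value are hypothetical placeholders, not taken from the report):

import duckdb

con = duckdb.connect()
con.execute("LOAD mysql;")  # assumes INSTALL mysql; has been run once
con.execute("ATTACH 'host=localhost user=root password=123456 port=3306 database=che' AS mysqldb (TYPE mysql);")
# Export a single day's rows; business_date is a hypothetical column name.
con.execute("""
COPY (
    SELECT * FROM mysqldb.che_test
    WHERE business_date = DATE '2024-01-01'
) TO 'che_test_2024-01-01.parquet' (FORMAT PARQUET);
""")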

To Reproduce

Python code:

import duckdb

# Assumes the MySQL extension has already been installed once (INSTALL mysql;)
sql = '''
LOAD mysql;
ATTACH 'host=localhost user=root password=123456 port=3306 database=che' AS mysqldb (TYPE mysql);
USE mysqldb;
COPY (SELECT * FROM che_test LIMIT 1) TO 'che_test_1.parquet';
'''

# Runs all statements in order; the COPY is the slow step (~16s reported)
duck_df = duckdb.sql(sql)
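
To isolate where the time goes, an equivalent sketch that opens an explicit connection and runs one statement per execute() call, timing only the COPY (assuming the same local credentials):

import duckdb
import time

con = duckdb.connect()
con.execute("LOAD mysql;")
con.execute("ATTACH 'host=localhost user=root password=123456 port=3306 database=che' AS mysqldb (TYPE mysql);")
con.execute("USE mysqldb;")

start = time.perf_counter()
con.execute("COPY (SELECT * FROM che_test LIMIT 1) TO 'che_test_1.parquet';")
print(f"COPY took {time.perf_counter() - start:.1f}s")  # ~16s reported before the fix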

OS: Windows 10
MySQL Version: 8.0.22
DuckDB Version: 0.9.2
DuckDB Client: Python
Full Name: Bleeth
Affiliation: developer

Mytherin commented 7 months ago

Thanks for the report! I've pushed a fix in #30 that enables pushing LIMIT/OFFSET into MySQL scans, which should resolve the performance issue for the provided query.
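
After upgrading to a build that includes #30, one way to confirm that the LIMIT is pushed into the MySQL scan is to inspect the query plan (a sketch; the exact EXPLAIN output depends on the duckdb_mysql version):

import duckdb

con = duckdb.connect()
con.execute("LOAD mysql;")
con.execute("ATTACH 'host=localhost user=root password=123456 port=3306 database=che' AS mysqldb (TYPE mysql);")
# With pushdown enabled, the LIMIT should be applied inside the MySQL scan
# rather than after transferring the whole table to DuckDB.
print(con.sql("EXPLAIN SELECT * FROM mysqldb.che_test LIMIT 1;"))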