duckdb / duckdb_mysql

https://duckdb.org/docs/extensions/mysql
MIT License

Wrong and inconsistent behavior when reading many NULLs #21

Closed · sertainly closed this issue 8 months ago

sertainly commented 8 months ago

What happens?

We have a MySQL database running on AWS RDS Aurora. When I make queries that filter on a DATETIME column, the results are wrong and even inconsistent.

To me it seems that duckdb somehow fails to convert the MySQL DATETIME entries correctly and instead turns them into NULLs, even though there actually are values in there.

Thanks in advance for your help and let me know if/how I can support.

To Reproduce

First of all, my sincere apologies. I have not been able to reproduce the behavior on a local MySQL instance. But the behavior is so strange that I think it warrants an investigation.

Here is what I did so far:

  1. Imports:

```python
import duckdb
import polars as pl

print(f"duckdb version = {duckdb.__version__}")
print(f"polars version = {pl.__version__}")
"""
duckdb version = 0.9.2
polars version = 0.20.3
"""
```


  2. Connect:

```python
duckdb_params: str = get_duckdb_params()
duckdb.execute(f"ATTACH '{duckdb_params}' AS mydb (TYPE mysql, READ_ONLY)")
duckdb.execute("USE mydb.rds_stage_01")
```

We have a table named some_table that looks like the following:

duckdb.sql("DESCRIBE some_table")
"""
┌───────────────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│      column_name      │ column_type │  null   │   key   │ default │ extra │
│        varchar        │   varchar   │ varchar │ varchar │ varchar │ int32 │
├───────────────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ ..................... │ XXXXXXXXX   │ XXX     │ XXXX    │ XXXX    │  XXXX │
│ ..................... │ XXXXXXXXX   │ XXX     │ XXXX    │ XXXX    │  XXXX │
│ ..................... │ XXXXXXXXX   │ XXX     │ XXXX    │ XXXX    │  XXXX │
│ deleted_at            │ TIMESTAMP   │ YES     │ NULL    │ NULL    │  NULL │
│ ..................... │ XXXXXXXXX   │ XXX     │ XXXX    │ XXXX    │  XXXX │
│ ..................... │ XXXXXXXXX   │ XXX     │ XXXX    │ XXXX    │  XXXX │
│ ..................... │ XXXXXXXXX   │ XXX     │ XXXX    │ XXXX    │  XXXX │
│ ..................... │ XXXXXXXXX   │ XXX     │ XXXX    │ XXXX    │  XXXX │
└───────────────────────────────────────────────────────────────────────────┘
"""

In the actual MySQL database, the deleted_at column is of type DATETIME.

Generally, we use the deleted_at column to determine whether an entry is still relevant, so filtering on whether it is NULL or not is important. This is where duckdb seems to get confused.

COUNT(*) indicates that there are zero non-null deleted_at entries in our table:

```python
count_query = """
    SELECT COUNT(*)
    FROM some_table
    WHERE deleted_at IS NOT NULL
"""
print(duckdb.sql(count_query))
"""
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│            0 │
└──────────────┘
"""
```

But SELECT * delivers 14 rows that clearly have non-null entries in the deleted_at column:

```python
select_star_query = """
    SELECT *
    FROM some_table
    WHERE deleted_at IS NOT NULL
"""
print(duckdb.sql(select_star_query))
"""
┌───────┬───┬─────────────────────┐
│  id   │ … │     deleted_at      │
│ int32 │   │      timestamp      │
├───────┼───┼─────────────────────┤
│     6 │ … │ 2021-01-06 15:35:58 │
│    19 │ … │ 2017-11-24 16:14:48 │
│    22 │ … │ 2018-09-12 11:18:29 │
│    32 │ … │ 2018-09-17 16:26:13 │
│    84 │ … │ 2017-12-28 10:31:11 │
│   331 │ … │ 2020-02-14 08:23:12 │
│   356 │ … │ 2018-03-12 07:17:48 │
│   626 │ … │ 2017-12-28 10:33:47 │
│   713 │ … │ 2018-01-15 07:34:29 │
│   716 │ … │ 2018-01-15 07:49:01 │
│   909 │ … │ 2018-01-29 16:10:59 │
│  2282 │ … │ 2018-03-09 10:01:29 │
│  2385 │ … │ 2018-03-22 13:35:31 │
│  2387 │ … │ 2018-03-22 13:35:16 │
├───────┴───┴─────────────────────┤
│ 14 rows                         │
└─────────────────────────────────┘
"""
```

And if I only SELECT deleted_at, the entries are gone again:

```python
select_deleted_at_query = """
    SELECT deleted_at
    FROM some_table
    WHERE deleted_at IS NOT NULL
"""
print(duckdb.sql(select_deleted_at_query))
"""
┌────────────┐
│ deleted_at │
│ timestamp  │
├────────────┤
│   0 rows   │
└────────────┘
"""
```

If we run the same queries with a different MySQL client (I tried polars and MySQL Workbench), we learn that there are, in reality, 113 non-null deleted_at entries in our table, which is something else entirely 😵. (I show polars here, but other MySQL clients give the same result.)

```python
uri: str = get_polars_uri()
count_query = """
    SELECT COUNT(*)
    FROM some_table
    WHERE deleted_at IS NOT NULL
"""
polars_df = pl.read_database_uri(query=count_query, uri=uri)
print(polars_df)
"""
shape: (1, 1)
┌──────────┐
│ COUNT(*) │
│ ---      │
│ i64      │
╞══════════╡
│ 113      │
└──────────┘
"""
```

I also tried to export to parquet directly:

duckdb.execute("USE mydb.rds_stage_01")
duckdb.sql("COPY (SELECT * FROM some_table) TO 'output.parquet' (FORMAT PARQUET)")

With mixed results:

duckdb.sql("USE memory")
from_parquet_query = """
    SELECT COUNT(*)
    FROM read_parquet('output.parquet')
    WHERE deleted_at IS NOT NULL
"""
print(duckdb.sql(from_parquet_query))
"""
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│           14 │
└──────────────┘
"""

```python
polars_df_from_parquet = pl.read_parquet("output.parquet")
len(polars_df_from_parquet.filter(pl.col("deleted_at").is_not_null()))
"""
14
"""
```

OS:

Windows 10 / Linux Ubuntu 22.04.3

MySQL Version:

8.0

DuckDB Version:

0.9.2

DuckDB Client:

Python, CLI

Full Name:

Simon Sommer

Affiliation:

Simpego


Mytherin commented 8 months ago

Thanks for the report! I would like to investigate this, but that is difficult without a reproduction. If you can't make a reproduction with a local MySQL instance, could you perhaps provide the dataset that triggers this on Aurora? This might be dependent on the exact MySQL version or configuration combined with the exact values in the column.

sertainly commented 8 months ago

Thank you for the quick response! Yeah, no, unfortunately I can't share the data, as this is company-owned.

Instead, I'll try to spin up an RDS Aurora database in my private AWS account that mimics the one the error occurs in. Hopefully I can reproduce it and will also be able to share credentials for you to explore yourself.

I'll keep you posted :)

sertainly commented 8 months ago

Okay so I was able to reproduce it.

The steps I took:

  1. I generated a .csv file with problematic data (similar to our company data)
  2. I loaded the data from the .csv into a MySQL table
  3. I ran queries with a WHERE clause on the deleted_at column
  4. As described above, I encountered some strange behavior when comparing DuckDB and other MySQL clients

It's also noteworthy that I tried the steps above on both a local MySQL Server and AWS RDS Aurora, which gave the same results. This means the issue is probably not related to the type of MySQL server.

I attach the file with the data for you to explore: strange_data.csv
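
(The attached file is the authoritative reproduction. Purely for illustration, data of this shape, mostly empty deleted_at fields with only a few real timestamps, could be generated like this:)

```python
import csv
import random
from datetime import datetime, timedelta

# Illustrative generator (not the attached file): many empty deleted_at
# fields with only a sprinkling of real timestamps, matching the
# "many NULLs" pattern described above.
with open("strange_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "deleted_at"])
    for i in range(1, 10_001):
        if random.random() < 0.01:
            ts = datetime(2018, 1, 1) + timedelta(minutes=random.randrange(500_000))
            writer.writerow([i, ts.strftime("%Y-%m-%d %H:%M:%S")])
        else:
            writer.writerow([i, ""])  # empty field, loaded by MySQL as 0
```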

Also, here are the queries I ran to load the csv into my localhost server:

```sql
LOAD DATA LOCAL INFILE 'C:/path/to/strange_data.csv'
INTO TABLE strange_data
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;

-- This was necessary because MySQL loads empty entries as 0 instead of NULL
SET SQL_SAFE_UPDATES = 0;
UPDATE strange_data
SET deleted_at = NULL
WHERE deleted_at = 0;
```
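
(The CREATE TABLE statement isn't shown above. A minimal sketch of a matching definition, assuming only the two columns from the CSV sketch, would be:)

```sql
-- Hypothetical definition; the real table may have more columns.
CREATE TABLE strange_data (
    id INT PRIMARY KEY,
    deleted_at DATETIME NULL
);
```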

Hope this helps :)

Mytherin commented 8 months ago

Thanks for the reproduction! This has been very helpful - I've managed to track it down and have pushed what I believe to be a fix in #23.

sertainly commented 8 months ago

Ooh that was super quick, thanks so much for working on this so promptly 😊

And aah so it was about the large number of NULLs in the data and not about the datetime parsing. I'll try to reflect that in the issue title.

Fingers crossed for the solution 🤞

Mytherin commented 8 months ago

This should be fixed now in the dev build, give it a shot:

```sql
force install mysql from 'http://nightly-extensions.duckdb.org';
```

If it works, let me know and I will release a new version of the extension with the fix.
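
(For reference, verifying from Python might look like this; a sketch, with duckdb_params and the schema name as in the report above:)

```python
import duckdb

# Pull in the nightly build of the extension, then re-run the query
# that previously returned 0.
duckdb.execute("FORCE INSTALL mysql FROM 'http://nightly-extensions.duckdb.org'")
duckdb.execute("LOAD mysql")
duckdb_params = get_duckdb_params()  # project-specific helper, as above
duckdb.execute(f"ATTACH '{duckdb_params}' AS mydb (TYPE mysql, READ_ONLY)")
duckdb.execute("USE mydb.rds_stage_01")
print(duckdb.sql("SELECT COUNT(*) FROM some_table WHERE deleted_at IS NOT NULL"))
# Expected after the fix: 113, matching the other MySQL clients.
```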

sertainly commented 8 months ago

I just tried and it works as expected now. Thanks again for resolving 😊