dbeatty10 / dbt-mysql

dbt-mysql contains all of the code enabling dbt to work with MySQL and MariaDB
Apache License 2.0
75 stars 53 forks source link

Clarification: Data Type display on docs serve using mysql adapter #108

Open clementchong opened 2 years ago

clementchong commented 2 years ago

Describe the bug

Apologies, we do not know if this is a bug yet. Hence like to check with the rest of users who may be able to help. After docs generate and serve, we see that the data type are displayed with b'xxxx'. Please see screenshot below:

image

We do see the same in the generated target/catalog.json file. Not sure if this is adapter specific or configuration issue but we searched and could not find a solution.

Appreciate some guidance on this. Thank you!

Steps To Reproduce

In as much detail as possible, please provide steps to reproduce the issue. Sample data that triggers the issue, example model code, etc is all very helpful here.

Expected behavior

A clear and concise description of what you expected to happen.

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

The output of dbt --version: dbt-ver

The operating system you're using:

Ubuntu 22.04.1 LTS

The output of python --version: 3.9.13

Additional context

Add any other context about the problem here.

dbeatty10 commented 2 years ago

@clementchong thanks for opening this issue!

It sounds like target/catalog.json file contains a Python bytes literal rather than a str literal (see here and here for further explanation).

Basically, it has encoded data that needs to be decoded in order to render properly.

Which database and database version are you using (e.g., MySQL 8.0.32)? Do you know what your "Connection Character Set and Collation System Variable" settings are?

character_set_results is "the character set used for returning query results to the client. This includes result data such as column values, result metadata such as column names, and error messages." I'm wondering if your setting might explain the bytes encoding like b'varchar(8)', etc.

clementchong commented 2 years ago

@dbeatty10 Thank you for the quick and detailed overview. Our MySQL version is 8.0.25. The variable values are as below:

SmartSelect_20220831_133825_Gallery

Do you see anything wrong? Or could someone share their versions if not facing such display issue?

clementchong commented 2 years ago

I found the below discussion, would try out.

https://stackoverflow.com/questions/40921341/python-mysql-connector-returns-bytearray-instead-of-regular-string-value

clementchong commented 2 years ago

I also found the information_schema.columns table having binary collation for the data_type field. However, I am unable to change it.

image

moszutij commented 1 year ago

I also ran into this issue and it appears to be targeted at MySQL 8.0. The data_type column's collation in information_schema.columns is binary in this version, while it is non-binary in MySQL 5.7 and MariaDB 10.5. When a column is defined with binary collation, the Python MySQL connector returns the column value as a byte.

The collation settings for each database versions are as follows:

DATABASE DATA_TYPE CHARACTER_SET_NAME COLLATION_NAME
mysql:5.7 varchar utf8 utf8_general_ci
mysql:8.0 longtext utf8mb3 utf8mb3_bin
mariadb:10.5 varchar utf8 utf8_general_ci

The different collation settings for each database version can be observed by running the following commands and changing the DOCKER_IMAGE environment variables:

$ DOCKER_IMAGE=mysql:8.0; export DOCKER_IMAGE
$ read -d '' QUERY_TEXT << EOF
select column_name, 
       data_type,
       character_set_name,
       collation_name
  from information_schema.columns 
 where table_schema = 'information_schema' 
   and table_name = 'COLUMNS' 
   and column_name = 'DATA_TYPE';
EOF

$ docker run --name col-data-type-collation-check -e MYSQL_ALLOW_EMPTY_PASSWORD=yes -d ${DOCKER_IMAGE}

# Wait for database to start up then:
$ docker exec col-data-type-collation-check mysql -e "${QUERY_TEXT}"
COLUMN_NAME DATA_TYPE   CHARACTER_SET_NAME  COLLATION_NAME
DATA_TYPE   longtext    utf8mb3                 utf8mb3_bin

# Tidy up
$ docker stop col-data-type-collation-check && docker rm col-data-type-collation-check && docker image rm ${DOCKER_IMAGE}

A proposed fix is to remove the b character prefix from dtype in the MySQLAdapter parse_show_columns function as follows:

import re
...
MySQLColumn(
    table_database=None,
    table_schema=relation.schema,
    table_name=relation.name,
    table_type=relation.type,
    table_owner=None,
    table_stats=None,
    column=column.column,
    column_index=idx,
    # dtype=column.dtype,
    dtype = re.sub(r"^b'|'$", "", column.dtype)
 )
clementchong commented 1 year ago

Dear Jeff

Thank you for the quick explanation and proposed fix. It is very helpful!

Regards Clement

On Sun, May 7, 2023, 00:32 Jeff Moszuti @.***> wrote:

I also ran into this issue and it appears to be targeted at MySQL 8.0. The data_type column's collation in information_schema.columns is binary in this version, while it is non-binary in MySQL 5.7 and MariaDB 10.5. When a column is defined with binary collation, the Python MySQL connector returns the column value as a byte.

The collation settings for each database versions are as follows: DATABASE DATA_TYPE CHARACTER_SET_NAME COLLATION_NAME mysql:5.7 varchar utf8 utf8_generalci mysql:8.0 longtext utf8mb3 utf8mb3bin mariadb:10.5 varchar utf8 utf8_general_ci

The different collation settings for each database version can be observed by running the following commands and changing the DOCKER_IMAGE environment variables:

$ DOCKER_IMAGE=mysql:8.0; export DOCKER_IMAGE $ read -d '' QUERY_TEXT << EOF select column_name, data_type, character_set_name, collation_name from information_schema.columns where table_schema = 'information_schema' and table_name = 'COLUMNS' and column_name = 'DATA_TYPE'; EOF

$ docker run --name col-data-type-collation-check -e MYSQL_ALLOW_EMPTY_PASSWORD=yes -d ${DOCKER_IMAGE}

Wait for database to start up then:

$ docker exec col-data-type-collation-check mysql -e "${QUERY_TEXT}" COLUMN_NAME DATA_TYPE CHARACTER_SET_NAME COLLATION_NAME DATA_TYPE longtext utf8mb3 utf8mb3_bin

Tidy up

$ docker stop col-data-type-collation-check && docker rm col-data-type-collation-check && docker image rm ${DOCKER_IMAGE}

A proposed fix is to remove the b character prefix from dtype in the MySQLAdapter parse_show_columns https://github.com/dbeatty10/dbt-mysql/blob/main/dbt/adapters/mysql/impl.py#L97-L113 function as follows:

import re ... MySQLColumn( table_database=None, table_schema=relation.schema, table_name=relation.name, table_type=relation.type, table_owner=None, table_stats=None, column=column.column, column_index=idx,

dtype=column.dtype,

dtype = re.sub(r"^b'|'$", "", column.dtype)

)

— Reply to this email directly, view it on GitHub https://github.com/dbeatty10/dbt-mysql/issues/108#issuecomment-1537177062, or unsubscribe https://github.com/notifications/unsubscribe-auth/AX6SX27QPBLV5PBDRFYB723XEZ4JZANCNFSM6AAAAAAQAILRU4 . You are receiving this because you were mentioned.Message ID: @.***>