dbt-labs / dbt-athena

The athena adapter plugin for dbt (https://getdbt.com)
https://dbt-athena.github.io
Apache License 2.0

[Bug] dbt seed in dbt-athena-community does not work for UTF-8 encoded .csv files #734

Open juliodias20 opened 1 month ago

juliodias20 commented 1 month ago

Is this a new bug in dbt-core?

Current Behavior

I would like to start by apologizing if there is already a bug report on this subject, but I couldn't find it.

I have a .csv file (sds_teste.csv) that I am using as a seed, like this: [screenshot of the CSV contents; per the log below, it has the columns id and some_text]

So I execute the command dbt seed -s sds_teste and dbt runs successfully: [screenshot of the successful run]

But when I run a SELECT against the table created by the dbt seed command, I can see that the special characters (accented letters) were not read correctly:

[screenshot of the query result, with � in place of the accented letters]

I have already tried a few things I found around the internet, like passing encoding: utf-8, but nothing has worked.

My profiles.yml: [screenshot]

Expected Behavior

The expected behavior is that dbt seed can read a .csv file encoded as UTF-8.

Should be: A text with special characters, like Ã, Á, Í, or Ç
Instead of: A text with special characters, like �, �, �, or �
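
That exact � pattern is what you get when bytes written with the Windows ANSI code page are later read back as UTF-8. A minimal Python sketch of the suspected mechanism, assuming cp1252 is the active code page:

text = "Ã Á Í Ç"

# bytes encoded with the Windows ANSI code page (cp1252)...
raw = text.encode("cp1252")

# ...decode to U+FFFD replacement characters when read back as UTF-8
print(raw.decode("utf-8", errors="replace"))  # prints '� � � �'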

Steps To Reproduce

1 - Install Python 3.11.9 on a Windows computer
2 - Create a Python environment with python venv
3 - Install dbt-core==1.8.7 and dbt-athena-community==1.8.4 via pip install
4 - Create a dbt project
5 - Create a .csv file in the seeds/ folder and write some example rows with special characters
6 - Configure profiles.yml to connect to AWS Athena (storage: AWS S3)
7 - Run the dbt seed command

Relevant log output

============================== 13:38:20.928977 | c0ac6503-3146-44d1-84fa-552929776d79 ==============================
13:38:20.928977 [info ] [MainThread]: Running with dbt=1.8.7
13:38:20.929979 [debug] [MainThread]: running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'write_json': 'True', 'log_cache_events': 'False', 'partial_parse': 'True', 'cache_selected_only': 'False', 'warn_error': 'None', 'fail_fast': 'False', 'profiles_dir': 'C:\\Users\\julio.padoveze\\.dbt', 'log_path': 'C:\\Users\\julio.padoveze\\github\\prd-Copastur-dbt-source-qv\\logs', 'version_check': 'True', 'debug': 'False', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'empty': 'None', 'log_format': 'default', 'introspect': 'True', 'invocation_command': 'dbt seed -s sds_teste', 'static_parser': 'True', 'target_path': 'None', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'send_anonymous_usage_stats': 'True'}
13:38:21.527840 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'project_id', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020521A27B10>]}
13:38:21.561425 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'adapter_info', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x000002051CDCDE10>]}
13:38:21.563666 [info ] [MainThread]: Registered adapter: athena=1.8.4
13:38:21.575796 [debug] [MainThread]: checksum: 4af21dafb485259c48497ac86b711ddb1982f3d0f1c0ca4e09356de488b753c0, vars: {}, profile: , target: , version: 1.8.7
13:38:21.689602 [info ] [MainThread]: Unable to do partial parsing because of a version mismatch
13:38:21.690599 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'partial_parser', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020521C79010>]}
13:38:22.937805 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'load_project', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020522118250>]}
13:38:23.152658 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'resource_counts', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020522129ED0>]}
13:38:23.152658 [info ] [MainThread]: Found 17 models, 9 seeds, 96 sources, 568 macros
13:38:23.153659 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020521F006D0>]}
13:38:23.155683 [info ] [MainThread]: 
13:38:23.157687 [debug] [MainThread]: Acquiring new athena connection 'master'
13:38:23.160683 [debug] [ThreadPool]: Acquiring new athena connection 'list_awsdatacatalog'
13:38:23.161689 [debug] [ThreadPool]: Opening a new connection, currently in state init
13:38:24.834482 [debug] [ThreadPool]: On list_awsdatacatalog: Close
13:38:24.840486 [debug] [ThreadPool]: Acquiring new athena connection 'list_awsdatacatalog_db_dbt_validation'
13:38:24.840486 [debug] [ThreadPool]: Opening a new connection, currently in state init
13:38:29.476313 [debug] [ThreadPool]: On list_awsdatacatalog_db_dbt_validation: Close
13:38:29.495682 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'c0ac6503-3146-44d1-84fa-552929776d79', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000020521B74F10>]}
13:38:29.496682 [info ] [MainThread]: Concurrency: 10 threads (target='dev')
13:38:29.498679 [info ] [MainThread]: 
13:38:29.512664 [debug] [Thread-1 (]: Began running node seed.qv.sds_teste
13:38:29.514172 [info ] [Thread-1 (]: 1 of 1 START seed file db_dbt_validation.db_qvextracao__sds_teste .............. [RUN]
13:38:29.516618 [debug] [Thread-1 (]: Acquiring new athena connection 'seed.qv.sds_teste'
13:38:29.518105 [debug] [Thread-1 (]: Began compiling node seed.qv.sds_teste
13:38:29.519110 [debug] [Thread-1 (]: Began executing node seed.qv.sds_teste
13:38:29.553617 [debug] [Thread-1 (]: seed by upload...
13:38:29.566419 [debug] [Thread-1 (]: Opening a new connection, currently in state init
13:38:32.830681 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exists - Ignoring
13:38:32.832678 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exist - Ignoring
13:38:32.834710 [debug] [Thread-1 (]: Dropping relation via Glue and S3 APIs
13:38:34.321827 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exists - Ignoring
13:38:34.323331 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exist - Ignoring
13:38:35.891369 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" does not exist and will not be deleted, ignoring
13:38:35.893875 [debug] [Thread-1 (]: Using athena connection "seed.qv.sds_teste"
13:38:35.893875 [debug] [Thread-1 (]: On seed.qv.sds_teste: -- /* {"app": "dbt", "dbt_version": "1.8.7", "profile_name": "qv", "target_name": "dev", "node_id": "seed.qv.sds_teste"} */
create external table `db_dbt_validation`.`db_qvextracao__sds_teste__dbt_tmp` (`id` string, `some_text` string)
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    location 's3://prd-lakehouse-copastur-analytics/analytics/env=prod/top=copastur/domain=analytics/dbt/db_dbt_validation/db_qvextracao__sds_teste__dbt_tmp'
    tblproperties (
      'skip.header.line.count'='1'
    )

13:38:36.637497 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Athena query ID 87dea510-7789-40ce-849c-4d2a4d6428bf
13:38:38.371586 [debug] [Thread-1 (]: SQL status: OK -1 in 2.477 seconds
13:38:38.386232 [debug] [Thread-1 (]: Skip partitioning: False
13:38:38.387230 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: get_work_group for %s
13:38:39.758786 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: S3 path does not exist
13:38:39.759788 [debug] [Thread-1 (]: Using athena connection "seed.qv.sds_teste"
13:38:39.759788 [debug] [Thread-1 (]: On seed.qv.sds_teste: -- /* {"app": "dbt", "dbt_version": "1.8.7", "profile_name": "qv", "target_name": "dev", "node_id": "seed.qv.sds_teste"} */
create table "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste"
    with (
      table_type='hive',
      is_external=true,
      format='parquet'
    )
    as

    select

              cast(nullif("id", '') as integer) as "id"
            , 
              cast(nullif("some_text", '') as varchar) as "some_text"

    from
        "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp"

13:38:39.966615 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Athena query ID 3ff995b0-dbb9-42b2-ab05-90844aeb9925
13:38:42.972477 [debug] [Thread-1 (]: SQL status: OK 2 in 3.212 seconds
13:38:44.460039 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: table_name : 490445445844.db_dbt_validation.db_qvextracao__sds_teste__dbt_tmp
13:38:44.462029 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: table type : TableType.TABLE
13:38:44.462029 [debug] [Thread-1 (]: Dropping relation via Glue and S3 APIs
13:38:45.992330 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: table_name : 490445445844.db_dbt_validation.db_qvextracao__sds_teste__dbt_tmp
13:38:45.993334 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: table type : TableType.TABLE
13:38:45.994409 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp" is stored in s3://prd-lakehouse-copastur-analytics/analytics/env=prod/top=copastur/domain=analytics/dbt/db_dbt_validation/db_qvextracao__sds_teste__dbt_tmp
13:38:46.756532 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Deleting table data: path='s3://prd-lakehouse-copastur-analytics/analytics/env=prod/top=copastur/domain=analytics/dbt/db_dbt_validation/db_qvextracao__sds_teste__dbt_tmp', bucket='prd-lakehouse-copastur-analytics', prefix='analytics/env=prod/top=copastur/domain=analytics/dbt/db_dbt_validation/db_qvextracao__sds_teste__dbt_tmp/'
13:38:49.383198 [debug] [Thread-1 (]: dbt.adapters.athena.constants adapter: Deleted table from glue catalog: "awsdatacatalog"."db_dbt_validation"."db_qvextracao__sds_teste__dbt_tmp"
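
(For context, the log shows how dbt-athena runs seeds: the CSV is uploaded to S3, exposed through a temporary external table using OpenCSVSerde, and then rewritten to Parquet with a CTAS. Neither statement declares a file encoding, and as far as I can tell the CSV SerDe assumes UTF-8, so a file uploaded in any other encoding is already corrupted by the time the CTAS runs.)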

Environment

- OS: Windows 11
- Python: 3.11.9
- dbt-core: 1.8.7
- dbt-athena-community: 1.8.4

Which database adapter are you using with dbt?

other (mention it in "Additional Context")

Additional Context

No response

dbeatty10 commented 1 month ago

Thanks for reaching out @juliodias20 !

Special characters worked for me when I tried with dbt-duckdb, so this might be specific to the dbt-athena-community adapter rather than dbt-core. So I'm going to transfer this issue to that repository instead.

Example

See below for my output when using dbt-duckdb.

Create this file:

seeds/my_seed.csv

id,some_text
1,ABC
2,Ã Á Í Ç

Run these commands:

dbt seed
dbt show --inline 'select * from {{ ref("my_seed") }}'

See this output:

| id | some_text |
| -- | --------- |
|  1 | ABC       |
|  2 | Ã Á Í Ç   |
juliodias20 commented 1 month ago

Thanks for the collaboration @dbeatty10 !

Now that you have said this, I tested the same case with the Databricks adapter and it works correctly! It really does sound like a problem with the Athena adapter.

e-quili commented 1 month ago

Hello,

It works on my side with dbt-athena on Windows. Could you please try adding this setting to .vscode/settings.json and opening a new terminal in VS Code?

{
    "terminal.integrated.env.windows": {
        "PYTHONUTF8": "1"
    }
}
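
For context on why this helps (my reading of it, not something from the dbt docs): PYTHONUTF8=1 enables Python's UTF-8 mode, which makes open() default to UTF-8 instead of the Windows ANSI code page. Both are easy to inspect:

import locale
import sys

# the default encoding open() uses when none is passed;
# on Windows without UTF-8 mode this is typically 'cp1252'
print(locale.getpreferredencoding(False))

# 1 when UTF-8 mode is active (e.g. via PYTHONUTF8=1)
print(sys.flags.utf8_mode)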
juliodias20 commented 1 month ago

Hello @e-quili, thank you so much for the collaboration! I tested this solution and it works!!

I will use this in my local environment for development, but I still think there is a bug, since other adapters can detect the encoding of the .csv file. What do you think?

CommonCrisis commented 3 weeks ago

I have a similar problem: I cannot even upload to S3 a csv file that contains these words:

Sedlišćo pódzajtšo, Dolna Łužyca, etc.

This will not work with the dbt Athena adapter even though the letters are UTF-8. I checked all the possibilities of misconfigured encodings with:

import os, sys

print(os.environ.get("PYTHONIOENCODING"))
print(sys.getfilesystemencoding())
print(sys.getdefaultencoding())

And they are all set to UTF-8.
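
(Worth noting: none of those three settings control the encoding that open() uses by default. That comes from locale.getpreferredencoding(False), which on Windows is usually the ANSI code page, e.g. cp1252, unless UTF-8 mode is enabled.)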

The issue is within agate's csv_py3.py:

The function writerow() calls self.writer.writerow(row); if I wrap that call in a try/except, I can collect all the rows that cannot be processed.

So here I am actually stuck: these are the rows that cannot be processed, because somehow the write always goes through cp1252, which does not contain some of these special characters (see the minimal reproduction after the sample data below).

Here is some test data:

"Sedlitz Ost (Sedlišćo pódzajtšo)",
"Senftenberg (Zły Komorow)",
"Cottbus Hbf Calau (NL) Chóśebuz gł.dw",
"Gollmitz (NL) Chańc (Dolna Łužyca)",
"Calau (NL) Kalawa (Dolna Łužyca",
"Kolkwitz Süd Gołkojce pódpołdnjo",
"Cottbus-Sandow Chóśebuz-Žandow",
"Cottbus-Merzdorf Chóśebuz-Žylowk",
"Cottbus-Willmersdorf Nord Chóśebuz-Rogoznow pódpołnoc",

One additional note:

If I just read my csv and write it again (as dbt does) it just works:

import csv

import agate

my_seed = r"path/my_csv.csv"

# reading with the csv module and an explicit encoding works
with open(my_seed, encoding="utf-8") as f:
    data = list(csv.reader(f))

# and a full agate round-trip (read the CSV, write it back) works too
table = agate.Table.from_csv(my_seed)
table.to_csv("asd.csv")

Maybe someone can point me in the right direction as to where the local csv file is actually read in dbt; I cannot find where the agate table is created.


UPDATE:

Okay, it is actually PowerShell causing the issues. If I use Git Bash it works just fine. Windows...

CommonCrisis commented 1 week ago

Okay, for everyone having issues with seeds and dbt-athena on Windows, set this:

$Env:PYTHONUTF8=1
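
(To make this persist beyond the current PowerShell session, running setx PYTHONUTF8 1 from any Windows prompt should also work, since setx writes the variable to the user environment.)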