duckdb / duckdb_iceberg

MIT License
139 stars 22 forks source link

How to debug duckdb_iceberg? #77

Closed jacobmarble closed 3 days ago

jacobmarble commented 6 days ago

How can I understand why a particular Iceberg file set fails to query? Is there a "verbose logging" feature that isn't documented? How do developers of this plugin debug?


I'm working on a tool that generates Iceberg metadata and data. The generated file set looks like this:

iceberg
├── data
│   └── 5788d516-93ed-4aa8-81d1-8fe82969a2ea.parquet
└── metadata
    ├── 1728680229250371000.manifest-list.avro
    ├── 1728680229250371000.manifest.0.avro
    └── 1728680229250371000.metadata.json

select count(*) from iceberg_scan(...) yields a correct result

D select count(*) from iceberg_scan('iceberg/metadata/1728680229250371000.metadata.json') limit 5;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│          248 │
└──────────────┘

select * from iceberg_scan(...) yields the correct schema, the correct quantity of rows, but null values for every column and row

D select * from t limit 5;
┌───────────┬─────────┬─────────┬─────────────┬──────────────────┬────────────┬──────────────┬───────────┬────────────┬───────────────┬─────────────┬──────────────┬────────────┐
│   time    │   cpu   │  host   │ usage_guest │ usage_guest_nice │ usage_idle │ usage_iowait │ usage_irq │ usage_nice │ usage_softirq │ usage_steal │ usage_system │ usage_user │
│ timestamp │ varchar │ varchar │   double    │      double      │   double   │    double    │  double   │   double   │    double     │   double    │    double    │   double   │
├───────────┼─────────┼─────────┼─────────────┼──────────────────┼────────────┼──────────────┼───────────┼────────────┼───────────────┼─────────────┼──────────────┼────────────┤
│           │         │         │             │                  │            │              │           │            │               │             │              │            │
│           │         │         │             │                  │            │              │           │            │               │             │              │            │
│           │         │         │             │                  │            │              │           │            │               │             │              │            │
│           │         │         │             │                  │            │              │           │            │               │             │              │            │
│           │         │         │             │                  │            │              │           │            │               │             │              │            │
└───────────┴─────────┴─────────┴─────────────┴──────────────────┴────────────┴──────────────┴───────────┴────────────┴───────────────┴─────────────┴──────────────┴────────────┘

In contrast, select * from read_parquet(...) yields correct results

D select * from read_parquet('iceberg/data/5788d516-93ed-4aa8-81d1-8fe82969a2ea.parquet') limit 5;
┌───────────┬──────────────────────┬─────────────────────┬─────────────┬──────────────────┬───────────────────┬───┬───────────┬────────────┬───────────────┬─────────────┬────────────────────┬────────────────────┐
│    cpu    │         host         │        time         │ usage_guest │ usage_guest_nice │    usage_idle     │ … │ usage_irq │ usage_nice │ usage_softirq │ usage_steal │    usage_system    │     usage_user     │
│  varchar  │       varchar        │      timestamp      │   double    │      double      │      double       │   │  double   │   double   │    double     │   double    │       double       │       double       │
├───────────┼──────────────────────┼─────────────────────┼─────────────┼──────────────────┼───────────────────┼───┼───────────┼────────────┼───────────────┼─────────────┼────────────────────┼────────────────────┤
│ cpu-total │ Andrews-MBP.hsd1.m…  │ 2020-06-11 16:52:00 │         0.0 │              0.0 │ 97.93916821849783 │ … │       0.0 │        0.0 │           0.0 │         0.0 │ 1.1173184357541899 │ 0.9435133457479826 │
│ cpu-total │ Andrews-MBP.hsd1.m…  │ 2020-06-11 16:52:10 │         0.0 │              0.0 │ 97.70610663166448 │ … │       0.0 │        0.0 │           0.0 │         0.0 │ 1.0875679729983123 │ 1.2063253953372086 │
│ cpu-total │ Andrews-MBP.hsd1.m…  │ 2020-06-11 16:52:20 │         0.0 │              0.0 │ 97.79278434315013 │ … │       0.0 │        0.0 │           0.0 │         0.0 │  1.263052585506159 │ 0.9441630713437129 │
│ cpu-total │ Andrews-MBP.hsd1.m…  │ 2020-06-11 16:52:30 │         0.0 │              0.0 │ 96.31535998995669 │ … │       0.0 │        0.0 │           0.0 │         0.0 │ 1.6069298851296214 │ 2.0777101249136902 │
│ cpu-total │ Andrews-MBP.hsd1.m…  │ 2020-06-11 16:52:40 │         0.0 │              0.0 │  97.5298656047785 │ … │       0.0 │        0.0 │           0.0 │         0.0 │ 1.2381781981085116 │ 1.2319561971129915 │
├───────────┴──────────────────────┴─────────────────────┴─────────────┴──────────────────┴───────────────────┴───┴───────────┴────────────┴───────────────┴─────────────┴────────────────────┴────────────────────┤
│ 5 rows                                                                                                                                                                                     13 columns (12 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
mike-luabase commented 4 days ago

@jacobmarble what version are you on?

jacobmarble commented 4 days ago

Hey @mike-luabase I managed to build the plugin yesterday, and started adding printfs to figure out what's going on. Interesting that the iceberg_scan function essentially rewrites itself as parquet_scan - very elegant, easy to understand.

The problem resolves itself when I add skip_schema_inference = true, so there must be some schema mismatch between my Iceberg and Parquet data.

Back to the point of this issue: I found the skip_schema_inference parameter by looking through the source code; the feature isn't documented (unless there's documentation aside from https://duckdb.org/docs/extensions/iceberg ). But then I found that the schema parameter of parquet_scan/read_parquet is also not documented (at least not at https://duckdb.org/docs/data/parquet/overview.html#parameters ). Maybe there's a verbose logging option hidden somewhere as well.

Perhaps I can adjust my expectations of duckdb a bit. It works great when I wear my "data scientist hat" but requires a different approach when I'm wearing my "software engineer hat".