apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0
1.95k stars 979 forks source link

Cannot query iceberg with 1.20.1 #2644

Open meyergin opened 2 years ago

meyergin commented 2 years ago

after upgrade version to 1.20.1, I add iceberg configure followed by the tutorial(https://drill.apache.org/docs/iceberg-format-plugin/), but seems it doesn't work which return err as following: SYSTEM ERROR: NoSuchTableException: Table does not exist at location sql:

 select * from dfs.`/db/test`

test is folder of iceberg table which includes metadata and data folders. I doubted that maybe the question is the type of catalog, does drill support hive-based catalogs or just support hadoop-based catalogs?

cgivre commented 2 years ago

@meyergin This looks like an issue with your query. The path to your file needs to be in backticks.

IE:

SELECT * 
FROM dfs.`/db/test`

You can also define workspaces which are shortcuts to file paths. (https://drill.apache.org/docs/workspaces/)

jnturton commented 2 years ago

This looks like an issue with your query. The path to your file needs to be in backticks.

The gray block in his comment makes me think that GitHub has interpreted the backticks he did use as marking up a code block.

cgivre commented 2 years ago

@jnturton I missed that. Thanks!

In any event, this looks like Drill is not seeing the fill in that folder. Here's what I'd do.

  1. Verify that the iceberg configuration is present in your dfs storage plugin config.
  2. Next I'd run a SHOW FILES IN dfs to verify that Drill can in fact see any files in that path.

If all that checks out, I think we need to do some digging.

meyergin commented 2 years ago

Sorry for the poor format, I've modified it. And I have tested dfs and s3 storage, no one works.

  1. dfs storage config:

    {
    "type": "file",
    "connection": "file:///",
    "workspaces": {
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
    },
    "formats": {
    "iceberg": {
      "type": "iceberg",
      "properties": {
        "read.split.target-size": "134217728",
        "read.split.metadata-target-size": "33554432"
      },
      "caseSensitive": true,
      "includeColumnStats": null,
      "ignoreResiduals": null,
      "snapshotId": null,
      "snapshotAsOfTime": null,
      "fromSnapshotId": null,
      "toSnapshotId": null
    }
    },
    "enabled": true
    }
  2. show files return correct data and metadata folders. image

  3. error log:

    org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at location: /db/test
    at org.apache.iceberg.hadoop.HadoopTables.load(HadoopTables.java:89)
    at org.apache.drill.exec.store.iceberg.IcebergGroupScan.initTableScan(IcebergGroupScan.java:123)
Maegor commented 1 year ago

Hi, I have problems reading an Apache Iceberg table hosted in S3 (I commented this problem on Slack but I share it here too). My Drill version is 1.21.1 (updated recently)

The Storage Plugin configuration is saved as "S3_iceberg", this is the xml:

{
  "type": "file",
  "connection": "s3a://xxx-data-lake",
  "config": {
    "fs.s3a.secret.key": "1111",
    "fs.s3a.access.key": "111"
  },
  "workspaces": {
    "default": {
      "location": "/iceberg-data-lake",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "root": {
      "location": "/iceberg-data-lake",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
  },
  "formats": {
    "iceberg": {
      "type": "iceberg",
      "properties": {
        "read.split.target-size": "536870912",
        "read.split.metadata-target-size": "33554432"
      },
      "caseSensitive": true,
      "includeColumnStats": true,
      "ignoreResiduals": null,
      "snapshotId": null,
      "snapshotAsOfTime": null,
      "fromSnapshotId": null,
      "toSnapshotId": null
    }
  },

  "authMode": "SHARED_USER",
  "enabled": true
}

The command "SHOW FILES IN s3_iceberg" returns image

"SHOW FILES IN s3_iceberg.iceberg_accumulated_exposure" returns image

When I launch this query "_ANALYZE TABLE s3_iceberg.iceberg_accumulatedexposure REFRESH METADATA;" I get image

The query "_select * from s3_iceberg.iceberg_accumulatedexposure" returns same error

image

Can you share a configuration that you know works?

Thank you.