apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

ugi not correct in WORKER_POOL #10639

Closed lurnagao-dahua closed 2 months ago

lurnagao-dahua commented 3 months ago

Apache Iceberg version

1.4.3

Query engine

Hive 3.1.3

Please describe the bug 🐞

1. User B executes the query select * from iceberg_tb_b. This is a simple fetch that does not run a job.
2. User A executes the query select * from iceberg_tb_a. This is also a simple fetch that does not run a job, and it fails with this error log:

Caused by: org.apache.iceberg.exceptions.RuntimeIOException: Failed to open input stream for file: hdfs://hdfsHACluster/user/hive/warehouse/yc_iceberg.db/iceberg_tb_A/metadata/a72f8bf5-5d93-405b-953e-a8fed8bfa6b6-m0.avro
    at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:507)
    ...
Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=B, access=EXECUTE, inode="/user/hive/warehouse":hadoop:supergroup:drwx------

The iceberg-core module has a static global thread pool, WORKER_POOL, defined in org.apache.iceberg.util.ThreadPools.

DataTableScan.doPlanFiles:

public CloseableIterable<FileScanTask> doPlanFiles() {
  ...
  if (shouldPlanWithExecutor() && (dataManifests.size() > 1 || deleteManifests.size() > 1)) {
    manifestGroup = manifestGroup.planWith(planExecutor());
  }
  return manifestGroup.planFiles();
}

Do threads in WORKER_POOL keep their initial user information? Should we set the system property iceberg.scan.plan-in-worker-pool=false to disable WORKER_POOL in HiveServer2?
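To illustrate the suspected mechanism, here is a minimal sketch that uses only the JDK. It is a simulation, not Hadoop or Iceberg code: CURRENT_USER (an InheritableThreadLocal) stands in for the caller's UGI, SHARED_POOL stands in for a static pool such as WORKER_POOL, and all names in it are hypothetical. The assumption being modelled is that a worker thread captures the identity of whichever caller caused it to be created, and keeps it when it is reused.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StaleIdentityDemo {
  // Hypothetical stand-in for the caller identity (not Hadoop's UserGroupInformation).
  // A new thread copies the value from the thread that creates it.
  static final InheritableThreadLocal<String> CURRENT_USER = new InheritableThreadLocal<>();

  // Hypothetical stand-in for a static, process-wide worker pool.
  // The single worker thread is created lazily, on the first submitted task.
  static final ExecutorService SHARED_POOL =
      Executors.newFixedThreadPool(1, runnable -> {
        Thread thread = new Thread(runnable, "shared-worker");
        thread.setDaemon(true); // do not keep the JVM alive
        return thread;
      });

  // Sets the caller identity, then asks the shared pool which identity it sees.
  static String userSeenByPool(String caller) {
    CURRENT_USER.set(caller);
    Callable<String> task = CURRENT_USER::get; // evaluated on the worker thread
    try {
      return SHARED_POOL.submit(task).get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(userSeenByPool("B")); // prints "B": worker is created while B is the caller
    System.out.println(userSeenByPool("A")); // prints "B": the reused worker still carries B
  }
}
```

This matches the symptom in the log above, where the query from user A fails with user=B. Whether Hadoop's UGI propagates in exactly this way is an assumption of the sketch.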

pvary commented 3 months ago

@deniskuzZ: What do you think?

lurnagao-dahua commented 2 months ago

Similar issue: #2754

lurnagao-dahua commented 2 months ago

Hi, could you please help me check this issue? @deniskuzZ @nastra I would be very grateful for any response!

deniskuzZ commented 2 months ago

@lurnagao-dahua, is this the thread pool in question: https://github.com/apache/hive/commit/45867be6cb5308566e4cf16c7b4cf8081085b58c? cc @zhangbutao

Is there an easy repro so we could try in-house?

lurnagao-dahua commented 2 months ago

Thank you for your reply! The worker pool is defined in iceberg-core; the specific location is org.apache.iceberg.util.ThreadPools.

It is very easy to reproduce: just run queries in Hive as different users.

deniskuzZ commented 2 months ago

@lurnagao-dahua, what version of Hive are you using?

lurnagao-dahua commented 2 months ago

Thank you for your reply! Hive 3.1.3. I have now added more information to the description.

zhangbutao commented 2 months ago

@deniskuzZ I haven't looked into the ugi problem yet, but https://github.com/apache/hive/commit/45867be6cb5308566e4cf16c7b4cf8081085b58c has nothing to do with this problem. It just makes the thread pool size configurable; even without this change, iceberg-core will still use the thread pool when Hive calls the Iceberg method scan.planTasks().

deniskuzZ commented 2 months ago

https://github.com/apache/hive/commit/45867be6cb5308566e4cf16c7b4cf8081085b58c should fix the problem: the pool is recreated for every scan. The same thing is proposed here
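As a rough illustration of why recreating the pool for every scan would help, here is a JDK-only sketch. It is not the code from the commit above: CURRENT_USER (an InheritableThreadLocal) is a hypothetical stand-in for the caller's UGI, and planAs is a hypothetical stand-in for a scan that plans with its own executor. In Iceberg itself, a per-scan executor would be handed to the scan through planWith(ExecutorService).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerScanPoolDemo {
  // Hypothetical stand-in for the caller identity (not Hadoop's UserGroupInformation).
  static final InheritableThreadLocal<String> CURRENT_USER = new InheritableThreadLocal<>();

  // Simulates one scan: a fresh pool is created, used, and shut down per call,
  // so its worker thread is created while the current caller's identity is set.
  static String planAs(String caller) {
    CURRENT_USER.set(caller);
    ExecutorService scanPool = Executors.newFixedThreadPool(1);
    try {
      Callable<String> task = CURRENT_USER::get; // evaluated on the worker thread
      return scanPool.submit(task).get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      scanPool.shutdown(); // release the per-scan threads
    }
  }

  public static void main(String[] args) {
    System.out.println(planAs("B")); // prints "B"
    System.out.println(planAs("A")); // prints "A": no worker is reused across callers
  }
}
```

The trade-off is the cost of creating and tearing down threads on every scan, in exchange for never reusing a worker across users.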