StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0

1064 - Unexpected exception: fail to create tablet: 10004: [Internal error: starlet err Create hdfs root dir #50233

Closed · yeyhuan closed this issue 2 months ago

yeyhuan commented 2 months ago

Steps to reproduce the behavior (Required)

CREATE STORAGE VOLUME hdfs_storage_volume
TYPE = HDFS
LOCATIONS = ("hdfs://xxx/user/starrocks/")
COMMENT 'emr-hdfs-def'
PROPERTIES (
"dfs.nameservices" = "nameservices",
"dfs.ha.namenodes.nameservices" = "nn1,nn2",
"dfs.namenode.rpc-address.nameservices.nn1" = "xxx",
"dfs.namenode.rpc-address.nameservices.nn2" = "xxx",
"dfs.client.failover.proxy.provider.nameservices" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
"hadoop.security.authentication" = "kerberos",
"hadoop.security.kerberos.ticket.cache.path" = "/tmp/krb5cc_0"
);
CREATE TABLE IF NOT EXISTS hdfs_test (
     etl_flag TINYINT NOT NULL,
     douyin_no varchar(100) NULL DEFAULT '', 
     start_time bigint NOT NULL DEFAULT '0'
)
PROPERTIES (
    "storage_volume" = "hdfs_storage_volume", 
    "datacache.enable" = "true",
    "datacache.partition_duration" = "1 MONTH",
    "enable_async_write_back" = "false"
);

Expected behavior (Required)

The CREATE TABLE statement passes Kerberos authentication and the table is created.

Real behavior (Required)

error log

CREATE TABLE IF NOT EXISTS hdfs_test (
     etl_flag TINYINT NOT NULL,
     douyin_no varchar(100) NULL DEFAULT '', 
     start_time bigint NOT NULL DEFAULT '0'
)
PROPERTIES (
    "storage_volume" = "hdfs_storage_volume", 
    "datacache.enable" = "true",
    "datacache.partition_duration" = "1 MONTH",
    "enable_async_write_back" = "false"
)
> 1064 - Unexpected exception: fail to create tablet: 10008: [Internal error: starlet err Create hdfs root dir '/user/starrocks/908979b6-5632-4763-a40c-e9fa58bc9122/db10197/14243/14242' error: 权限不够: Permission denied]
> Time: 0.040s

StarRocks version (Required)

Linux. Kerberos authentication works on the host. (The StarRocks version and environment details were attached as screenshots.)

kevincai commented 2 months ago

check which be/cn reports the failure, and make sure the kerberos ticket cache auth works on all the be/cn nodes.
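A quick way to run that check on each node, sketched under assumptions: only the ticket cache path `/tmp/krb5cc_0` comes from the storage volume definition above; the keytab path, principal, and probe directory name are hypothetical examples.

```shell
# Run on every BE/CN node as the OS user that starts the BE/CN process.
klist -c /tmp/krb5cc_0                           # is the ticket cache valid and unexpired?
# If the ticket is expired, renew it (keytab path and principal are placeholders):
# kinit -kt /etc/security/keytabs/starrocks.keytab starrocks@EXAMPLE.COM -c /tmp/krb5cc_0
hdfs dfs -ls /user/starrocks/                    # can this user read the volume root?
hdfs dfs -mkdir /user/starrocks/_sr_write_probe  # can this user create directories there?
hdfs dfs -rmdir /user/starrocks/_sr_write_probe  # clean up the empty probe directory
```

If the `mkdir` probe fails with "Permission denied", adjust the HDFS ownership or ACLs on `/user/starrocks/` for the principal the BE/CN processes authenticate as.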

yeyhuan commented 2 months ago

The logs indicate that the issue is occurring on the nodes ending in .15, .37, and .42. Here are the exception logs and the results of running HDFS commands on those nodes:

error log

2024-08-28 11:01:53.700+08:00 WARN (thrift-server-pool-20|237) [LeaderImpl.finishTask():194] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:10.235.15.37, be_port:9060, http_port:8040), task_type:CREATE, signature:15449, task_status:TStatus(status_code:RUNTIME_ERROR, error_msgs:[Internal error: starlet err Create hdfs root dir '/user/starrocks/908979b6-5632-4763-a40c-e9fa58bc9122/db10197/15447/15446' error: 权限不够: Permission denied]), report_version:17240412940000)
 2024-08-28 11:01:53.701+08:00 WARN (thrift-server-pool-20|237) [LeaderImpl.finishTask():242] task type: CREATE, status_code: RUNTIME_ERROR, Internal error: starlet err Create hdfs root dir '/user/starrocks/908979b6-5632-4763-a40c-e9fa58bc9122/db10197/15447/15446' error: 权限不够: Permission denied, backendId: 10005, signature: 15449
 2024-08-28 11:01:53.701+08:00 WARN (starrocks-mysql-nio-pool-27|1895429) [LocalMetastore.waitForFinished():2131] fail to create tablet: 10005: [Internal error: starlet err Create hdfs root dir '/user/starrocks/908979b6-5632-4763-a40c-e9fa58bc9122/db10197/15447/15446' error: 权限不够: Permission denied]
 2024-08-28 11:01:53.701+08:00 WARN (thrift-server-pool-5|218) [LeaderImpl.finishTask():194] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:10.235.15.42, be_port:9060, http_port:8040), task_type:CREATE, signature:15453, task_status:TStatus(status_code:RUNTIME_ERROR, error_msgs:[Internal error: starlet err Create hdfs root dir '/user/starrocks/908979b6-5632-4763-a40c-e9fa58bc9122/db10197/15447/15446' error: 权限不够: Permission denied]), report_version:17240412930000)
 2024-08-28 11:01:53.701+08:00 WARN (thrift-server-pool-10|227) [LeaderImpl.finishTask():194] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:10.235.15.15, be_port:9060, http_port:8040), task_type:CREATE, signature:15451, task_status:TStatus(status_code:RUNTIME_ERROR, error_msgs:[Internal error: starlet err Create hdfs root dir '/user/starrocks/908979b6-5632-4763-a40c-e9fa58bc9122/db10197/15447/15446' error: 权限不够: Permission denied]), report_version:17240412930000)
 2024-08-28 11:01:53.701+08:00 WARN (thrift-server-pool-5|218) [LeaderImpl.finishTask():228] cannot find task. type: CREATE, backendId: 10008, signature: 15453
 2024-08-28 11:01:53.701+08:00 WARN (thrift-server-pool-10|227) [LeaderImpl.finishTask():228] cannot find task. type: CREATE, backendId: 10006, signature: 15451
 2024-08-28 11:01:53.701+08:00 WARN (starrocks-mysql-nio-pool-27|1895429) [StmtExecutor.handleDdlStmt():1682] DDL statement (CREATE TABLE IF NOT EXISTS hdfs_test (
     etl_flag TINYINT NOT NULL,
     douyin_no varchar(100) NULL DEFAULT '', 
     start_time bigint NOT NULL DEFAULT '0'
)
PROPERTIES (
    "storage_volume" = "hdfs_storage_volume", 
    "datacache.enable" = "true",
    "datacache.partition_duration" = "1 MONTH",
    "enable_async_write_back" = "false"
)) process failed.
 com.starrocks.common.DdlException: fail to create tablet: 10005: [Internal error: starlet err Create hdfs root dir '/user/starrocks/908979b6-5632-4763-a40c-e9fa58bc9122/db10197/15447/15446' error: 权限不够: Permission denied]
    at com.starrocks.server.LocalMetastore.waitForFinished(LocalMetastore.java:2132)
    at com.starrocks.server.LocalMetastore.sendCreateReplicaTasksAndWaitForFinished(LocalMetastore.java:2103)
    at com.starrocks.server.LocalMetastore.buildPartitionsSequentially(LocalMetastore.java:1934)
    at com.starrocks.server.LocalMetastore.buildPartitions(LocalMetastore.java:1902)
    at com.starrocks.server.OlapTableFactory.createTable(OlapTableFactory.java:605)
    at com.starrocks.server.LocalMetastore.createTable(LocalMetastore.java:843)
    at com.starrocks.server.MetadataMgr.createTable(MetadataMgr.java:271)
    at com.starrocks.qe.DDLStmtExecutor$StmtExecutorVisitor.lambda$visitCreateTableStatement$4(DDLStmtExecutor.java:250)
    at com.starrocks.common.ErrorReport.wrapWithRuntimeException(ErrorReport.java:108)
    at com.starrocks.qe.DDLStmtExecutor$StmtExecutorVisitor.visitCreateTableStatement(DDLStmtExecutor.java:249)
    at com.starrocks.qe.DDLStmtExecutor$StmtExecutorVisitor.visitCreateTableStatement(DDLStmtExecutor.java:159)
    at com.starrocks.sql.ast.CreateTableStmt.accept(CreateTableStmt.java:308)
    at com.starrocks.qe.DDLStmtExecutor.execute(DDLStmtExecutor.java:145)
    at com.starrocks.qe.StmtExecutor.handleDdlStmt(StmtExecutor.java:1656)
    at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:680)
    at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:345)
    at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:539)
    at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:846)
    at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:69)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)


I have verified that HDFS commands can be executed successfully on all BE/CN nodes. I also confirmed that the Kerberos ticket cache authentication is working as expected on all the nodes. Let me know if you need any further information or assistance.

kevincai commented 2 months ago

What are these nodes 15/37/42? FE nodes?

yeyhuan commented 2 months ago

The IP addresses are 10.235.15.37, 10.235.15.42, and 10.235.15.15. These are the CN nodes.

kevincai commented 2 months ago

Try checking cn.out (or jni.log) under the cn/log/ directory on these nodes to see if there is detailed info related to this permission error; most likely there is a Java call stack there.
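That search can be done with something like the following (the log directory is an assumption for a typical deployment; the pattern covers both the libhdfs message and the usual Hadoop exception name):

```shell
# Assumed CN log directory; adjust to the actual deployment path.
CN_LOG_DIR=${CN_LOG_DIR:-/opt/starrocks/cn/log}
# Hunt for JVM-side permission failures in cn.out and jni.log.
grep -n -i -E 'permission denied|AccessControlException' \
    "$CN_LOG_DIR"/cn.out "$CN_LOG_DIR"/jni.log 2>/dev/null || true
```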

kevincai commented 2 months ago

How many CN nodes are there? Do all CN nodes fail, or just these 3?

yeyhuan commented 2 months ago

All CN nodes.

yeyhuan commented 2 months ago

cn.out contains the following exception information:

hdfsOpenFile(/emr-6y2ejh20/908979b6-5632-4763-a40c-e9fa58bc9122/db10042/10057/10056/meta/000000000000274D_0000000000000B2B.meta): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
FileNotFoundException: No such file or directory '/emr-6y2ejh20/908979b6-5632-4763-a40c-e9fa58bc9122/db10042/10057/10056/meta/000000000000274D_0000000000000B2B.meta'java.io.FileNotFoundException: No such file or directory '/emr-6y2ejh20/908979b6-5632-4763-a40c-e9fa58bc9122/db10042/10057/10056/meta/000000000000274D_0000000000000B2B.meta'
        at org.apache.hadoop.fs.CosNFileSystem.getFileStatus(CosNFileSystem.java:617)
        at org.apache.hadoop.fs.CosNFileSystem.open(CosNFileSystem.java:838)
        at org.apache.hadoop.fs.CosFileSystem.open(CosFileSystem.java:268)
        at com.qcloud.emr.fs.TemrfsHadoopFileSystemAdapter.open(TemrfsHadoopFileSystemAdapter.java:251)
hdfsOpenFile(/emr-6y2ejh20/908979b6-5632-4763-a40c-e9fa58bc9122/db10042/10057/10056/meta/0000000000002752_0000000000000B2B.meta): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
FileNotFoundException: No such file or directory '/emr-6y2ejh20/908979b6-5632-4763-a40c-e9fa58bc9122/db10042/10057/10056/meta/0000000000002752_0000000000000B2B.meta'java.io.FileNotFoundException: No such file or directory '/emr-6y2ejh20/908979b6-5632-4763-a40c-e9fa58bc9122/db10042/10057/10056/meta/0000000000002752_0000000000000B2B.meta'
        at org.apache.hadoop.fs.CosNFileSystem.getFileStatus(CosNFileSystem.java:617)
        at org.apache.hadoop.fs.CosNFileSystem.open(CosNFileSystem.java:838)
        at org.apache.hadoop.fs.CosFileSystem.open(CosFileSystem.java:268)
        at com.qcloud.emr.fs.TemrfsHadoopFileSystemAdapter.open(TemrfsHadoopFileSystemAdapter.java:251)
hdfsOpenFile(/emr-6y2ejh20/908979b6-5632-4763-a40c-e9fa58bc9122/db10009/10012/12845/meta/000000000000322F_0000000000000ADC.meta): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
FileNotFoundException: No such file or directory '/emr-6y2ejh20/908979b6-5632-4763-a40c-e9fa58bc9122/db10009/10012/12845/meta/000000000000322F_0000000000000ADC.meta'java.io.FileNotFoundException: No such file or directory '/emr-6y2ejh20/908979b6-5632-4763-a40c-e9fa58bc9122/db10009/10012/12845/meta/000000000000322F_0000000000000ADC.meta'
        at org.apache.hadoop.fs.CosNFileSystem.getFileStatus(CosNFileSystem.java:617)
        at org.apache.hadoop.fs.CosNFileSystem.open(CosNFileSystem.java:838)
        at org.apache.hadoop.fs.CosFileSystem.open(CosFileSystem.java:268)
        at com.qcloud.emr.fs.TemrfsHadoopFileSystemAdapter.open(TemrfsHadoopFileSystemAdapter.java:251)

kevincai commented 2 months ago

These are not related. Check whether there is any permission-denied related error.

yeyhuan commented 2 months ago

No logs related to permissions were found in the cn.out file. [screenshot]

cn.INFO has some entries: [screenshot]

kevincai commented 2 months ago

Do you have additional Hadoop core-site.xml/hdfs-site.xml configuration under cn/conf?

yeyhuan commented 2 months ago

Already solved. The cause was the missing Hadoop core-site.xml/hdfs-site.xml configuration.
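For anyone hitting the same error, the fix amounts to placing the cluster's Hadoop client configuration on every CN node. A sketch, assuming typical EMR and StarRocks paths; the hostnames, source path, and destination path are placeholders, not from this issue:

```shell
# Distribute the HDFS client config to every CN node so that the
# starlet/libhdfs layer picks up the HA nameservice and Kerberos settings.
for node in cn-node-1 cn-node-2 cn-node-3; do   # hypothetical CN hostnames
    scp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml \
        "$node":/opt/starrocks/cn/conf/
done
# Restart the CN process on each node afterwards so the new config is loaded.
```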