facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0

branch-1.1: Failed to get metadata for S3 object #8013

Open xingnailu opened 11 months ago

xingnailu commented 11 months ago

Bug description

I built Gluten + Velox from branch-1.1 and submitted a TPC-H query through spark-shell, with the data stored in S3. The following error occurred during execution:

Reason: Failed to get metadata for S3 object due to: 'Unknown error'. Path:'s3://xxxxxxx/user/hive/warehouse/tpch_orc.db/customer/part-00027-31ef1f3c-5b27-4f6c-aef4-7f77f7749873-c000.snappy.orc', SDK Error Type:100, HTTP Status Code:400, S3 Service:'AmazonS3', Message:'No response body.', RequestID:'KC5WQZ78QWKQ9BFX'
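For triage, the HTTP status code is the most informative field in this message. The error is raised in S3ReadFile::initialize (see the stack trace below), which appears to fetch the object's metadata with a HEAD-style request. Responses to HEAD requests carry no body, which would explain Message:'No response body.' and leaves the status code as the main clue. The sketch below extracts that code from the reason string and maps it to a first guess. Both helper names are hypothetical, and the cause mapping is a rough heuristic, not behavior defined by Velox or the AWS SDK.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: extracts the number that follows "HTTP Status Code:"
// in a Velox S3 error reason. Returns -1 if the field is missing or not numeric.
int parseHttpStatus(const std::string& reason) {
  const std::string field = "HTTP Status Code:";
  const auto pos = reason.find(field);
  if (pos == std::string::npos) {
    return -1;
  }
  const auto start = pos + field.size();
  if (start >= reason.size() ||
      !std::isdigit(static_cast<unsigned char>(reason[start]))) {
    return -1;
  }
  // std::stoi stops at the first non-digit, e.g. the comma after "400".
  return std::stoi(reason.substr(start));
}

// Hypothetical heuristic: a first guess at why a metadata (HEAD) request failed.
std::string likelyHeadObjectCause(int httpStatus) {
  switch (httpStatus) {
    case 400:
      return "bad request: check endpoint, region and signing settings";
    case 403:
      return "access denied: check credentials and bucket policy";
    case 404:
      return "not found: check bucket, key and endpoint";
    default:
      return "unclassified: enable AWS SDK trace logging";
  }
}
```

For this report parseHttpStatus yields 400, which points at endpoint, region, or signing settings; the OSS report later in the thread yields 404.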

The same query runs normally with Gluten tag v1.0.0.

@majetideepak

System information

System info for the branch-1.1 build:

Velox System Info v0.0.2
Commit: bbd65c4109fc11d4021334aff817ff384eab7b88
CMake Version: 3.16.3
System: Linux-5.15.0-91-generic
Arch: x86_64
C++ Compiler: /bin/c++
C++ Compiler Version: 9.4.0
C Compiler: /bin/cc
C Compiler Version: 9.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt

Running on AWS EKS.

Relevant logs

"2023-12-05T07:12:37.689576121Z stdout F 23/12/05 07:12:37 ERROR TaskResources: Task 8 failed by error: ",
"2023-12-05T07:12:37.689606328Z stdout F io.glutenproject.exception.GlutenException: java.lang.RuntimeException: Exception: VeloxRuntimeError",
"2023-12-05T07:12:37.689628682Z stdout F Error Source: RUNTIME",
"2023-12-05T07:12:37.689632451Z stdout F Error Code: INVALID_STATE",
"2023-12-05T07:12:37.689636372Z stdout F Reason: Failed to get metadata for S3 object due to: 'Unknown error'. Path:'s3://xxxxxx/user/hive/warehouse/tpch_orc.db/customer/part-00027-31ef1f3c-5b27-4f6c-aef4-7f77f7749873-c000.snappy.orc', SDK Error Type:100, HTTP Status Code:400, S3 Service:'AmazonS3', Message:'No response body.', RequestID:'KC5WQZ78QWKQ9BFH'",
"2023-12-05T07:12:37.689639435Z stdout F Retriable: False",
"2023-12-05T07:12:37.689643198Z stdout F Context: Split [Hive: s3a://xxxxxx/user/hive/warehouse/tpch_orc.db/customer/part-00027-31ef1f3c-5b27-4f6c-aef4-7f77f7749873-c000.snappy.orc 0 - 121746056] Task Gluten_Stage_0_TID_8",
"2023-12-05T07:12:37.689646437Z stdout F Top-Level Context: Same as context.",
"2023-12-05T07:12:37.689649292Z stdout F Function: initialize",
"2023-12-05T07:12:37.689652406Z stdout F File: ../../velox/connectors/hive/storage_adapters/s3fs/S3FileSystem.cpp",
"2023-12-05T07:12:37.689655045Z stdout F Line: 93", 
"2023-12-05T07:12:37.689657984Z stdout F Stack trace:",
"2023-12-05T07:12:37.689661375Z stdout F # 0  facebook::velox::VeloxException::VeloxException(char const*, unsigned long, char const*, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, bool, facebook::velox::VeloxException::Type, std::basic_string_view<char, std::char_traits<char> >)",
"2023-12-05T07:12:37.689670744Z stdout F # 1  void facebook::velox::detail::veloxCheckFail<facebook::velox::VeloxRuntimeError, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>(facebook::velox::detail::VeloxCheckFailArgs const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)", 
"2023-12-05T07:12:37.68967352Z stdout F # 2  facebook::velox::(anonymous namespace)::S3ReadFile::initialize()",
"2023-12-05T07:12:37.689677103Z stdout F # 3  facebook::velox::filesystems::S3FileSystem::openFileForRead(std::basic_string_view<char, std::char_traits<char> >, facebook::velox::filesystems::FileOptions const&)",
"2023-12-05T07:12:37.689680232Z stdout F # 4  facebook::velox::FileHandleGenerator::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)",
"2023-12-05T07:12:37.689682935Z stdout F # 5  facebook::velox::CachedFactory<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<facebook::velox::FileHandle>, facebook::velox::FileHandleGenerator>::generate(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)", 
"2023-12-05T07:12:37.689686275Z stdout F # 6  facebook::velox::connector::hive::HiveDataSource::addSplit(std::shared_ptr<facebook::velox::connector::ConnectorSplit>)",
"2023-12-05T07:12:37.68970488Z stdout F # 7  facebook::velox::exec::TableScan::getOutput()",
"2023-12-05T07:12:37.689707926Z stdout F # 8  facebook::velox::exec::Driver::runInternal(std::shared_ptr<facebook::velox::exec::Driver>&, std::shared_ptr<facebook::velox::exec::BlockingState>&, std::shared_ptr<facebook::velox::RowVector>&)",
"2023-12-05T07:12:37.689710953Z stdout F # 9  facebook::velox::exec::Driver::next(std::shared_ptr<facebook::velox::exec::BlockingState>&)",
"2023-12-05T07:12:37.689713812Z stdout F # 10 facebook::velox::exec::Task::next(folly::SemiFuture<folly::Unit>*)",
"2023-12-05T07:12:37.689716972Z stdout F # 11 gluten::WholeStageResultIterator::next()",
"2023-12-05T07:12:37.689719966Z stdout F # 12 Java_io_glutenproject_vectorized_ColumnarBatchOutIterator_nativeHasNext",
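
The Context line refers to the split with an s3a:// path, while the Reason line prints s3://. This suggests that the adapter strips whatever scheme the split uses, keeps only the bucket and key, and formats the error path as s3://<bucket>/<key>. The sketch below reproduces that transformation under this assumption; splitBucketAndKey and toS3Path are hypothetical names, not Velox functions.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical sketch: strips any "<scheme>://" prefix and splits the rest
// into (bucket, key). Returns std::nullopt if there is no scheme or bucket.
std::optional<std::pair<std::string, std::string>> splitBucketAndKey(
    std::string_view path) {
  const auto schemeEnd = path.find("://");
  if (schemeEnd == std::string_view::npos) {
    return std::nullopt;
  }
  const auto rest = path.substr(schemeEnd + 3);
  const auto slash = rest.find('/');
  if (slash == std::string_view::npos || slash == 0) {
    return std::nullopt;
  }
  return std::make_pair(std::string(rest.substr(0, slash)),
                        std::string(rest.substr(slash + 1)));
}

// Rebuilds the path in the form the Reason line prints it.
std::string toS3Path(const std::string& bucket, const std::string& key) {
  return "s3://" + bucket + "/" + key;
}
```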

xingnailu commented 11 months ago

@majetideepak, could you suggest what to check next? Thank you.

dcoliversun commented 11 months ago

I get a similar exception, but my data is stored on Alibaba Cloud OSS. The S3 storage adapter supports the oss scheme [1].
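Supporting a scheme only decides which file system handles a path. In this sketch, and presumably in Velox, it does not choose the endpoint, so an oss:// or s3a:// path is still sent to whatever endpoint the S3 client is configured with. A minimal prefix check, assuming the four schemes s3://, s3a://, s3n:// and oss:// (newer Velox versions may accept more); isS3CompatiblePath is a hypothetical name:

```cpp
#include <array>
#include <string_view>

// Hypothetical sketch: true if the path starts with one of the schemes
// assumed to be routed to the S3 storage adapter.
bool isS3CompatiblePath(std::string_view path) {
  constexpr std::array<std::string_view, 4> kSchemes = {
      "s3://", "s3a://", "s3n://", "oss://"};
  for (const auto scheme : kSchemes) {
    // substr clamps the length, so short paths are handled safely.
    if (path.substr(0, scheme.size()) == scheme) {
      return true;
    }
  }
  return false;
}
```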

The exception is:

Caused by: io.glutenproject.exception.GlutenException: java.lang.RuntimeException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Failed to get metadata for S3 object due to: 'Resource not found'. Path:'s3://henghzhen-test-hangzhou/db/t1/b=1/c=10/part-00000-d4940ed1-7f70-44f5-bbb0-65ae29f325f1.c000.snappy.parquet', SDK Error Type:16, HTTP Status Code:404, S3 Service:'AmazonS3', Message:'No response body.', RequestID:'2VQQRSWNX8QQGNNY'
Retriable: False
Context: Split [Hive: s3a://henghzhen-test-hangzhou/db/t1/b=1/c=10/part-00000-d4940ed1-7f70-44f5-bbb0-65ae29f325f1.c000.snappy.parquet 0 - 443] Task Gluten_Stage_0_TID_0
Top-Level Context: Same as context.
Function: initialize
File: ../../velox/connectors/hive/storage_adapters/s3fs/S3FileSystem.cpp
Line: 93
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox12_GLOBAL__N_110S3ReadFile10initializeEv
# 4  _ZN8facebook5velox11filesystems12S3FileSystem15openFileForReadESt17basic_string_viewIcSt11char_traitsIcEERKNS1_11FileOptionsE
# 5  _ZN8facebook5velox19FileHandleGeneratorclERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
# 6  _ZN8facebook5velox13CachedFactoryINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt10shared_ptrINS0_10FileHandleEENS0_19FileHandleGeneratorEE8generateERKS7_
# 7  _ZN8facebook5velox9connector4hive14HiveDataSource8addSplitESt10shared_ptrINS1_14ConnectorSplitEE
# 8  _ZN8facebook5velox4exec9TableScan9getOutputEv
# 9  _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
# 10 _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
# 11 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 12 _ZN6gluten24WholeStageResultIterator4nextEv
# 13 Java_io_glutenproject_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 14 0x00007f8c75018427

[1] https://facebookincubator.github.io/velox/develop/connectors.html?highlight=oss

majetideepak commented 11 months ago

@xingnailu, @dcoliversun Enabling hive.s3.log-level="TRACE" sometimes gives you more information in the AWS SDK log file. Can you try it? https://facebookincubator.github.io/velox/configs.html#amazon-s3-configuration
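hive.s3.log-level is a Velox Hive connector property that sets the AWS SDK log level. With TRACE, the SDK's default logger writes request and response details, including the host it connects to, to a file named like aws_sdk_<date-hour>.log, by default in the process's working directory. The sketch below validates a level before storing it in a property map. The list OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE is assumed from the configuration page linked above, the helper names are hypothetical, and how the property is passed from Spark through Gluten is not covered in this thread.

```cpp
#include <algorithm>
#include <array>
#include <map>
#include <string>

// Values assumed to be accepted for hive.s3.log-level (re-check the Velox
// configuration page for your version). Matching is exact and case-sensitive.
bool isValidS3LogLevel(const std::string& level) {
  static const std::array<std::string, 7> kLevels = {
      "OFF", "FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE"};
  return std::find(kLevels.begin(), kLevels.end(), level) != kLevels.end();
}

// Hypothetical helper: stores the level in a map of Velox Hive connector
// properties. Returns false and leaves the map unchanged for unknown levels.
bool setS3LogLevel(std::map<std::string, std::string>& props,
                   const std::string& level) {
  if (!isValidS3LogLevel(level)) {
    return false;
  }
  props["hive.s3.log-level"] = level;
  return true;
}
```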

dcoliversun commented 11 months ago

@majetideepak Here is the TRACE log: aws_sdk_2023-12-18-08.log

I want to access OSS, but the SDK host is set to s3.us-east-1.amazonaws.com. These are the Spark settings I use with Gluten:

spark.hadoop.fs.s3a.endpoint: https://oss-cn-hangzhou.aliyuncs.com
spark.hadoop.fs.s3a.access.key: <access-key>
spark.hadoop.fs.s3a.secret.key: <secret-key>
spark.hadoop.fs.s3a.path.style.access: false
spark.hadoop.fs.s3a.connection.ssl.enabled: true

How can I set the correct endpoint in the Velox S3 connector?
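The SDK talking to s3.us-east-1.amazonaws.com, the default AWS endpoint, suggests that the fs.s3a.endpoint value never reached the Velox S3 client. Velox reads its endpoint from the hive.s3.endpoint property on the configuration page linked earlier, so the Spark-side s3a settings have to be translated into hive.s3.* keys somewhere. The sketch below shows one such translation. The function name and the exact key pairs are assumptions for illustration; this thread does not show how Gluten branch-1.1 actually performs the mapping, and the root cause is tracked in the issue linked in the final reply.

```cpp
#include <map>
#include <string>

// Hypothetical translation from Spark/Hadoop s3a settings to Velox hive.s3.*
// keys (key names assumed from the Velox S3 configuration page). Only keys
// present in sparkConf are copied; everything else is left to Velox defaults.
std::map<std::string, std::string> toVeloxS3Config(
    const std::map<std::string, std::string>& sparkConf) {
  static const std::map<std::string, std::string> kKeyMap = {
      {"spark.hadoop.fs.s3a.endpoint", "hive.s3.endpoint"},
      {"spark.hadoop.fs.s3a.access.key", "hive.s3.aws-access-key"},
      {"spark.hadoop.fs.s3a.secret.key", "hive.s3.aws-secret-key"},
      {"spark.hadoop.fs.s3a.path.style.access", "hive.s3.path-style-access"},
      {"spark.hadoop.fs.s3a.connection.ssl.enabled", "hive.s3.ssl.enabled"},
  };
  std::map<std::string, std::string> veloxConf;
  for (const auto& [sparkKey, veloxKey] : kKeyMap) {
    if (auto it = sparkConf.find(sparkKey); it != sparkConf.end()) {
      veloxConf[veloxKey] = it->second;
    }
  }
  return veloxConf;
}
```

If a mapping like this is missing, or the key name does not match what the connector reads, the S3 client would fall back to the default AWS endpoint, which is consistent with the host seen in the TRACE log.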

dcoliversun commented 11 months ago

@majetideepak We found the cause; see https://github.com/oap-project/velox/issues/464 for details. Thanks for your help :)