apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0
665 stars 477 forks source link

ORC-1669: [C++] Deprecate HDFS support #1885

Closed wgtmac closed 2 months ago

wgtmac commented 2 months ago

What changes were proposed in this pull request?

Mark readHdfsFile as deprecated.

Why are the changes needed?

Reading ORC on HDFS was introduced in https://github.com/apache/orc/pull/134 without any test. It has not been updated for 7 years and updating libhdfspp will result in extra dependency like boost library. Staying at an old version of libhdfspp will also prohibit us from updating other libraries like protobuf.

How was this patch tested?

It does not need test.

Was this patch authored or co-authored using generative AI tooling?

No.

wgtmac commented 2 months ago

cc @dongjoon-hyun @stiga-huang @ffacs

wgtmac commented 2 months ago

Also cc @omalley in case there is any blocker to this.

dongjoon-hyun commented 2 months ago

cc @williamhyun , @pavibhai , @deshanxiao , too.

wgtmac commented 2 months ago

Does it mean that we cannot upgrade libprotobuf like https://github.com/apache/orc/pull/1857 in 2.x? Or at least we can choose appropriate protobuf version according to the BUILD_LIBHDFSPP option. @dongjoon-hyun

stiga-huang commented 2 months ago

+1 for choosing the protobuf version based on the BUILD_LIBHDFSPP option. It might have less complaints.

dongjoon-hyun commented 2 months ago

I'm not sure what you mean by that. May I ask what is broken in libprotobuf PR, #1857, specifically , @wgtmac ?

Does it mean that we cannot upgrade libprotobuf like #1857 in 2.x? Or at least we can choose appropriate protobuf version according to the BUILD_LIBHDFSPP option.

Apache ORC 2 has two-layer protections.

FYI, Semantic Versioning is defined here. You can do whatever you want as long as you follow the policy. In this PR, my concern is the part of a backward compatible manner of this policy. In general, Removal is not an option for backward compatibility in general. That's the reason why I told you that we can do only Deprecatation. Deprecation doesn't mean Removal.

MAJOR version when you make incompatible API changes MINOR version when you add functionality in a backward compatible manner PATCH version when you make backward compatible bug fixes

wgtmac commented 2 months ago

https://github.com/apache/orc/pull/1857 only changes from 3.5.1 to 3.21.12. The major version is the same. Do you mean they break anything there?

Yes, libhdfspp does not link with 3.21.12. For example:

[ 84%] Linking CXX executable orc-test
/usr/bin/ld: ../libs/thirdparty/libhdfspp_ep-install/lib/libhdfspp_static.a(namenode_operations.cc.o): in function `hadoop::hdfs::GetFsStatusRequestProto::~GetFsStatusRequestProto()':
/root/build/libhdfspp_ep-prefix/src/libhdfspp_ep-build/lib/proto/ClientNamenodeProtocol.pb.h:9833: undefined reference to `google::protobuf::internal::ZeroFieldsBase::~ZeroFieldsBase()'
/usr/bin/ld: ../libs/thirdparty/libhdfspp_ep-install/lib/libhdfspp_static.a(namenode_operations.cc.o): in function `hadoop::hdfs::DisallowSnapshotResponseProto::~DisallowSnapshotResponseProto()':
/root/build/libhdfspp_ep-prefix/src/libhdfspp_ep-build/lib/proto/ClientNamenodeProtocol.pb.h:23777: undefined reference to `google::protobuf::internal::ZeroFieldsBase::~ZeroFieldsBase()'
/usr/bin/ld: ../libs/thirdparty/libhdfspp_ep-install/lib/libhdfspp_static.a(namenode_operations.cc.o): in function `hadoop::hdfs::AllowSnapshotResponseProto::~AllowSnapshotResponseProto()':
/root/build/libhdfspp_ep-prefix/src/libhdfspp_ep-build/lib/proto/ClientNamenodeProtocol.pb.h:23487: undefined reference to `google::protobuf::internal::ZeroFieldsBase::~ZeroFieldsBase()'
/usr/bin/ld: ../libs/thirdparty/libhdfspp_ep-install/lib/libhdfspp_static.a(namenode_operations.cc.o): in function `hadoop::hdfs::RenameSnapshotResponseProto::~RenameSnapshotResponseProto()':
/root/build/libhdfspp_ep-prefix/src/libhdfspp_ep-build/lib/proto/ClientNamenodeProtocol.pb.h:23197: undefined reference to `google::protobuf::internal::ZeroFieldsBase::~ZeroFieldsBase()'
/usr/bin/ld: ../libs/thirdparty/libhdfspp_ep-install/lib/libhdfspp_static.a(namenode_operations.cc.o): in function `hadoop::hdfs::DeleteSnapshotResponseProto::~DeleteSnapshotResponseProto()':
/root/build/libhdfspp_ep-prefix/src/libhdfspp_ep-build/lib/proto/ClientNamenodeProtocol.pb.h:24090: undefined reference to `google::protobuf::internal::ZeroFieldsBase::~ZeroFieldsBase()'
/usr/bin/ld: ../libs/thirdparty/libhdfspp_ep-install/lib/libhdfspp_static.a(namenode_operations.cc.o):/root/build/libhdfspp_ep-prefix/src/libhdfspp_ep-build/lib/proto/ClientNamenodeProtocol.pb.h:4055: more undefined references to `google::protobuf::internal::ZeroFieldsBase::~ZeroFieldsBase()' follow
/usr/bin/ld: ../libs/thirdparty/libhdfspp_ep-install/lib/libhdfspp_static.a(ClientNamenodeProtocol.pb.cc.o):
dongjoon-hyun commented 2 months ago

Although we don't want to touch that file, I guess we are able to update the source code like we did in the following PR. WDTY?

wgtmac commented 2 months ago

Right, I feel very reluctant to touch that file. I'll look into it later.

wgtmac commented 2 months ago

Thanks @dongjoon-hyun!