Xtra-Computing / FedTree

A tree-based federated learning system (MLSys 2023)
https://fedtree.readthedocs.io/en/latest/index.html
Apache License 2.0

not working with grpc-1.53.0 & server still waiting after finish training #63

Open lidh15 opened 1 year ago

lidh15 commented 1 year ago

The documentation mentions that gRPC versions earlier than 1.50 may not work. I used the latest release, 1.53, and `make` fails with the following errors:

```
[ 20%] Building CXX object src/FedTree/CMakeFiles/FedTree_DIST.dir/scikit_fedtree.cpp.o
In file included from /usr/local/include/absl/base/config.h:86,
                 from /usr/local/include/absl/base/const_init.h:25,
                 from /usr/local/include/absl/synchronization/mutex.h:67,
                 from /usr/local/include/grpcpp/impl/sync.h:30,
                 from /usr/local/include/grpcpp/impl/codegen/sync.h:25,
                 from /usr/local/include/grpcpp/completion_queue.h:43,
                 from /usr/local/include/grpcpp/channel.h:25,
                 from /usr/local/include/grpcpp/grpcpp.h:52,
                 from /workspace/FedTree/include/FedTree/FL/distributed_party.h:8,
                 from /workspace/FedTree/src/FedTree/FL/distributed_party.cpp:5:
/usr/local/include/absl/base/policy_checks.h:79:2: error: #error "C++ versions less than C++14 are not supported."
   79 | #error "C++ versions less than C++14 are not supported."
      |  ^~~~~
In file included from /usr/local/include/absl/base/config.h:86,
                 from /usr/local/include/absl/base/const_init.h:25,
                 from /usr/local/include/absl/synchronization/mutex.h:67,
                 from /usr/local/include/grpcpp/impl/sync.h:30,
                 from /usr/local/include/grpcpp/impl/codegen/sync.h:25,
                 from /usr/local/include/grpcpp/completion_queue.h:43,
                 from /usr/local/include/grpcpp/channel.h:25,
                 from /usr/local/include/grpcpp/grpcpp.h:52,
                 from /workspace/FedTree/include/FedTree/FL/distributed_server.h:8,
                 from /workspace/FedTree/src/FedTree/FL/distributed_server.cpp:5:
/usr/local/include/absl/base/policy_checks.h:79:2: error: #error "C++ versions less than C++14 are not supported."
   79 | #error "C++ versions less than C++14 are not supported."
      |  ^~~~~
In file included from /usr/local/include/absl/time/time.h:88,
                 from /usr/local/include/absl/time/clock.h:26,
                 from /usr/local/include/absl/synchronization/internal/kernel_timeout.h:35,
                 from /usr/local/include/absl/synchronization/mutex.h:74,
                 from /usr/local/include/grpcpp/impl/sync.h:30,
                 from /usr/local/include/grpcpp/impl/codegen/sync.h:25,
                 from /usr/local/include/grpcpp/completion_queue.h:43,
                 from /usr/local/include/grpcpp/channel.h:25,
                 from /usr/local/include/grpcpp/grpcpp.h:52,
                 from /workspace/FedTree/include/FedTree/FL/distributed_party.h:8,
                 from /workspace/FedTree/src/FedTree/FL/distributed_party.cpp:5:
/usr/local/include/absl/strings/string_view.h: In member function ‘constexpr void absl::lts_20230125::string_view::remove_prefix(absl::lts_20230125::string_view::size_type) const’:
/usr/local/include/absl/strings/string_view.h:340:10: error: assignment of member ‘absl::lts_20230125::string_view::ptr_’ in read-only object
  340 |     ptr_ += n;
      |     ~~~~~^~~~
/usr/local/include/absl/strings/string_view.h:341:13: error: assignment of member ‘absl::lts_20230125::string_view::length_’ in read-only object
  341 |     length_ -= n;
      |     ~~~~~~~~^~~~
/usr/local/include/absl/strings/string_view.h:338:18: error: invalid return type ‘void’ of ‘constexpr’ function ‘constexpr void absl::lts_20230125::string_view::remove_prefix(absl::lts_20230125::string_view::size_type) const’
  338 |   constexpr void remove_prefix(size_type n) {
      |                  ^~~~~~~~~~~~~
/usr/local/include/absl/strings/string_view.h: In member function ‘constexpr void absl::lts_20230125::string_view::remove_suffix(absl::lts_20230125::string_view::size_type) const’:
/usr/local/include/absl/strings/string_view.h:350:13: error: assignment of member ‘absl::lts_20230125::string_view::length_’ in read-only object
  350 |     length_ -= n;
      |     ~~~~~~~~^~~~
/usr/local/include/absl/strings/string_view.h:348:18: error: invalid return type ‘void’ of ‘constexpr’ function ‘constexpr void absl::lts_20230125::string_view::remove_suffix(absl::lts_20230125::string_view::size_type) const’
  348 |   constexpr void remove_suffix(size_type n) {
      |                  ^~~~~~~~~~~~~
/usr/local/include/absl/strings/string_view.h: In member function ‘constexpr void absl::lts_20230125::string_view::swap(absl::lts_20230125::string_view&) const’:
/usr/local/include/absl/strings/string_view.h:358:13: error: passing ‘const absl::lts_20230125::string_view’ as ‘this’ argument discards qualifiers [-fpermissive]
  358 |     *this = s;
      |             ^
/usr/local/include/absl/strings/string_view.h:161:7: note: in call to ‘absl::lts_20230125::string_view& absl::lts_20230125::string_view::operator=(const absl::lts_20230125::string_view&)’
  161 | class string_view {
      |       ^~~~~~~~~~~
/usr/local/include/absl/strings/string_view.h:356:18: error: invalid return type ‘void’ of ‘constexpr’ function ‘constexpr void absl::lts_20230125::string_view::swap(absl::lts_20230125::string_view&) const’
  356 |   constexpr void swap(string_view& s) noexcept {
      |                  ^~~~
In file included from /usr/local/include/absl/time/time.h:88,
                 from /usr/local/include/absl/time/clock.h:26,
                 from /usr/local/include/absl/synchronization/internal/kernel_timeout.h:35,
                 from /usr/local/include/absl/synchronization/mutex.h:74,
                 from /usr/local/include/grpcpp/impl/sync.h:30,
                 from /usr/local/include/grpcpp/impl/codegen/sync.h:25,
                 from /usr/local/include/grpcpp/completion_queue.h:43,
                 from /usr/local/include/grpcpp/channel.h:25,
                 from /usr/local/include/grpcpp/grpcpp.h:52,
                 from /workspace/FedTree/include/FedTree/FL/distributed_server.h:8,
                 from /workspace/FedTree/src/FedTree/FL/distributed_server.cpp:5:
/usr/local/include/absl/strings/string_view.h: In member function ‘constexpr void absl::lts_20230125::string_view::remove_prefix(absl::lts_20230125::string_view::size_type) const’:
/usr/local/include/absl/strings/string_view.h:340:10: error: assignment of member ‘absl::lts_20230125::string_view::ptr_’ in read-only object
  340 |     ptr_ += n;
      |     ~~~~~^~~~
/usr/local/include/absl/strings/string_view.h:341:13: error: assignment of member ‘absl::lts_20230125::string_view::length_’ in read-only object
  341 |     length_ -= n;
      |     ~~~~~~~~^~~~
/usr/local/include/absl/strings/string_view.h:338:18: error: invalid return type ‘void’ of ‘constexpr’ function ‘constexpr void absl::lts_20230125::string_view::remove_prefix(absl::lts_20230125::string_view::size_type) const’
  338 |   constexpr void remove_prefix(size_type n) {
      |                  ^~~~~~~~~~~~~
/usr/local/include/absl/strings/string_view.h: In member function ‘constexpr void absl::lts_20230125::string_view::remove_suffix(absl::lts_20230125::string_view::size_type) const’:
/usr/local/include/absl/strings/string_view.h:350:13: error: assignment of member ‘absl::lts_20230125::string_view::length_’ in read-only object
  350 |     length_ -= n;
      |     ~~~~~~~~^~~~
/usr/local/include/absl/strings/string_view.h:348:18: error: invalid return type ‘void’ of ‘constexpr’ function ‘constexpr void absl::lts_20230125::string_view::remove_suffix(absl::lts_20230125::string_view::size_type) const’
  348 |   constexpr void remove_suffix(size_type n) {
      |                  ^~~~~~~~~~~~~
/usr/local/include/absl/strings/string_view.h: In member function ‘constexpr void absl::lts_20230125::string_view::swap(absl::lts_20230125::string_view&) const’:
/usr/local/include/absl/strings/string_view.h:358:13: error: passing ‘const absl::lts_20230125::string_view’ as ‘this’ argument discards qualifiers [-fpermissive]
  358 |     *this = s;
      |             ^
/usr/local/include/absl/strings/string_view.h:161:7: note: in call to ‘absl::lts_20230125::string_view& absl::lts_20230125::string_view::operator=(const absl::lts_20230125::string_view&)’
  161 | class string_view {
      |       ^~~~~~~~~~~
/usr/local/include/absl/strings/string_view.h:356:18: error: invalid return type ‘void’ of ‘constexpr’ function ‘constexpr void absl::lts_20230125::string_view::swap(absl::lts_20230125::string_view&) const’
  356 |   constexpr void swap(string_view& s) noexcept {
      |                  ^~~~
[ 21%] Linking CXX shared library ../../lib/libFedTree.so
make[2]: *** [src/FedTree/CMakeFiles/FedTree_DIST.dir/build.make:146: src/FedTree/CMakeFiles/FedTree_DIST.dir/FL/distributed_party.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/usr/bin/ld: /usr/local/lib/libntl.a(ZZ.o): relocation R_X86_64_TPOFF32 against `_ZN3NTLL8iodigitsE' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /usr/local/lib/libntl.a(fileio.o): relocation R_X86_64_TPOFF32 against `_ZZN3NTL8UniqueIDB5cxx11EvE37_ntl_hidden_variable_tls_local_ptr_ID' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /usr/local/lib/libntl.a(lip.o): relocation R_X86_64_TPOFF32 against `_ZZ10_ntl_gswapPP17_ntl_gbigint_bodyS1_E36_ntl_hidden_variable_tls_local_ptr_t' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /usr/local/lib/libntl.a(tools.o): relocation R_X86_64_TPOFF32 against symbol `_ZN3NTL16ErrorMsgCallbackE' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /usr/local/lib/libntl.a(thread.o): relocation R_X86_64_TPOFF32 against `_ZZN3NTL15CurrentThreadIDB5cxx11EvE37_ntl_hidden_variable_tls_local_ptr_ID' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /usr/local/lib/libntl.a(BasicThreadPool.o): relocation R_X86_64_TPOFF32 against `_ZZN3NTLL49_ntl_hidden_function_tls_access_NTLThreadPool_stgEvE52_ntl_hidden_variable_tls_local_ptr_NTLThreadPool_stg' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /usr/local/lib/libntl.a(lip.o): warning: relocation against `_ZTV21_ntl_tmp_vec_crt_fast' in read-only section `.text'
collect2: error: ld returned 1 exit status
make[2]: *** [src/FedTree/CMakeFiles/FedTree.dir/build.make:551: lib/libFedTree.so] Error 1
make[1]: *** [CMakeFiles/Makefile2:154: src/FedTree/CMakeFiles/FedTree.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
make[2]: *** [src/FedTree/CMakeFiles/FedTree_DIST.dir/build.make:160: src/FedTree/CMakeFiles/FedTree_DIST.dir/FL/distributed_server.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:232: src/FedTree/CMakeFiles/FedTree_DIST.dir/all] Error 2
[ 23%] Linking CXX static library ../../lib/libft_grpc_proto.a
[ 24%] Built target ft_grpc_proto
make: *** [Makefile:91: all] Error 2
```

It seems to come from the latest absl.
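For readers hitting the same diagnostics: the `#error` from `policy_checks.h` and the `constexpr void` errors are what GCC reports when these Abseil headers are compiled in C++11 mode, where a `constexpr` member function is implicitly `const` and cannot return `void`. The sketch below reproduces the same pattern in isolation. `TinyView`, `drop_prefix`, and `remaining_after` are hypothetical names invented for illustration; this is not Abseil's or FedTree's code. It should compile with `-std=c++14` or newer and be rejected with `-std=c++11` in much the same way as in the log above.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for a string view (hypothetical, NOT absl::string_view).
class TinyView {
 public:
  constexpr TinyView(const char* ptr, std::size_t length)
      : ptr_(ptr), length_(length) {}

  // C++11: a constexpr member function is implicitly const and must return
  // a literal type, so mutating members and returning void are both errors.
  // C++14 relaxed these rules, which is why this compiles there.
  constexpr void remove_prefix(std::size_t n) {
    ptr_ += n;
    length_ -= n;
  }

  constexpr std::size_t size() const { return length_; }
  constexpr char front() const { return *ptr_; }

 private:
  const char* ptr_;
  std::size_t length_;
};

// Uses the mutating constexpr member inside a constant expression (C++14+).
constexpr TinyView drop_prefix(TinyView view, std::size_t n) {
  view.remove_prefix(n);
  return view;
}

// Evaluated at compile time: dropping "fed" from "fedtree" leaves "tree".
static_assert(drop_prefix(TinyView("fedtree", 7), 3).size() == 4,
              "four characters should remain");
static_assert(drop_prefix(TinyView("fedtree", 7), 3).front() == 't',
              "the remaining view should start at 't'");

// The same member also works at run time.
inline std::size_t remaining_after(const char* text, std::size_t length,
                                   std::size_t n) {
  assert(n <= length);  // the caller must not drop more than the view holds
  TinyView view(text, length);
  view.remove_prefix(n);
  return view.size();
}
```

In a CMake project the language standard is commonly raised through the `CMAKE_CXX_STANDARD` variable; how FedTree's own CMakeLists.txt sets it is not shown in this thread.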

lidh15 commented 1 year ago

Okay, it's not about absl: updating CMakeLists.txt from C++11 to C++14 fixed that. The remaining problem is with zlib; the errors are:

/usr/bin/ld: /usr/local/lib/libgrpc.a(message_compress.cc.o): in function `zlib_compress(grpc_slice_buffer*, grpc_slice_buffer*, int)':
message_compress.cc:(.text+0x541): undefined reference to `deflateInit2_'
/usr/bin/ld: message_compress.cc:(.text+0x58b): undefined reference to `deflate'
/usr/bin/ld: message_compress.cc:(.text+0x660): undefined reference to `deflateEnd'
/usr/bin/ld: /usr/local/lib/libgrpc.a(message_compress.cc.o): in function `zlib_decompress(grpc_slice_buffer*, grpc_slice_buffer*, int)':
message_compress.cc:(.text+0x701): undefined reference to `inflateInit2_'
/usr/bin/ld: message_compress.cc:(.text+0x747): undefined reference to `inflate'
/usr/bin/ld: message_compress.cc:(.text+0x7ee): undefined reference to `inflateEnd'
collect2: error: ld returned 1 exit status
make[2]: *** [src/FedTree/CMakeFiles/FedTree-distributed-party.dir/build.make:164: bin/FedTree-distributed-party] Error 1
make[1]: *** [CMakeFiles/Makefile2:259: src/FedTree/CMakeFiles/FedTree-distributed-party.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
/usr/bin/ld: /usr/local/lib/libgrpc.a(message_compress.cc.o): in function `zlib_compress(grpc_slice_buffer*, grpc_slice_buffer*, int)':
message_compress.cc:(.text+0x541): undefined reference to `deflateInit2_'
/usr/bin/ld: message_compress.cc:(.text+0x58b): undefined reference to `deflate'
/usr/bin/ld: message_compress.cc:(.text+0x660): undefined reference to `deflateEnd'
/usr/bin/ld: /usr/local/lib/libgrpc.a(message_compress.cc.o): in function `zlib_decompress(grpc_slice_buffer*, grpc_slice_buffer*, int)':
message_compress.cc:(.text+0x701): undefined reference to `inflateInit2_'
/usr/bin/ld: message_compress.cc:(.text+0x747): undefined reference to `inflate'
/usr/bin/ld: message_compress.cc:(.text+0x7ee): undefined reference to `inflateEnd'
collect2: error: ld returned 1 exit status
make[2]: *** [src/FedTree/CMakeFiles/FedTree-distributed-server.dir/build.make:164: bin/FedTree-distributed-server] Error 1
make[1]: *** [CMakeFiles/Makefile2:286: src/FedTree/CMakeFiles/FedTree-distributed-server.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

QinbinLi commented 1 year ago

Hi @lidh15 ,

We use grpc 1.50.0 to generate the proto files. If you use a version other than 1.50.0, you may need to go to the src/FedTree/grpc directory and run the following commands to regenerate them. Then you can try to compile the library again. Thank you!

protoc -I ./ --grpc_out=. --plugin=protoc-gen-grpc=`which grpc_cpp_plugin` ./fedtree.proto
protoc -I ./ --cpp_out=. ./fedtree.proto

lidh15 commented 1 year ago

okay, I'll try.

lidh15 commented 1 year ago

I'm not sure whether to discuss this here or open a new issue: why doesn't the distributed server exit after a vertical GBDT training run? I understand that in the original horizontal federated learning architecture the server is meant to be a long-running service, but in vertical scenarios the "server" is usually also a "party", just the one that holds the labels. Would it be possible for "distributed-party" to take over the server's job and exit when the training task finishes?

QinbinLi commented 1 year ago

Thank you for this great suggestion! Indeed it'd be better if the server stops automatically when the task is over. We'll fix it in the future.

lidh15 commented 1 year ago

> Hi @lidh15 ,
>
> We use grpc 1.50.0 to generate the proto files. If you use a version other than 1.50.0, you may need to go to src/FedTree/grpc directory and run the following commands. Then you can try to compile the library. Thank you!
>
> protoc -I ./ --grpc_out=. --plugin=protoc-gen-grpc=`which grpc_cpp_plugin` ./fedtree.proto
> protoc -I ./ --cpp_out=. ./fedtree.proto

this didn't help

lidh15 commented 1 year ago

One more question: how many bits does N have in the Paillier HE used for vertical GBDT? Typically it is 2048, but I couldn't find this in the documentation.

QinbinLi commented 1 year ago

512 bits are used in the default setting. I just added the parameter key_length so that users can control the bits. Please refer to https://fedtree.readthedocs.io/en/latest/Parameters.html for details.
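To make "bits of N" concrete: in Paillier the key length is the bit length of the modulus n = p·q. The toy sketch below uses tiny fixed primes so that everything fits in 64-bit integers, and it also shows the additive homomorphism (multiplying ciphertexts adds plaintexts). This is not FedTree's implementation; every name here (`PaillierKey`, `make_key`, `encrypt`, `decrypt`, and the helpers) is hypothetical, the choice g = n + 1 is an assumption of the sketch, and real 512-bit or 2048-bit keys need a big-integer library and randomly generated primes.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;
using i64 = std::int64_t;

// Modular multiplication; safe here because every modulus stays below 2^32.
u64 mul_mod(u64 a, u64 b, u64 m) { return (a % m) * (b % m) % m; }

// Square-and-multiply modular exponentiation.
u64 pow_mod(u64 base, u64 exp, u64 m) {
  u64 result = 1 % m;
  for (base %= m; exp > 0; exp >>= 1) {
    if (exp & 1) result = mul_mod(result, base, m);
    base = mul_mod(base, base, m);
  }
  return result;
}

u64 gcd_u64(u64 a, u64 b) {
  while (b != 0) { u64 t = a % b; a = b; b = t; }
  return a;
}

// Modular inverse via the extended Euclidean algorithm.
u64 inv_mod(u64 a, u64 m) {
  i64 t = 0, new_t = 1;
  i64 r = static_cast<i64>(m), new_r = static_cast<i64>(a % m);
  while (new_r != 0) {
    i64 q = r / new_r;
    i64 next_t = t - q * new_t; t = new_t; new_t = next_t;
    i64 next_r = r - q * new_r; r = new_r; new_r = next_r;
  }
  assert(r == 1);  // an inverse exists only if gcd(a, m) == 1
  return static_cast<u64>(t < 0 ? t + static_cast<i64>(m) : t);
}

// Number of bits needed to represent x; for n this is the "key length".
unsigned bit_length(u64 x) {
  unsigned bits = 0;
  for (; x != 0; x >>= 1) ++bits;
  return bits;
}

struct PaillierKey { u64 n, n_squared, lambda, mu; };

// Builds a toy key from two small primes, assuming generator g = n + 1.
PaillierKey make_key(u64 p, u64 q) {
  u64 n = p * q;
  u64 lambda = (p - 1) / gcd_u64(p - 1, q - 1) * (q - 1);  // lcm(p-1, q-1)
  return PaillierKey{n, n * n, lambda, inv_mod(lambda, n)};
}

// c = g^m * r^n mod n^2, where g^m = 1 + m*n for g = n + 1.
// r must be coprime to n; the caller picks it (random in a real system).
u64 encrypt(const PaillierKey& key, u64 m, u64 r) {
  u64 g_to_m = (1 + (m % key.n) * key.n) % key.n_squared;
  return mul_mod(g_to_m, pow_mod(r, key.n, key.n_squared), key.n_squared);
}

// m = L(c^lambda mod n^2) * mu mod n, with L(x) = (x - 1) / n.
u64 decrypt(const PaillierKey& key, u64 c) {
  u64 x = pow_mod(c, key.lambda, key.n_squared);
  return mul_mod((x - 1) / key.n, key.mu, key.n);
}
```

With `make_key(61, 53)` the modulus is 3233, a 12-bit key; decrypting the product of two ciphertexts modulo n² yields the sum of the two plaintexts as long as that sum is smaller than n.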

For grpc 1.53.0, I have no idea why it fails. I'm considering adding a feature to automatically install a fixed version of grpc when compiling FedTree to avoid the grpc compatibility issue.