PKUHPC / CraneSched

A High Performance HPC and Cloud Computing Fused Job Scheduling System

Update the installation and configuration documentation for EL8-family systems #237

Closed MidsummerNight closed 5 months ago

MidsummerNight commented 6 months ago

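Deployment environment: AlmaLinux 8.9

Items in the documentation that need adjusting:

  1. EL8-family systems use chrony in place of ntp, so the clock-synchronization step should become dnf install chrony.
  2. Extra repositories must be enabled before the dependencies libcgroup-devel boost169-devel boost169-static zlib-devel zlib-static can be installed: dnf config-manager --set-enabled powertools, dnf install almalinux-release-devel, and dnf install epel-release. See the official AlmaLinux Wiki.
  3. The official AlmaLinux repositories do not appear to carry devtoolset-11; use gcc-toolset from AlmaLinux AppStream instead: dnf install gcc-toolset-11 (see the official Red Hat documentation).
  4. On AlmaLinux 8.9 the default Git version is 2.39 and the default CMake version is 3.26. Both meet the scheduler's minimum requirements, so there is no need to download and build CMake or to install rh-git218.
  5. Because gcc-toolset replaces devtoolset, use scl enable gcc-toolset-11 bash to start a shell session that uses GCC 11 (rather than source scl_source enable devtoolset-11).
  6. For the same reason, the CMake and Ninja configuration options for the first build must change to cmake -G Ninja -DCMAKE_C_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/gcc -DCMAKE_CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/g++ -DBoost_INCLUDE_DIR=/usr/include/boost169/ -DBoost_LIBRARY_DIR=/usr/lib64/boost169/ ..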
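
The package steps in items 1 to 3 can be collected into one script. The following is a minimal sketch, not the project's official procedure: run and el8_prepare are hypothetical helper names, the -y flag and the APPLY switch are additions for illustration, and a later comment in this thread notes that Boost is no longer a dependency. By default the script only prints the commands; set APPLY=1 (as root) to execute them.

```shell
#!/bin/sh
# Print each command; execute it only when APPLY=1 is set.
run() {
  echo "+ $*"
  if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

# EL8 preparation steps taken from items 1 to 3 of the list above.
el8_prepare() {
  run dnf install -y chrony
  # Enable the extra repositories before installing the dependencies.
  run dnf config-manager --set-enabled powertools
  run dnf install -y almalinux-release-devel
  run dnf install -y epel-release
  run dnf install -y libcgroup-devel boost169-devel boost169-static zlib-devel zlib-static
  run dnf install -y gcc-toolset-11
}

el8_prepare
# Afterwards, open a shell that uses GCC 11 (item 5):
#   scl enable gcc-toolset-11 bash
```

Printing first makes it easy to review the sequence on a test machine before touching the system.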

Problem encountered during deployment: we downloaded the CraneSched source archive from GitHub and built it. Running cmake -G Ninja -DCMAKE_C_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/gcc -DCMAKE_CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/g++ -DBoost_INCLUDE_DIR=/usr/include/boost169/ -DBoost_LIBRARY_DIR=/usr/lib64/boost169/ .. produced the following error:

[sysadmin@el8 build]$ cmake -G Ninja -DCMAKE_C_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/gcc -DCMAKE_CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/g++ -DBoost_INCLUDE_DIR=/usr/include/boost169/ -DBoost_LIBRARY_DIR=/usr/lib64/boost169/ ..
-- The C compiler identification is GNU 11.2.1
-- The CXX compiler identification is GNU 11.2.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- colorized output for gcc is enabled
-- -march=native enabled
-- All targets: concurrentqueue;pevents;result
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
CMake Error at dependencies/cmake/BSThreadPool/CMakeLists.txt:5 (MESSAGE):
  Thread pool library haven't been synchronized to gitee.  Set
  CRANE_USE_GITEE_SOURCE to OFF.

-- Configuring incomplete, errors occurred!

Running export CRANE_USE_GITEE_SOURCE=OFF in the shell first gave the same error:

[sysadmin@el8 build]$ export CRANE_USE_GITEE_SOURCE=OFF
[sysadmin@el8 build]$ cmake -G Ninja -DCMAKE_C_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/gcc -DCMAKE_CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/g++ -DBoost_INCLUDE_DIR=/usr/include/boost169/ -DBoost_LIBRARY_DIR=/usr/lib64/boost169/ ..
-- colorized output for gcc is enabled
-- -march=native enabled
-- All targets: concurrentqueue;pevents;result
CMake Error at dependencies/cmake/BSThreadPool/CMakeLists.txt:5 (MESSAGE):
  Thread pool library haven't been synchronized to gitee.  Set
  CRANE_USE_GITEE_SOURCE to OFF.

Adding -CRANE_USE_GITEE_SOURCE=OFF to the command produced the following error; for some reason CRANE_USE_GITEE_SOURCE was read as RANE_USE_GITEE_SOURCE:

[sysadmin@el8 build]$ cmake -G Ninja -DCMAKE_C_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/gcc -DCMAKE_CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/g++ -DBoost_INCLUDE_DIR=/usr/include/boost169/ -DBoost_LIBRARY_DIR=/usr/lib64/boost169/
 -CRANE_USE_GITEE_SOURCE=OFF ..
loading initial cache file RANE_USE_GITEE_SOURCE=OFF
CMake Error: Error processing file: /home/sysadmin/CraneSched-master/build/RANE_USE_GITEE_SOURCE=OFF
-- colorized output for gcc is enabled
-- -march=native enabled
-- All targets: concurrentqueue;pevents;result
CMake Error at dependencies/cmake/BSThreadPool/CMakeLists.txt:5 (MESSAGE):
  Thread pool library haven't been synchronized to gitee.  Set
  CRANE_USE_GITEE_SOURCE to OFF.

-- Configuring incomplete, errors occurred!

Where does this problem come from, and how should it be resolved?

Ashlee1994 commented 6 months ago

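We have developed a one-click automated deployment tool for cluster installation. It has been fully tested on CentOS and Ubuntu and will be released soon. With it, you specify the IP addresses of the control node and the compute nodes along with the common configuration parameters, and it completes the whole cluster installation and configuration.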

L-Xiafeng commented 6 months ago

> Adding -CRANE_USE_GITEE_SOURCE=OFF to the command produced the following error; for some reason CRANE_USE_GITEE_SOURCE was read as RANE_USE_GITEE_SOURCE

Changing -CRANE_USE_GITEE_SOURCE=OFF to -DCRANE_USE_GITEE_SOURCE=OFF should fix your problem.
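
A note on why the flag was misread: cmake accepts both -C <initial-cache> (preload a script that populates the cache) and -D <var>=<value> (create a cache entry), and either may be written with its argument attached. The log line "loading initial cache file RANE_USE_GITEE_SOURCE=OFF" is consistent with -C being taken as the option and the rest as a file name. The toy function below only illustrates that prefix split; describe_cmake_flag is a hypothetical helper, not part of CMake, and CMake's real parser handles many more cases.

```shell
#!/bin/sh
# Toy illustration only: split a one-letter option prefix from its argument.
# This is NOT CMake's actual parser.
describe_cmake_flag() {
  case "$1" in
    -D*) echo "define cache entry: ${1#-D}" ;;
    -C*) echo "load initial cache file: ${1#-C}" ;;
    *)   echo "other argument: $1" ;;
  esac
}

describe_cmake_flag -CRANE_USE_GITEE_SOURCE=OFF   # the form that failed
describe_cmake_flag -DCRANE_USE_GITEE_SOURCE=OFF  # the suggested form
```

The earlier export attempt most likely had no effect for a related reason: an exported shell variable is visible to CMake only through $ENV{...}, so it does not create the cache entry that -D does.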

RileyWen commented 6 months ago

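The installation documentation has indeed been neglected; for example, the project no longer depends on Boost. The dependency packages on Gitee have not been updated recently, and the dependency versions have since moved on, so we added an Error wherever the Gitee copy is out of date. The Gitee option is now off by default. #238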

MidsummerNight commented 6 months ago

Hello, changing -CRANE_USE_GITEE_SOURCE=OFF to -DCRANE_USE_GITEE_SOURCE=OFF did resolve the problem above, but the following error then appeared:

CMake Error at /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find LibAIO (missing: LIBAIO_LIBRARY LIBAIO_INCLUDE_DIR)
Call Stack (most recent call first):
  /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
  CMakeModule/FindLibAIO.cmake:8 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
  CMakeLists.txt:223 (find_package)

-- Configuring incomplete, errors occurred!
[sysadmin@el8 build]$

The full terminal output is attached: terminal_output.log

L-Xiafeng commented 6 months ago

Installing the libaio library in your environment, either with dnf or from source, should fix this.

MidsummerNight commented 6 months ago

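Hello, after installing libaio-devel the problem above was resolved, but a large number of errors like the following appeared (they differ only in where the CMake Error is located):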

CMake Error at src/CraneCtld/CMakeLists.txt:1 (add_executable):
  The install of the cranectld target requires changing an RPATH from the
  build tree, but this is not supported with the Ninja generator unless on an
  ELF-based or XCOFF-based platform.  The CMAKE_BUILD_WITH_INSTALL_RPATH
  variable may be set to avoid this relinking step.

For the full terminal output, see terminal_output_2.log

L-Xiafeng commented 6 months ago

You can build with make instead, or not use install.

MidsummerNight commented 6 months ago

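Hello, thank you for the reply. Neither CraneSched/ nor CraneSched/build contains a Makefile, so make cannot be run. As for "not use install", do you mean commenting out the entire "install binaries" and "Install configuration files" sections at the end of CraneSched/CMakeLists.txt and then rerunning cmake?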

L-Xiafeng commented 6 months ago

Remove -G Ninja from cmake -G Ninja -DCMAKE_C_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/gcc -DCMAKE_CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/g++ -DBoost_INCLUDE_DIR=/usr/include/boost169/ -DBoost_LIBRARY_DIR=/usr/lib64/boost169/ .., clean up the files CMake generated, and rerun cmake; a Makefile will then be generated.

MidsummerNight commented 6 months ago

Running cmake as described above produced the errors Could NOT find SASL2, Could NOT find libbfd, Could NOT find libdwarf, and Could NOT find Gnuplot, ending with the failure message Configuring incomplete, errors occurred!. Running sudo dnf install cyrus-sasl-devel binutils-devel libdwarf libdwarf-devel gnuplot resolved them.

After that was fixed, two things happened:

  1. The terminal printed this error:

    CMake Error at /home/sysadmin/CraneSched-master/build/_deps/backward-subbuild/backward-populate-prefix/tmp/backward-populate-gitupdate.cmake:97 (execute_process):
    execute_process failed command indexes:
    
    1: "Child return code: 128"

    This error appears to be intermittent: sometimes it occurs and sometimes it does not.

  2. When the error above does not occur, the same error reported earlier, after libaio-devel was installed, appears again:

CMake Error at src/CraneCtld/CMakeLists.txt:1 (add_executable):
  The install of the cranectld target requires changing an RPATH from the
  build tree, but this is not supported with the Ninja generator unless on an
  ELF-based or XCOFF-based platform.  The CMAKE_BUILD_WITH_INSTALL_RPATH
  variable may be set to avoid this relinking step.

For the full output of the two cases, see terminal_output_3_case_1.log and terminal_output_3_case_2.log respectively.

L-Xiafeng commented 6 months ago

The first case is a network problem on your side. I suggest deleting the build directory and rerunning cmake. If you can, I recommend using CentOS 7.

MidsummerNight commented 6 months ago

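Hello, after deleting and recreating the build directory and rerunning cmake, a Makefile appeared in the build directory, but the following error was reported while running make: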

[ 70%] Running gRPC C++ protocol buffer compiler on PublicDefs.proto
/home/sysadmin/CraneSched-master/generated/protos/: No such file or directory
make[2]: *** [protos/CMakeFiles/crane_proto_lib.dir/build.make:74: /home/sysadmin/CraneSched-master/generated/protos/PublicDefs.grpc.pb.cc] Error 1
make[1]: *** [CMakeFiles/Makefile2:10227: protos/CMakeFiles/crane_proto_lib.dir/all] Error 2
make: *** [Makefile:166: all] Error 2

The full cmake and make output is attached: terminal_output_4_cmake.log terminal_output_4_make.log

RileyWen commented 6 months ago

Just mkdir the generated/protos/ directory. The CMakeLists is indeed missing a line: ninja creates the directory automatically, but make does not, which causes the error.
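
The advice in the last few replies (drop -G Ninja, start from a clean build directory, create generated/protos/ by hand) can be combined into one sequence. This is a minimal sketch under assumptions: prepare_tree is a hypothetical helper name, and the source path in the usage comments is the one that appears in the logs above.

```shell
#!/bin/sh
# Recreate an empty build directory and pre-create generated/protos,
# which make (unlike ninja) does not create on its own.
prepare_tree() {
  src=$1
  rm -rf "$src/build"
  mkdir -p "$src/build" "$src/generated/protos"
}

# Usage (not executed here; run inside `scl enable gcc-toolset-11 bash`):
#   prepare_tree "$HOME/CraneSched-master"
#   cd "$HOME/CraneSched-master/build"
#   # Without -G Ninja, CMake on Linux generates Makefiles by default.
#   cmake -DCMAKE_C_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/gcc \
#         -DCMAKE_CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/g++ \
#         -DCRANE_USE_GITEE_SOURCE=OFF ..
#   make
```

Deleting the whole build directory, rather than individual files, avoids reusing a CMakeCache.txt that still records the Ninja generator.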

MidsummerNight commented 6 months ago

Hello, after running make as you advised, this message appeared:

[100%] Building CXX object src/Craned/CMakeFiles/craned.dir/CranedServer.cpp.o
/home/sysadmin/CraneSched-master/src/Craned/CranedServer.cpp: In member function ‘virtual grpc::Status Craned::CranedServiceImpl::SrunXStream(grpc::ServerContext*, grpc::ServerReaderWriter<crane::grpc::SrunXStreamReply, crane::grpc::SrunXStreamRequest>*)’:
/home/sysadmin/CraneSched-master/src/Craned/CranedServer.cpp:215:56: warning: ‘CraneErr Craned::TaskManager::SpawnInteractiveTaskAsync(uint32_t, std::string, std::__cxx11::list<std::__cxx11::basic_string<char> >, std::function<void(std::__cxx11::basic_string<char>&&, void*)>, std::function<void(bool, int, void*)>)’ is deprecated [-Wdeprecated-declarations]
  215 |             err = g_task_mgr->SpawnInteractiveTaskAsync(
      |                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
  216 |                 task_id, request.exec_info().executive_path(), std::move(args),
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  217 |                 std::move(output_callback), std::move(finish_callback));
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/sysadmin/CraneSched-master/src/Craned/CranedServer.h:24,
                 from /home/sysadmin/CraneSched-master/src/Craned/CranedServer.cpp:17:
/home/sysadmin/CraneSched-master/src/Craned/TaskManager.h:190:27: note: declared here
  190 |   [[deprecated]] CraneErr SpawnInteractiveTaskAsync(
      |                           ^~~~~~~~~~~~~~~~~~~~~~~~~
[100%] Building CXX object src/Craned/CMakeFiles/craned.dir/Craned.cpp.o
[100%] Linking CXX executable craned
[100%] Built target craned

All other targets finished building. Does this mean the CraneSched back end is fully built? The full output is here: terminal_output_5_make.log

MidsummerNight commented 6 months ago

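We are now building the scheduler front end. Following the documentation up to step 4, the "generate proto files" step in Crane-FrontEnd/protos fails with the following error: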

[root@el8 protos]# protoc --go_out=../generated --go-grpc_out=../generated ./*
protoc-gen-go-grpc: program not found or is not executable
Please specify a program using absolute path or make sure the program is available in your PATH system variable
--go-grpc_out: protoc-gen-go-grpc: Plugin failed with status code 1.

The plugins were installed after setting the Alibaba Cloud goproxy mirror:

export GOPROXY=https://mirrors.aliyun.com/goproxy/
go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest

The earlier steps, installing protoc and fetching the code, all completed without errors.

RileyWen commented 6 months ago

The back end counts as built. For the front-end problem, check your PATH and add the directory that contains the plugin binary to it.
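
For reference, go install places binaries in GOBIN if it is set, and otherwise in the bin/ directory of the first GOPATH entry (by default $HOME/go/bin). The sketch below adds that directory to PATH; go_install_dir is a hypothetical helper name, and the fallback branch is an assumption for shells where go itself is not yet on PATH.

```shell
#!/bin/sh
# Print the directory where `go install` places binaries.
go_install_dir() {
  if command -v go >/dev/null 2>&1; then
    gobin=$(go env GOBIN)
    gopath=$(go env GOPATH)
  else
    # Fallback when go is not on PATH: Go's default GOPATH is $HOME/go.
    gobin=${GOBIN:-}
    gopath=${GOPATH:-$HOME/go}
  fi
  if [ -n "$gobin" ]; then
    echo "$gobin"
  else
    # GOPATH may be a colon-separated list; the first entry is used.
    echo "${gopath%%:*}/bin"
  fi
}

PATH="$PATH:$(go_install_dir)"
export PATH
# Check whether the plugin that protoc reported missing is now found.
command -v protoc-gen-go-grpc || echo "protoc-gen-go-grpc is still not on PATH"
```

To make the change permanent, the same PATH line can go into the shell's startup file for the user who runs protoc.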

MidsummerNight commented 6 months ago

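Hello, after uninstalling Go and its modules and redoing the configuration from scratch according to the front-end installation documentation, the problem above was resolved. However, at step 4, building the binaries, the builds of cbatch, ccancel, ccontrol, and cinfo all reported that Go 1.20 or later is required. So I uninstalled the Go 1.17.3 that the tutorial had me install and installed Go 1.20.12 from the AlmaLinux PowerTools repository with dnf install golang. The remaining steps followed the installation documentation (GOROOT and GOPATH must be set to the values listed by go env), and the front-end build finally completed.

Next I will try to finish configuring the back end (starting from step 5, configuring PAM), which I forgot to continue after building it.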

MidsummerNight commented 6 months ago

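Hello, in your PAM configuration section I cannot tell which lines of /etc/pam.d/sshd are the red ones, so I changed the sshd file to match the one in your tutorial exactly.

The db.auth("admin","123456") at the end of the MongoDB section uses Chinese full-width parentheses. After changing them to ASCII parentheses, the following was reported: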

test> db.auth("admin","123456")
MongoServerError[AuthenticationFailed]: Authentication failed.

However, the earlier step that created the user succeeded:

test> use admin
switched to db admin
admin> db.createUser({
...   user:'admin', pwd:'123456', roles:[{ role:'root',db:'admin'}]
... })
{ ok: 1 }
admin>

The subsequent server shutdown went like this:

admin> db.shutdownServer()
MongoNetworkError: connection 5 to 127.0.0.1:27017 closed
admin> quit

(The admin password really was set to 123456.) We have not worked with MongoDB before; did we get something wrong?

RileyWen commented 6 months ago

Please Google these configuration steps; this is fairly basic material.
