Tlntin commented 2 years ago

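Why compile it yourself

Because the official package does not support Ubuntu 22.04 (the gcc and glibc versions shipped with the system are too new). Importing it fails with:

ImportError: /home/ubuntu/anaconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/core_avx.so: undefined symbol: _dl_sym, version GLIBC_PRIVATE

Reference: https://www.paddlepaddle.org.cn/documentation/docs/zh/install/compile/linux-compile.html
Approach: install gcc/glibc through conda and leave the system gcc/glibc untouched, so the system itself does not pick up mysterious bugs.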
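To see which glibc the machine actually provides before deciding whether a self-build is needed, something like the following may help. It is only a sketch: the function name `glibc_version` is made up for illustration, and it assumes a Linux system where `os.confstr` knows the `CS_GNU_LIBC_VERSION` key.

```python
import os
import platform


def glibc_version():
    """Return the system glibc version as a tuple of ints, or None if unknown."""
    try:
        # On glibc systems this returns a string such as "glibc 2.35".
        text = os.confstr("CS_GNU_LIBC_VERSION")
    except (AttributeError, ValueError, OSError):
        text = None
    if not text:
        # Fall back to what the Python binary itself was linked against.
        name, version = platform.libc_ver()
        text = f"{name} {version}" if name == "glibc" and version else None
    if not text:
        return None
    return tuple(int(part) for part in text.split()[1].split(".") if part.isdigit())


if __name__ == "__main__":
    print("glibc:", glibc_version())
```

On the Ubuntu 22.04 system described here this should report a much newer glibc than the 2.17 sysroot installed through conda later in the post.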

Environment

- System Python (not really relevant, shown only for reference)
```
$ /usr/bin/python3 --version
Python 3.10.4
```
- cmake (a newer version is recommended; apparently 3.19 or later is required)
```
$ cmake --version
cmake version 3.22.1

CMake suite maintained and supported by Kitware (kitware.com/cmake).
```
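The version requirement can be checked mechanically. The sketch below parses the text printed by `cmake --version` and compares it against 3.19; that minimum is the post's own rough estimate, not a confirmed requirement, and `parse_cmake_version` is a name chosen for this example.

```python
import re


def parse_cmake_version(output):
    """Extract (major, minor, patch) from the text printed by `cmake --version`."""
    match = re.search(r"cmake version (\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError("no cmake version found in output")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


# Assumption taken from the post: roughly 3.19 or later is needed.
MINIMUM = (3, 19, 0)

sample = (
    "cmake version 3.22.1\n\n"
    "CMake suite maintained and supported by Kitware (kitware.com/cmake)."
)
version = parse_cmake_version(sample)
print(version, "ok" if version >= MINIMUM else "too old")
```

In practice the `sample` string would be replaced by the captured output of the real command, for example via `subprocess.run(["cmake", "--version"], capture_output=True, text=True).stdout`.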


- System gcc (not really relevant, shown only for reference)
```bash
$ gcc --version
gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```
- System description
```bash
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"
```
- CUDA
```bash
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
```
- cuDNN

Because the gcc installed through conda does not read the system's C/C++ include paths, cuDNN has to be installed from the tar archive.
The tar.xz package used here is cudnn-linux-x86_64-8.4.1.50_cuda11.6-archive.tar.xz

A quick install procedure:
```bash
tar -xvf cudnn-linux-x86_64-8.4.1.50_cuda11.6-archive.tar.xz
cd cudnn-linux-x86_64-8.4.1.50_cuda11.6-archive
sudo cp -r include/* /usr/local/cuda/include/
sudo cp -r lib/* /usr/local/cuda/lib64
# Refresh the library cache and check the result
sudo ldconfig -v | grep libcudnn
```
The output shows that cuDNN 8.4.1 is now detected:
```bash
/sbin/ldconfig.real: Path `/usr/lib' given more than once (from :0 and :0)
libcudnn_ops_train.so.8 -> libcudnn_ops_train.so.8.4.1
libcudnn_cnn_train.so.8 -> libcudnn_cnn_train.so.8.4.1
libcudnn_ops_infer.so.8 -> libcudnn_ops_infer.so.8.4.1
libcudnn_adv_infer.so.8 -> libcudnn_adv_infer.so.8.4.1
libcudnn.so.8 -> libcudnn.so.8.4.1
libcudnn_adv_train.so.8 -> libcudnn_adv_train.so.8.4.1
libcudnn_cnn_infer.so.8 -> libcudnn_cnn_infer.so.8.4.1
```
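Besides `ldconfig`, the copied headers can confirm which cuDNN version landed in /usr/local/cuda/include. The following is a rough sketch that assumes the cuDNN 8 layout, where `cudnn_version.h` carries the `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL` macros; the function name is invented for this example.

```python
import re
from pathlib import Path


def cudnn_version_from_header(text):
    """Read CUDNN_MAJOR / CUDNN_MINOR / CUDNN_PATCHLEVEL from header text."""
    values = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^\s*#define\s+{name}\s+(\d+)", text, re.MULTILINE)
        if match is None:
            return None  # macro missing: not a cuDNN version header
        values.append(int(match.group(1)))
    return tuple(values)


if __name__ == "__main__":
    # Path assumed from the copy commands in the post.
    header = Path("/usr/local/cuda/include/cudnn_version.h")
    if header.is_file():
        print("cuDNN:", cudnn_version_from_header(header.read_text()))
    else:
        print("header not found:", header)
```

If the header and the libraries listed by `ldconfig` disagree, an older copy of cuDNN is probably still on the library path.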


- Install NCCL (only needed for multi-GPU setups, but the build still seems to ask for it even with the multi-GPU option turned off, so it has to be installed anyway, and only from an archive or from source)
- Official method (archive install)
- Download page: https://developer.nvidia.com/nccl/nccl-download
- Choose "O/S agnostic local installer"
- Then extract it and copy the files into the CUDA directory
```bash
tar -xvf nccl_2.13.4-1+cuda11.7_x86_64.txz
cd nccl_2.13.4-1+cuda11.7_x86_64
sudo cp -r include/* /usr/local/cuda/include/
sudo cp -r lib/* /usr/local/cuda/lib64
```
- Build from source (recommended, since a build against your own CUDA tends to be more compatible)
```bash
git clone https://github.com/NVIDIA/nccl.git
cd nccl
git checkout v2.13.4-1
make pkg.txz.build -j12
# If you get a flood of sm_35 deprecation warnings, you can delete
# -gencode=arch=compute_35,code=sm_35 from makefiles/common.mk; leaving it in is also fine.
# Before:
#   CUDA8_GENCODE = -gencode=arch=compute_35,code=sm_35 \
#                   -gencode=arch=compute_50,code=sm_50 \
#                   -gencode=arch=compute_60,code=sm_60 \
#                   -gencode=arch=compute_61,code=sm_61
# After:
#   CUDA8_GENCODE = -gencode=arch=compute_50,code=sm_50 \
#                   -gencode=arch=compute_60,code=sm_60 \
#                   -gencode=arch=compute_61,code=sm_61
# The build takes roughly 20 minutes.
cd build/pkg/txz
tar -xvf nccl_2.13.4-1+cuda11.6_x86_64.txz
cd nccl_2.13.4-1+cuda11.6_x86_64
sudo cp -r include/* /usr/local/cuda/include/
sudo cp -r lib/* /usr/local/cuda/lib64
```

Preparation

- Create and activate the virtual environment
```bash
conda create -n paddle python==3.9.12
conda activate paddle
```
- Collect the Python paths
```bash
# Root of the active conda environment (two levels above bin/python3)
ENV_ROOT=$(dirname "$(dirname "$(which python3)")")
export PYTHON_LIBRARY=$(find "${ENV_ROOT}" -name "libpython3.so")
export PATH=${PYTHON_LIBRARY}:$PATH
export PYTHON_INCLUDE_DIRS=$(find "${ENV_ROOT}/include" -name "python3.9")
# Keep only the python3 location that lives inside a conda env
export PYTHON3_EXECUTABLE=$(for p in $(whereis python3); do echo "$p" | grep env; done)
```
- Print the Python variables
```bash
echo PYTHON_LIBRARY=${PYTHON_LIBRARY}
echo PYTHON_INCLUDE_DIRS=${PYTHON_INCLUDE_DIRS}
echo PYTHON3_EXECUTABLE=${PYTHON3_EXECUTABLE}

# Output
PYTHON_LIBRARY=/home/tlntin/anaconda3/envs/paddle/lib/libpython3.so
PYTHON_INCLUDE_DIRS=/home/tlntin/anaconda3/envs/paddle/include/python3.9
PYTHON3_EXECUTABLE=/home/tlntin/anaconda3/envs/paddle/bin/python3
```

- Install numpy and record its include directory
```
pip install numpy
export PYTHON3_NUMPY_INCLUDE_DIRS=$(python -c "import numpy as np; print(np.__path__[0] + '/core/include')")
echo PYTHON3_NUMPY_INCLUDE_DIRS=$PYTHON3_NUMPY_INCLUDE_DIRS
```
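The Python and NumPy paths gathered with `find`, `whereis` and `np.__path__` can also be asked of the interpreter directly, which avoids depending on the directory layout. This is a sketch of that alternative rather than what the post ran; `python_build_paths` is a hypothetical helper, while `sysconfig.get_paths`, `sysconfig.get_config_var` and `numpy.get_include` are the standard calls it relies on.

```python
import sys
import sysconfig


def python_build_paths():
    """Collect interpreter, header and library locations for the active environment."""
    paths = {
        "executable": sys.executable,
        "include_dir": sysconfig.get_paths()["include"],
        # Directory holding libpython; may be None on some platforms.
        "library_dir": sysconfig.get_config_var("LIBDIR"),
        "numpy_include_dir": None,
    }
    try:
        import numpy

        # numpy.get_include() is NumPy's own way to report its header directory.
        paths["numpy_include_dir"] = numpy.get_include()
    except ImportError:
        pass  # numpy is not installed in this environment yet
    return paths


if __name__ == "__main__":
    for key, value in python_build_paths().items():
        print(f"{key}={value}")
```

Run inside the activated `paddle` environment, the values should agree with the ones exported in the shell.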

- Install protobuf
```bash
pip install protobuf==3.20.0
```
- Install patchelf
```
pip install patchelf
```

- Install gcc-8, g++-8 and glibc-2.17 (because the protobuf used by Paddle only supports compilers up to gcc-8)
```bash
# Running this through a proxy is recommended, otherwise it is slow
# Set the proxy
conda config --set proxy_servers.http http://xxxx
# Install
conda install -c conda-forge gcc=8 gxx=8 sysroot_linux-64=2.17
```
- Check your gcc/g++ versions again (only the virtual environment is affected, not the system)
```bash
$ gcc --version
gcc (conda-forge gcc 8.5.0-16) 8.5.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ g++ --version
g++ (conda-forge gcc 8.5.0-16) 8.5.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```
- Install yaml (the build complained that the yaml module was missing, so install it)
```bash
pip install pyyaml
```

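Build

- Fetch the source and switch to the latest release branch
```bash
git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle
git checkout release/2.3
```
- Create and enter the build directory
```
mkdir build && cd build
```
- Set the target Paddle version
```
export PADDLE_VERSION="2.3.1"
```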

- Configure the build (TensorRT not enabled)
```bash
# PYTHON_INCLUDE_DIRS and PYTHON_LIBRARY are the variables exported during preparation
cmake .. \
  -DWITH_CONTRIB=OFF \
  -DWITH_MKL=ON \
  -DWITH_MKLDNN=ON \
  -DWITH_TESTING=OFF \
  -DCMAKE_BUILD_TYPE=Release \
  -DWITH_INFERENCE_API_TEST=OFF \
  -DWITH_GPU=ON \
  -DCUDNN_ROOT=/usr/local/cuda \
  -DON_INFER=ON \
  -DWITH_PYTHON=ON \
  -DPYTHON3_EXECUTABLE=${PYTHON3_EXECUTABLE} \
  -DPYTHON3_INCLUDE_DIR=${PYTHON_INCLUDE_DIRS} \
  -DPYTHON3_LIBRARY=${PYTHON_LIBRARY} \
  -DPYTHON3_NUMPY_INCLUDE_DIRS=${PYTHON3_NUMPY_INCLUDE_DIRS} \
  -DWITH_TENSORRT=OFF
```
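A long cmake line is easy to get subtly wrong, for instance by passing an option twice or by expanding a shell variable that was never set. One way to guard against that is to assemble the arguments from a dictionary, as in this sketch. `cmake_command` is a helper invented here, and the option names and paths are simply copied from the post, not checked against Paddle's build system.

```python
def cmake_command(source_dir, options):
    """Build a cmake argument list from a dict; booleans become ON/OFF."""
    args = ["cmake", source_dir]
    for key, value in options.items():
        if isinstance(value, bool):
            value = "ON" if value else "OFF"
        if value is None or str(value) == "":
            # An empty value usually means an environment variable was not set.
            raise ValueError(f"{key} has no value; is the variable set?")
        args.append(f"-D{key}={value}")
    return args


# Example values copied from the post; the paths are specific to that machine.
options = {
    "WITH_MKL": True,
    "WITH_GPU": True,
    "WITH_TENSORRT": False,
    "CUDNN_ROOT": "/usr/local/cuda",
    "PYTHON3_EXECUTABLE": "/home/tlntin/anaconda3/envs/paddle/bin/python3",
}
print(" ".join(cmake_command("..", options)))
```

Because dictionary keys are unique, an option cannot appear twice, and the empty-value check surfaces a missing export before cmake silently falls back to its own detection. The resulting list can be handed to `subprocess.run` from inside the build directory.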

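- Run the build. Note that this step needs unrestricted access to GitHub (a proxy may be required), because make pulls third-party sources from GitHub. It takes roughly 1-2 hours.
```
make -j10
```
- Partway through, the build failed with "too many open files"; the limit on open files, 1024 by default, has to be raised
```
# Before the change: 1024
$ ulimit -Sn
1024
# Raise it to 9192
ulimit -n 9192
# After the change
$ ulimit -Sn
9192
```
- Then resume the build; earlier progress is kept
```
make -j10
```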
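The same limit can be inspected and raised from Python with the standard `resource` module, which is handy when the build is driven by a script. This is a sketch under the assumption of a Unix system; `raise_open_file_limit` is a name made up for this example, and 9192 is simply the value used in the post.

```python
import resource


def raise_open_file_limit(target=9192):
    """Raise the soft RLIMIT_NOFILE towards target, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # the soft limit may not exceed the hard limit
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard  # already high enough
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        pass  # the platform refused the change; report the current values
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("open file limit (soft, hard):", raise_open_file_limit())
```

The new limit applies only to the calling process and its children, so `make -j10` would have to be started from the same script (for example with `subprocess.run`) to benefit from it.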

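- Collect the package. It is in python/dist under the build directory, with these attributes:
```bash
cd python/dist
ls -lh
.rw-r--r-- ubuntu ubuntu 167 MB Wed Jul 27 17:33:44 2022  paddlepaddle_gpu-0.0.0-cp39-cp39-linux_x86_64.whl
```
- Install the package. In theory it should work on any Ubuntu 22.04/20.04 machine with an RTX 30-series GPU and the same CUDA/cuDNN/NCCL versions as mine, with cuDNN and NCCL installed from archives. The version shows 0.0.0 because that is what a self-compiled build reports when PADDLE_VERSION is not set.
```
pip install paddlepaddle_gpu-0.0.0-cp39-cp39-linux_x86_64.whl
```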
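The wheel's file name already says which interpreter it was built for: `cp39` means CPython 3.9, so it will only install into a matching environment. The sketch below splits the name into the fields defined by the wheel naming convention (PEP 427) and compares the Python tag with the running interpreter; both function names are illustrative, and wheels carrying an optional build tag are deliberately not handled.

```python
import sys


def parse_wheel_name(filename):
    """Split a wheel file name into its components (no build tag supported)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError("expected name-version-python-abi-platform")
    keys = ("name", "version", "python_tag", "abi_tag", "platform_tag")
    return dict(zip(keys, parts))


def matches_interpreter(info):
    """True when the wheel's CPython tag equals the running interpreter's."""
    tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return info["python_tag"] == tag


info = parse_wheel_name("paddlepaddle_gpu-0.0.0-cp39-cp39-linux_x86_64.whl")
print(info["version"], info["python_tag"], matches_interpreter(info))
```

A `False` in the last column means pip would reject the wheel in the current environment.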

Testing

```bash
$ python3
Python 3.9.12 (main, Jun  1 2022, 11:38:51)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import paddle
>>> paddle.utils.run_check()
Running verify PaddlePaddle program ...
W0727 17:46:03.775210 12918 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.6
W0727 17:46:03.796252 12918 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
PaddlePaddle works well on 1 GPU.
I0727 17:46:06.006351 12918 parallel_executor.cc:486] Cross op memory reuse strategy is enabled, when build_strategy.memory_optimize = True or garbage collection strategy is disabled, which is not recommended
PaddlePaddle works well on 1 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
```

Running the official test code also looks fine; training on the GPU works.

```bash
$ python3 test_paddle.py
The dataset has 10 label classes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
W0727 17:54:46.313586 13232 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.6
W0727 17:54:46.325191 13232 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/5
step 938/938 [==============================] - loss: 0.1149 - acc: 0.9398 - 14ms/step
Epoch 2/5
step 938/938 [==============================] - loss: 0.0688 - acc: 0.9760 - 13ms/step
Epoch 3/5
step 938/938 [==============================] - loss: 0.0354 - acc: 0.9809 - 11ms/step
Epoch 4/5
step 938/938 [==============================] - loss: 0.0052 - acc: 0.9833 - 13ms/step
Epoch 5/5
step 938/938 [==============================] - loss: 0.0110 - acc: 0.9855 - 12ms/step
```

The code is as follows:
```python
import paddle

# Use the GPU
paddle.device.set_device("gpu:0")

from paddle.vision.transforms import Normalize
from paddle.vision.datasets import MNIST
from paddle.vision.models import LeNet
import numpy as np

# Fetch the dataset
transform = Normalize(mean=[127.5], std=[127.5], data_format="CHW")
train_dataset = MNIST(mode="train", transform=transform)
valid_dataset = MNIST(mode="test", transform=transform)

# Collect the dataset classes
y_list = [da[1][0] for da in train_dataset]
num_list = list(set(y_list))
num_classes = len(num_list)
print(f"The dataset has {num_classes} label classes: {num_list}")

# Build the model
pre_model = LeNet(num_classes=num_classes)
model = paddle.Model(pre_model)
adam = paddle.optimizer.Adam(learning_rate=1e-3, parameters=model.parameters())
model.prepare(adam, loss=paddle.nn.CrossEntropyLoss(), metrics=paddle.metric.Accuracy())

# Train the model
model.fit(train_data=train_dataset, batch_size=64, verbose=1, epochs=5)
```


- GPU usage looks normal
```bash
$ nvidia-smi
Wed Jul 27 17:55:58 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57       Driver Version: 516.59       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 36%   34C    P2   120W / 370W |   3228MiB / 24576MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     13353      C   /python3.9                      N/A      |
+-----------------------------------------------------------------------------+
```

Related files: https://cloud.189.cn/t/RJ7Rz2IzuaMv (access code: ym82)

paddle-bot[bot] commented 2 years ago

Hi! We've received your issue and will arrange for technicians to answer it as soon as possible; please be patient. Please check again that you have provided a clear problem description, reproduction code, environment and version details, and error messages. You can also look for answers in the official API documentation, the FAQ, past GitHub issues and the AI community. Have a nice day!

frankwhzhang commented 2 years ago

Hello, thank you for sharing this. Is there anything else that needs following up?

Tlntin commented 2 years ago

One addition: exporting PADDLE_VERSION="2.3.1" before running the cmake command lets the build pick up the correct version number. Nothing else.

tikboaHIT commented 2 years ago

> One addition: exporting PADDLE_VERSION="2.3.1" before running the cmake command lets the build pick up the correct version number. Nothing else.

Thank you very much for the installation guide. After following your steps one by one, I ran into the following .cu compilation errors:

1 error detected in the compilation of "/home/dell/code/Paddle/paddle/phi/kernels/funcs/eigen/reverse.cu".
1 error detected in the compilation of "/home/dell/code/Paddle/paddle/phi/kernels/funcs/eigen/pad.cu".
make[2]: [paddle/phi/kernels/funcs/eigen/CMakeFiles/eigen_function.dir/build.make:314: paddle/phi/kernels/funcs/eigen/CMakeFiles/eigen_function.dir/pad.cu.o] Error 1
1 error detected in the compilation of "/home/dell/code/Paddle/paddle/phi/kernels/funcs/eigen/broadcast.cu".
make[2]: [paddle/phi/kernels/funcs/eigen/CMakeFiles/eigen_function.dir/build.make:230: paddle/phi/kernels/funcs/eigen/CMakeFiles/eigen_function.dir/broadcast.cu.o] Error 1
make[1]: [CMakeFiles/Makefile2:56411: paddle/phi/kernels/funcs/eigen/CMakeFiles/eigen_function.dir/all] Error 2
make[1]: Waiting for unfinished jobs....
[ 10%] Linking CXX static library libscope.a
[ 10%] Built target scope
make: *** [Makefile:136: all] Error 2

Did you run into these problems during your build? If so, how did you solve them?

Tlntin commented 2 years ago

I did not run into these problems. What is your CUDA, cuDNN and NCCL environment? Are they all installed in /usr/local/cuda?
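A quick way to answer that question is to look for the cuDNN and NCCL headers and libraries under the CUDA root. The following sketch assumes the layout produced by the copy commands earlier in this thread (`include/` and `lib64/` under /usr/local/cuda); `missing_cuda_files` is a hypothetical helper, not part of any NVIDIA or Paddle tool.

```python
from pathlib import Path


def missing_cuda_files(cuda_root="/usr/local/cuda"):
    """Return the cuDNN/NCCL headers and libraries not found under cuda_root."""
    root = Path(cuda_root)
    headers = [root / "include" / "cudnn.h", root / "include" / "nccl.h"]
    missing = [str(path) for path in headers if not path.is_file()]
    lib_dir = root / "lib64"
    for stem in ("libcudnn.so", "libnccl.so"):
        # Accept versioned names such as libcudnn.so.8.4.1 as well.
        if not any(lib_dir.glob(stem + "*")):
            missing.append(str(lib_dir / stem))
    return missing


if __name__ == "__main__":
    problems = missing_cuda_files()
    print("all found" if not problems else "missing: " + ", ".join(problems))
```

An empty result only shows that the files are present; it does not prove that their versions match the CUDA toolkit used for the build.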