PaddlePaddle / Paddle

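PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle (飞桨): high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)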
http://www.paddlepaddle.org/
Apache License 2.0
22.32k stars 5.63k forks

Building Paddle on Ubuntu 22.04 (CUDA 11.6 + cuDNN 8.4) #44668

Closed Tlntin closed 2 years ago

Tlntin commented 2 years ago

Why build it yourself

Environment

CMake suite maintained and supported by Kitware (kitware.com/cmake).


- System gcc (this has no real effect on the build; it is shown only for reference)
```bash
$ gcc --version
gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

The result is as follows; cuDNN 8.4.1 is now detected:

    /sbin/ldconfig.real: Path `/usr/lib' given more than once (from :0 and :0)
    libcudnn_ops_train.so.8 -> libcudnn_ops_train.so.8.4.1
    libcudnn_cnn_train.so.8 -> libcudnn_cnn_train.so.8.4.1
    libcudnn_ops_infer.so.8 -> libcudnn_ops_infer.so.8.4.1
    libcudnn_adv_infer.so.8 -> libcudnn_adv_infer.so.8.4.1
    libcudnn.so.8 -> libcudnn.so.8.4.1
    libcudnn_adv_train.so.8 -> libcudnn_adv_train.so.8.4.1
    libcudnn_cnn_infer.so.8 -> libcudnn_cnn_infer.so.8.4.1

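The same check can be scripted. The following is a minimal sketch, assuming the input is text containing symlink lines in the form shown above (`libcudnn.so.8 -> libcudnn.so.8.4.1`); the function name `cudnn_version_from_ldconfig` is hypothetical and not part of any Paddle or NVIDIA tool.

```python
import re
from typing import Optional


def cudnn_version_from_ldconfig(output: str) -> Optional[str]:
    """Return the cuDNN version found in a 'libcudnn.so.N -> libcudnn.so.X.Y.Z' line.

    Only the main libcudnn.so symlink is matched; the libcudnn_*_train and
    libcudnn_*_infer entries are ignored. Returns None when nothing matches.
    """
    match = re.search(
        r"libcudnn\.so\.\d+\s*->\s*libcudnn\.so\.(\d+(?:\.\d+)*)", output
    )
    return match.group(1) if match else None


# Sample text copied from the listing above (assumed input format).
sample = (
    "libcudnn_ops_train.so.8 -> libcudnn_ops_train.so.8.4.1\n"
    "libcudnn.so.8 -> libcudnn.so.8.4.1\n"
)
print(cudnn_version_from_ldconfig(sample))
```

A `None` result would suggest that the cuDNN libraries are not yet visible to the dynamic linker.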

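- Install NCCL (it is only needed for multi-GPU use, but the build still seemed to ask for it even with the multi-GPU option turned off, so it had to be installed; it can only be installed from the archive or from source.)
- Official installation method (archive)
- Download page: https://developer.nvidia.com/nccl/nccl-download
- Choose "O/S agnostic local installer"
- Then extract it and copy the files into the CUDA directory
```bash
tar -xvf nccl_2.13.4-1+cuda11.7_x86_64.txz
cd nccl_2.13.4-1+cuda11.7_x86_64
sudo cp -r include/* /usr/local/cuda/include/
sudo cp -r lib/* /usr/local/cuda/lib64
```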

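To confirm that the copy landed where the build will look, a small check like the one below may help. It is a sketch under two assumptions: that the archive ships `include/nccl.h` and `lib/libnccl.so*`, and that CUDA lives in `/usr/local/cuda` as in this post. The helper name `missing_nccl_files` is hypothetical.

```python
from pathlib import Path
from typing import List


def missing_nccl_files(cuda_root: str = "/usr/local/cuda") -> List[str]:
    """List the NCCL files that are absent under cuda_root (empty list means all found)."""
    root = Path(cuda_root)
    missing = []
    # Header copied by: sudo cp -r include/* /usr/local/cuda/include/
    if not (root / "include" / "nccl.h").is_file():
        missing.append("include/nccl.h")
    # Shared library copied by: sudo cp -r lib/* /usr/local/cuda/lib64
    if not any((root / "lib64").glob("libnccl.so*")):
        missing.append("lib64/libnccl.so*")
    return missing


if __name__ == "__main__":
    print(missing_nccl_files() or "NCCL header and library found")
```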
Preparation

    export PATH=${PYTHON_LIBRARY}:$PATH
    # Locate the python3.9 include directory of the active environment
    export PYTHON_INCLUDE_DIRS=$(find "$(dirname "$(dirname "$(which python3)")")/include" -name "python3.9")
    # Keep the python3 path that belongs to a conda env (its path contains "env")
    export PYTHON3_EXECUTABLE=$(for dirname in $(whereis python3); do echo "$dirname" | grep env; done)

- Print the Python variables
```bash
echo PYTHON_LIBRARY=${PYTHON_LIBRARY}
echo PYTHON_INCLUDE_DIRS=${PYTHON_INCLUDE_DIRS}
echo PYTHON3_EXECUTABLE=${PYTHON3_EXECUTABLE}

# Output:
PYTHON_LIBRARY=/home/tlntin/anaconda3/envs/paddle/lib/libpython3.so
PYTHON_INCLUDE_DIRS=/home/tlntin/anaconda3/envs/paddle/include/python3.9
PYTHON3_EXECUTABLE=/home/tlntin/anaconda3/envs/paddle/bin/python3
```

    export PYTHON3_NUMPY_INCLUDE_DIRS=$(python -c "import numpy as np; print(np.__path__[0] + '/core/include')")
    echo PYTHON3_NUMPY_INCLUDE_DIRS=$PYTHON3_NUMPY_INCLUDE_DIRS
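The interpreter can also report most of these locations itself, which avoids searching the filesystem. This is a sketch, not the author's method: it uses the standard `sysconfig` module and NumPy's `get_include()`, and it reports the library directory (under the made-up key `PYTHON_LIBDIR`) rather than the `libpython3.so` file. The function name `python_build_paths` is hypothetical.

```python
import sys
import sysconfig


def python_build_paths() -> dict:
    """Collect the interpreter paths that the shell snippets above look up by hand."""
    paths = {
        "PYTHON3_EXECUTABLE": sys.executable,
        "PYTHON_INCLUDE_DIRS": sysconfig.get_paths()["include"],
        # Directory expected to hold libpython; may be None on some platforms.
        "PYTHON_LIBDIR": sysconfig.get_config_var("LIBDIR"),
    }
    try:
        import numpy as np

        paths["PYTHON3_NUMPY_INCLUDE_DIRS"] = np.get_include()
    except ImportError:
        # NumPy is not installed in this environment.
        paths["PYTHON3_NUMPY_INCLUDE_DIRS"] = None
    return paths


for name, value in python_build_paths().items():
    print(f"{name}={value}")
```

Run it with the Python of the target environment, since every value is taken from the interpreter that executes the script.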

- Install protobuf
```bash
pip install protobuf==3.20.0
```

    $ g++ --version
    g++ (conda-forge gcc 8.5.0-16) 8.5.0
    Copyright (C) 2018 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


- Install yaml (the build reported that the yaml module could not be found, so install it)
```bash
pip install pyyaml
```

Build process

  1. Clone the source and switch to the latest release branch

    git clone https://github.com/PaddlePaddle/Paddle.git
    cd Paddle
    git checkout release/2.3
  2. Create and enter the build directory

    mkdir build && cd build
  3. Set the target Paddle version

    export PADDLE_VERSION="2.3.1"
  4. Configure the build (TensorRT not enabled)

    cmake  .. \
    -DWITH_CONTRIB=OFF \
    -DWITH_MKL=ON \
    -DWITH_MKLDNN=ON  \
    -DWITH_TESTING=OFF \
    -DCMAKE_BUILD_TYPE=Release \
    -DWITH_INFERENCE_API_TEST=OFF \
    -DWITH_GPU=ON \
    -DCUDNN_ROOT=/usr/local/cuda \
    -DON_INFER=ON \
    -DWITH_PYTHON=ON \
    -D PYTHON3_EXECUTABLE=${PYTHON3_EXECUTABLE} \
    -D PYTHON3_INCLUDE_DIR=${PYTHON_INCLUDE_DIRS} \
    -D PYTHON3_LIBRARY=${PYTHON_LIBRARY} \
    -D PYTHON3_NUMPY_INCLUDE_DIRS=${PYTHON3_NUMPY_INCLUDE_DIRS}  \
    -D WITH_GPU=ON \
    -D WITH_TENSORRT=OFF
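  5. Run the build. Note that this step needs unrestricted access to GitHub (a proxy may be required), because make fetches third-party library sources from GitHub. Expect it to take roughly 1-2 hours.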

    make -j10
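    • The build failed partway through with "too many open files"; the open-file limit, which defaults to 1024, has to be raised
      # Before the change: 1024
      $ ulimit -Sn
      1024
      # Raise it to 9192
      ulimit -n 9192
      # After the change
      $ ulimit -Sn
      9192
    • Then resume the build; the earlier progress is kept
      make -j10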
  6. Get the wheel. It is in the python/dist directory under the build directory; the file details are:

    cd python/dist
    ls -lh
    .rw-r--r-- ubuntu ubuntu 167 MB Wed Jul 27 17:33:44 2022  paddlepaddle_gpu-0.0.0-cp39-cp39-linux_x86_64.whl
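  7. Install the wheel. In theory it works on any Ubuntu 22.04/20.04 machine with a 30-series GPU and the same CUDA/cuDNN/NCCL versions as mine, provided cuDNN and NCCL were installed from archives. The version shows as 0.0.0 because self-built packages are labeled that way by default.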

    pip install paddlepaddle_gpu-0.0.0-cp39-cp39-linux_x86_64.whl
  8. Verify the installation

    $ python3
    Python 3.9.12 (main, Jun  1 2022, 11:38:51)
    [GCC 7.5.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import paddle
    >>> paddle.utils.run_check()
    Running verify PaddlePaddle program ...
    W0727 17:46:03.775210 12918 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.6
    W0727 17:46:03.796252 12918 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
    PaddlePaddle works well on 1 GPU.
    I0727 17:46:06.006351 12918 parallel_executor.cc:486] Cross op memory reuse strategy is enabled, when build_strategy.memory_optimize = True or garbage collection strategy is disabled, which is not recommended
    PaddlePaddle works well on 1 GPUs.
    PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
  9. Run the official test code; it also appears to work, and training on the GPU runs normally.

    $ python3 test_paddle.py
    数据集标签共有10种, 分别为:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    W0727 17:54:46.313586 13232 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.6
    W0727 17:54:46.325191 13232 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
    The loss value printed in the log is the current step, and the metric is the average value of previous steps.
    Epoch 1/5
    step 938/938 [==============================] - loss: 0.1149 - acc: 0.9398 - 14ms/step
    Epoch 2/5
    step 938/938 [==============================] - loss: 0.0688 - acc: 0.9760 - 13ms/step
    Epoch 3/5
    step 938/938 [==============================] - loss: 0.0354 - acc: 0.9809 - 11ms/step
    Epoch 4/5
    step 938/938 [==============================] - loss: 0.0052 - acc: 0.9833 - 13ms/step
    Epoch 5/5
    step 938/938 [==============================] - loss: 0.0110 - acc: 0.9855 - 12ms/step
    • The code is as follows:

      import paddle

      # Use the GPU
      paddle.device.set_device("gpu:0")

      from paddle.vision.transforms import Normalize
      from paddle.vision.datasets import MNIST
      from paddle.vision.models import LeNet
      import numpy as np

      # Download the datasets
      transform = Normalize(mean=[127.5], std=[127.5], data_format="CHW")
      train_dataset = MNIST(mode="train", transform=transform)
      valid_dataset = MNIST(mode="test", transform=transform)

      # Collect the dataset's label classes
      y_list = [da[1][0] for da in train_dataset]
      num_list = list(set(y_list))
      num_classes = len(num_list)
      # Prints, in Chinese: "The dataset has N label classes: [...]"
      print(f"数据集标签共有{num_classes}种, 分别为:{num_list}")

      # Build the model
      pre_model = LeNet(num_classes=num_classes)
      model = paddle.Model(pre_model)
      adam = paddle.optimizer.Adam(learning_rate=1e-3, parameters=model.parameters())
      model.prepare(adam, loss=paddle.nn.CrossEntropyLoss(), metrics=paddle.metric.Accuracy())

      # Train the model
      model.fit(train_data=train_dataset, batch_size=64, verbose=1, epochs=5)

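The open-file limit from step 5 can also be raised from Python when the build is driven by a script. The sketch below uses the standard `resource` module (Unix only) and reuses the 9192 value from the post; the function name `raise_open_file_limit` is hypothetical. It only affects the current process and the children it starts, not the shell that launched it.

```python
import resource


def raise_open_file_limit(target: int = 9192) -> int:
    """Raise the soft RLIMIT_NOFILE to target (capped by the hard limit).

    Returns the soft limit in effect afterwards. The limit is never lowered.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot exceed the hard limit.
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft open-file limit:", raise_open_file_limit())
```

A wrapper script could call this first and then start `make -j10` as a child process (for example through `subprocess.run`), so that the build inherits the higher limit.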

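Whether the wheel from steps 6 and 7 installs on another machine depends partly on the tags in its filename. The sketch below splits a wheel filename according to the standard naming convention (distribution, version, Python tag, ABI tag, platform tag); it assumes no optional build tag is present, and the names `WheelTags` and `parse_wheel_filename` are hypothetical.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split 'name-version-python-abi-platform.whl' into its five fields."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        # A sixth field would be the optional build tag, not handled here.
        raise ValueError("expected exactly five dash-separated fields")
    return WheelTags(*parts)


tags = parse_wheel_filename("paddlepaddle_gpu-0.0.0-cp39-cp39-linux_x86_64.whl")
print(tags.version, tags.python_tag, tags.platform_tag)
```

For the wheel built in this post, the fields show the 0.0.0 version and that the package targets CPython 3.9 on x86_64 Linux. The CUDA, cuDNN and NCCL requirements are not encoded in the filename and still have to match by hand.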
- The GPU is being used normally
```bash
$ nvidia-smi
Wed Jul 27 17:55:58 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57       Driver Version: 516.59       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 36%   34C    P2   120W / 370W |   3228MiB / 24576MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     13353      C   /python3.9                      N/A      |
+-----------------------------------------------------------------------------+
```

paddle-bot[bot] commented 2 years ago

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your questions as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment and version details, and error messages. You may also look for answers in the official API documentation, the FAQ, past GitHub Issues and the AI community. Have a nice day!

frankwhzhang commented 2 years ago

Hello, and thank you for sharing. Is there anything else that needs follow-up?

Tlntin commented 2 years ago

One addition: to have the version number recognized correctly, run export PADDLE_VERSION="2.3.1" before the cmake command. Nothing else.

tikboaHIT commented 2 years ago

Thank you very much for the installation guide. After following your steps one by one, I ran into the following errors while compiling .cu code:

    1 error detected in the compilation of "/home/dell/code/Paddle/paddle/phi/kernels/funcs/eigen/reverse.cu".
    1 error detected in the compilation of "/home/dell/code/Paddle/paddle/phi/kernels/funcs/eigen/pad.cu".
    make[2]: [paddle/phi/kernels/funcs/eigen/CMakeFiles/eigen_function.dir/build.make:314:paddle/phi/kernels/funcs/eigen/CMakeFiles/eigen_function.dir/pad.cu.o] Error 1
    1 error detected in the compilation of "/home/dell/code/Paddle/paddle/phi/kernels/funcs/eigen/broadcast.cu".
    make[2]: [paddle/phi/kernels/funcs/eigen/CMakeFiles/eigen_function.dir/build.make:230:paddle/phi/kernels/funcs/eigen/CMakeFiles/eigen_function.dir/broadcast.cu.o] Error 1
    make[1]: [CMakeFiles/Makefile2:56411:paddle/phi/kernels/funcs/eigen/CMakeFiles/eigen_function.dir/all] Error 2
    make[1]: Waiting for unfinished jobs....
    [ 10%] Linking CXX static library libscope.a
    [ 10%] Built target scope
    make: *** [Makefile:136:all] Error 2

Did you run into these problems during your build? If so, how did you solve them?

Tlntin commented 2 years ago

I didn't run into these problems. What are your CUDA, cuDNN and NCCL versions? Are they all installed under /usr/local/cuda?

fmscole commented 2 years ago

After going to a lot of trouble to upgrade from 20.04 to 22.04, I found that Paddle no longer works. Sigh...

dingjiaweiww commented 2 years ago

The PaddlePaddle develop version already supports Ubuntu 22.04; official support will ship with the 2.4 release.

nabilragab commented 2 years ago

DISTRIB_RELEASE=22.04 DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

Thanks for shortening the journey, Build time 1h 35m

zgpnuaa commented 2 years ago

Will Paddle 2.4 support Ubuntu 22.04? When is it expected to be released?

monkeycc commented 3 months ago

LOL, "Ubuntu 24.04 LTS" turns out to have this problem too.