PaddlePaddle / Paddle

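PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle 『飞桨』: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning & machine learning)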
http://www.paddlepaddle.org/
Apache License 2.0
22.29k stars 5.61k forks

Unable to compile paddle 3.0.0.beta0, Assertion `idx < size()' failed. #68763

Open xuesu opened 1 month ago

xuesu commented 1 month ago

Issue Description

🔎 Search before asking

🐛 Bug (Issue Description)

I got the following error:

eager_generator: /home/iris/CDeepFuzz/Paddle/paddle/utils/small_vector.h:343: T& paddle::small_vector_template_common<T, <template-parameter-1-2> >::at(paddle::small_vector_template_common<T, <template-parameter-1-2> >::size_type) [with T = phi::TensorArgDef; <template-parameter-1-2> = void; paddle::small_vector_template_common<T, <template-parameter-1-2> >::reference = phi::TensorArgDef&; paddle::small_vector_template_common<T, <template-parameter-1-2> >::size_type = long unsigned int]: Assertion `idx < size()' failed.
Subprocess aborted
gmake[2]: *** [paddle/fluid/eager/auto_code_generator/CMakeFiles/legacy_eager_codegen.dir/build.make:70: paddle/fluid/eager/auto_code_generator/CMakeFiles/legacy_eager_codegen] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:66694: paddle/fluid/eager/auto_code_generator/CMakeFiles/legacy_eager_codegen.dir/all] Error 2

I added some print logging to at() in paddle/utils/small_vector.h:

  reference at(size_type idx) {
    std::string prompt = "LALALA: ";
    prompt += std::to_string(idx) + "," + std::to_string(size()); 
    std::cout << prompt << std::endl;
    assert(idx < size());
    return begin()[idx];
  }

I got:

LALALA: 1, 1

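So idx (1) is equal to size() (1), which violates the idx < size() assertion: the container holds one element, so the only valid index is 0.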
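
The failing condition can be reproduced in isolation. The sketch below is not Paddle's small_vector; it uses std::vector, whose at() throws std::out_of_range instead of asserting, and two hypothetical helpers (index_is_valid, at_throws) introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: mirrors the `idx < size()` condition from the assertion.
bool index_is_valid(const std::vector<int>& v, std::size_t idx) {
  return idx < v.size();
}

// Returns true if v.at(idx) throws std::out_of_range, false if the access succeeds.
bool at_throws(const std::vector<int>& v, std::size_t idx) {
  try {
    (void)v.at(idx);
    return false;
  } catch (const std::out_of_range&) {
    return true;
  }
}
```

With a one-element vector, index_is_valid(v, 0) is true and index_is_valid(v, 1) is false, which matches the idx = 1, size() = 1 values printed above.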

I then added a backtrace to the same function:

@@ -333,6 +341,12 @@ class small_vector_template_common
   }

   reference at(size_type idx) {
+    if(idx == size()){
+      void *buffer[100];
+      int nptrs = backtrace(buffer, 100);  // Capture up to 100 frames
+      std::cerr << "Stack trace:\n";
+      backtrace_symbols_fd(buffer, nptrs, STDERR_FILENO);  // Print the stack trace
+    }
     assert(idx < size());
     return begin()

I got:

Stack trace:
~/Paddle/build/paddle/fluid/eager/auto_code_generator/eager_generator(_ZN6paddle28small_vector_template_commonIN3phi12TensorArgDefEvE2atEm+0x5d)[0x5aaefee7d309]
~/Paddle/build/paddle/fluid/eager/auto_code_generator/eager_generator(_ZN3phi6Kernel7InputAtEm+0x36)[0x5aaefee75700]
~/Paddle/build/paddle/phi/libphi_kernel_gpu.so(+0x4249c8c)[0x74579bc49c8c]
~/Paddle/build/paddle/fluid/eager/auto_code_generator/eager_generator(_ZN3phi15KernelRegistrar15ConstructKernelENS_7RegTypeEPKcS3_N6common10DataLayoutENS_8DataTypeEPFvRKNS_9KernelKeyEPNS_13KernelArgsDefEEPFvS9_PNS_6KernelEESt8functionIFvPNS_13KernelContextEEEPv+0x1ca)[0x5aaefee3da7a]
~/Paddle/build/paddle/fluid/eager/auto_code_generator/eager_generator(_ZN3phi15KernelRegistrarC1ENS_7RegTypeEPKcS3_N6common10DataLayoutENS_8DataTypeEPFvRKNS_9KernelKeyEPNS_13KernelArgsDefEEPFvS9_PNS_6KernelEESt8functionIFvPNS_13KernelContextEEEPv+0x9f)[0x5aaefee75839]
~/Paddle/build/paddle/phi/libphi_kernel_gpu.so(+0x424ab15)[0x74579bc4ab15]
~/Paddle/build/paddle/phi/libphi_kernel_gpu.so(+0x424addd)[0x74579bc4addd]
/lib64/ld-linux-x86-64.so.2(+0x647e)[0x7457cee2947e]
/lib64/ld-linux-x86-64.so.2(+0x6568)[0x7457cee29568]
/lib64/ld-linux-x86-64.so.2(+0x202ca)[0x7457cee432ca]
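
The mangled symbols in the trace are easier to read once demangled. The sketch below assumes a GCC or Clang toolchain that provides abi::__cxa_demangle in <cxxabi.h>; the helper name demangle is hypothetical. Running the c++filt tool on the symbols should give equivalent output.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangle an Itanium C++ ABI symbol name.
// Returns the input unchanged if demangling fails.
std::string demangle(const char* mangled) {
  int status = 0;
  char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status != 0 || out == nullptr) return mangled;
  std::string result(out);
  std::free(out);  // the buffer returned by __cxa_demangle is malloc-allocated
  return result;
}
```

For example, demangle("_ZN3phi6Kernel7InputAtEm") yields phi::Kernel::InputAt(unsigned long). Read this way, the trace shows at() being called from phi::Kernel::InputAt, which is reached from the phi::KernelRegistrar constructor via code in libphi_kernel_gpu.so while the dynamic loader (ld-linux-x86-64.so.2) is still running.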

I wonder if this is because all source files under the folder paddle/fluid/eager/api/generated/fluid_generated/forwards/ (e.g. dygraph_forward_functions3.cc) are empty. However, this function (https://github.com/jiaoxuewu/PaddleBox/blob/7552ba29f6b729f3192b4747283770b254433c8b/paddle/fluid/eager/auto_code_generator/generate_file_structures.py#L98) suggests that those files are expected to be empty: GenerateFileStructureForIntermediateDygraph....

Sorry for writing in English...

🌰 Minimal Reproducible Example

cmake ..   -DWITH_GPU=ON   -DWITH_TESTING=ON   -DWITH_DISTRIBUTE=ON   -DCMAKE_BUILD_TYPE=Debug   -DWITH_MKL=ON   -DWITH_PYTHON=ON -DCMAKE_C_COMPILER=clang  -DCMAKE_CXX_COMPILER=clang++
cmake --build . -j 1

or

cd ~/Paddle/build/paddle/fluid/eager/auto_code_generator &&  ~/Paddle/build/paddle/fluid/eager/auto_code_generator/eager_generator ~/Paddle/paddle/fluid/eager/api/generated/fluid_generated 8

Version & Environment Information

🏃‍♂️ Environment

OS: ubuntu 22.04
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 17.0.6 (https://github.com/llvm/llvm-project.git 6009708b4367171ccdbf4b5905cb6a803753fe18)
CMake version: version 3.22.1
Libc version: glibc 2.35
Python version: 3.10.15

CUDA version: 12.4.131 Build cuda_12.4.r12.4/compiler.34097967_0
cuDNN version: 9.4.0
Nvidia driver version: 560.35.03
Nvidia driver List: GPU 0: NVIDIA GeForce RTX 4090
GCC: gcc 11
Clang: 17.0.6 (tried both GCC and Clang)
Memory: 64GB
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i9-14900KF
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 1
CPU max MHz: 6000.0000
CPU min MHz: 800.0000
BogoMIPS: 6374.40

risemeup1 commented 1 month ago

Single-threaded Paddle builds (make -j1) have always had problems and fail to compile.

risemeup1 commented 1 month ago

We are also trying to reproduce this locally.

xuesu commented 1 month ago

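Thank you very much! Of all the comparable libraries I have dealt with, your team has been the quickest to respond. In fact, -j50 fails with the same error.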

xuesu commented 1 month ago

Sorry to ask, but is TensorArgDef OutputAt(size_t idx) { return args_def().input_defs()[idx]; } written this way on purpose? The other header files I have looked at do not do this... https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/capi/include/wrapper_base.h line 553

xuesu commented 1 month ago

The kernel that triggers the problem is:

PD_REGISTER_KERNEL(eigvalsh,  // cuda_only
                   GPU,
                   ALL_LAYOUT,
                   phi::EigvalshKernel,
                   float,
                   double,
                   phi::dtype::complex<float>,
                   phi::dtype::complex<double>) {
  kernel->InputAt(1).SetDataType(phi::dtype::ToReal(kernel_key.dtype()));

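There is only one input here, yet the code sets the data type of input index 1 (that is, the second input) to REAL. So which of the following is intended?

  1. The kernel is required to have at least 2 inputs
  2. The data type of input 0 should be REAL (my guess)
    • If this matches np.linalg.eigvalsh, there should be only one input. It is also possible that the intent is to convert an integer matrix into a floating-point matrix
    • forward : eigvalsh (Tensor x, str uplo = "L", bool is_test = false) -> Tensor(eigenvalues), Tensor(eigenvectors)
    • However, many CPU and GPU kernels seem to take only a single x as input and still call kernel->InputAt(1).SetDataType(phi::dtype::ToReal(kernel_key.dtype()));
  3. The corresponding input is forced to REAL under some special semantics
  4. The semantics of at have changed
  5. Is it a CUDA 12.4 incompatibility? But the failure still happens inside kernel registration?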
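
To make the index question concrete, here is a minimal sketch using hypothetical stand-in types (MockKernel, MockDataType, MockToReal). These are not Paddle's real classes, and the sketch does not settle which of the options above is the intended behavior. It only models a kernel with a single input definition, as the forward signature suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; these are NOT Paddle's real classes.
enum class MockDataType { kFloat32, kFloat64, kComplex64, kComplex128 };

// Assumed mapping in the spirit of phi::dtype::ToReal:
// a complex type maps to its matching real type, real types are unchanged.
MockDataType MockToReal(MockDataType t) {
  switch (t) {
    case MockDataType::kComplex64:
      return MockDataType::kFloat32;
    case MockDataType::kComplex128:
      return MockDataType::kFloat64;
    default:
      return t;
  }
}

struct MockKernel {
  std::vector<MockDataType> input_defs;  // one entry per tensor input

  // Reports an out-of-range index by returning false instead of asserting.
  bool TrySetInputDataType(std::size_t idx, MockDataType t) {
    if (idx >= input_defs.size()) return false;
    input_defs[idx] = t;
    return true;
  }
};
```

With a single input definition, TrySetInputDataType(1, ...) returns false, while TrySetInputDataType(0, ...) succeeds and stores the real-valued type.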

xuesu commented 1 month ago

>python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
>python -m unittest test_eigvalsh_op.py
Illegal instruction (core dumped)

xuesu commented 1 month ago

I turned off -DWITH_TESTING and the error is unchanged. Separately, I cannot build the library with -DWITH_TESTING enabled. After changing the call to kernel->InputAt(0).SetDataType(phi::dtype::ToReal(kernel_key.dtype())); the build succeeded.