BR-IDL / PaddleViT

:robot: PaddleViT: State-of-the-art Visual Transformer and MLP Models for PaddlePaddle 2.0+
https://github.com/BR-IDL/PaddleViT
Apache License 2.0

[fixed] Training Error when using larger batch size #25

Closed xperzy closed 3 years ago

xperzy commented 3 years ago

Error: Training fails when a larger batch size is used: SystemError: (Fatal) Operator set_value raises an thrust::system::system_error exception. The exception content is :parallel_for failed: cudaErrorInvalidConfiguration: invalid configuration argument. (at /paddle/paddle/fluid/imperative/tracer.cc:192)

Reason: The cause is explained in the following PaddlePaddle issue: https://github.com/PaddlePaddle/Paddle/issues/33057#issuecomment-847719249

In short, the error is caused by a CUDA Thrust bug, which is ignored in newer CUDA versions.

Solution: Installing the develop version of Paddle fixes the problem. Installation instructions are available here: https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html
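
After reinstalling, it can help to confirm which Paddle build is actually active in the environment. The following is only a minimal sketch: the helper names `describe_paddle_install` and `is_develop_version` are hypothetical, and the check assumes that develop (nightly) builds report a version string starting with "0.0.0", which may not hold for every build.

```python
import importlib


def is_develop_version(version):
    """Guess whether a version string belongs to a develop (nightly) build.

    Assumption: develop builds report a version starting with "0.0.0".
    """
    return version is not None and version.startswith("0.0.0")


def describe_paddle_install():
    """Collect the Paddle version and the CUDA version it was built against."""
    info = {"installed": False, "version": None, "cuda": None, "is_develop": False}
    try:
        paddle = importlib.import_module("paddle")
    except ImportError:
        # Paddle is not installed in this environment; report that and stop.
        return info

    info["installed"] = True
    info["version"] = getattr(paddle, "__version__", None)

    # paddle.version.cuda() reports the CUDA version of the build;
    # it is looked up defensively in case the attribute is missing.
    cuda_fn = getattr(getattr(paddle, "version", None), "cuda", None)
    if callable(cuda_fn):
        info["cuda"] = str(cuda_fn())

    info["is_develop"] = is_develop_version(info["version"])
    return info


if __name__ == "__main__":
    for key, value in describe_paddle_install().items():
        print(f"{key}: {value}")
```

If `is_develop` is False after the reinstall, an older release wheel is probably still being picked up from the environment.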

In detail, the problem is fixed by the following patch: https://github.com/PaddlePaddle/Paddle/pull/33748/files/617e3eda9dfcd76cb6a7ebaa1535340f1023d3f1

xperzy commented 3 years ago

This issue is fixed by installing a newer version of Paddle, so I am closing it.