apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0
20.77k stars 6.79k forks source link

Windows CI CUDA Intermittent error C2993 #17935

Open ChaiBapchya opened 4 years ago

ChaiBapchya commented 4 years ago

Description

Intermittent failure seen on windows-gpu compilation phase (WIN_GPU/WIN_GPU_MKLDNN)

Discovered in this PR : https://github.com/apache/incubator-mxnet/pull/17808

Related to https://github.com/pytorch/pytorch/issues/25393

Error Message

It intermittently gives the error :

C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2993: 'T': illegal type for non-type template parameter '__formal

Errors:

[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2993: 'T': illegal type for non-type template parameter '__formal'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): note: see reference to class template instantiation 'thrust::detail::allocator_traits_detail::has_value_type<T>' being compiled
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2065: 'U1': undeclared identifier
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2923: 'std::_Select<__formal>::_Apply': 'U1' is not a valid template type argument for parameter '<unnamed-symbol>'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2144: syntax error: 'unknown-type' should be preceded by ')'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2144: syntax error: 'unknown-type' should be preceded by ';'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2238: unexpected token(s) preceding ';'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2059: syntax error: ')'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2988: unrecognizable template declaration/definition
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2059: syntax error: '<end Parse>'

Entire stack trace: http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-17808/runs/15/nodes/39/log/?start=0

To Reproduce

Build using Windows AMI and run Clone repo & py -3 ci/build_windows.py -f WIN_GPU

What have you tried to solve it?

  1. Use cuda 10.2 instead of 9.2
  2. Updated VS2019
  3. Add cmake flag : /Zc:__cplusplus

Currently, what is found to work: Introduced max retries = 5

ChaiBapchya commented 4 years ago

@mxnet-label-bot add [ci, windows]

leezu commented 4 years ago

Created an upstream issue: https://github.com/thrust/thrust/issues/1090

leezu commented 4 years ago

@vexilligera did you test if the error also occurs on more recent versions of thrust? I suggest we try installing thrust 1.9.8 version on Windows CI, which is the version that'll be shipped with Cuda 11

We do that on Ubuntu CI already

https://github.com/apache/incubator-mxnet/blob/76fa58373636c57fee1e4e6cd7960723b39f455f/ci/docker/Dockerfile.build.ubuntu#L144-L150

leezu commented 4 years ago

There is another suggested fix at https://github.com/pytorch/pytorch/issues/25393#issuecomment-619547577

cc @vexilligera

leezu commented 4 years ago

Seems to be a nvcc bug https://github.com/thrust/thrust/issues/1090#issuecomment-626080333

alliepiper commented 4 years ago

This is indeed an nvcc bug. There is no known workaround at the moment, but the next release of the CUDA toolkit will contain a fix.

Ref thrust/thrust#1090.