Open anirudh2290 opened 5 years ago
Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Cuda, Installation, Build
Do you also see those errors when testing an operator that does not have issues?
I tested with broadcast, countsketch, and embedding. All three ops failed with this error:
Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaFuncSetAttribute.
With CUDA 10.1, broadcast and embedding have no issues; countsketch had an out-of-bounds read, which is specific to that operator.
With CUDA 10, however, the cuda-memcheck "invalid device function" error occurred for all three ops I tested.
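When comparing operators or CUDA versions like this, it can help to tally the cuda-memcheck API errors by type instead of reading the raw log. The following is only a minimal sketch: the helper name tally_api_errors is hypothetical, and the line format it matches is assumed from the single error line quoted above, so the pattern may need adjusting for other kinds of memcheck reports (for example out-of-bounds reads).

```python
import re
from collections import Counter

# Assumed line shape, taken from the error quoted in this thread:
#   Program hit cudaErrorInvalidDeviceFunction (error 8) due to
#   "invalid device function" on CUDA API call to cudaFuncSetAttribute.
API_ERROR = re.compile(
    r'Program hit (?P<name>\w+) \(error (?P<code>\d+)\) '
    r'due to "(?P<reason>[^"]*)" on CUDA API call to (?P<call>\w+)'
)


def tally_api_errors(log_text):
    """Count API-error lines in captured memcheck output.

    Returns a Counter keyed by (error name, CUDA API call).
    Lines that do not match the assumed format are ignored.
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = API_ERROR.search(line)
        if match:
            counts[(match.group("name"), match.group("call"))] += 1
    return counts
```

One possible use is to save the memcheck output of each operator test to a file, pass the file contents to tally_api_errors, and compare the resulting counts between the CUDA 10.0 and CUDA 10.1 builds.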
@mxnet-label-bot add [Cuda]
This was encountered during work on the PR: https://github.com/apache/incubator-mxnet/pull/15118. This is also related to https://github.com/apache/incubator-mxnet/issues/10988.
There are a lot of cuda-memcheck failures when MXNet is built with CUDA-10.0 that I don't see with CUDA-9.2.
On CUDA-9.2:
On CUDA-10.0:
cuda-memcheck output: more than 1000 errors
When I change to CUDA 10.1, these errors go away. Note that I have only observed them when building with make and DEV=1 (in particular with the nvcc flag --Werror cross-execution-space-call). I think we should also update the centos7 docker image to run on CUDA 10.1.
EDIT: I still see issues for countsketch on 10.1 when run with memcheck, but these seem to be addressable issues within the operator. This is different from 10.0, where multiple operators are impacted and the errors seem difficult to address.
@marcoabreu @stu1130