ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Internal: Failed to enqueue async memset operation: HIP_ERROR_InvalidValue #947

Open lucianolorenti opened 4 years ago

lucianolorenti commented 4 years ago

Hi! I cannot run a minimal example.

Is it because my GPU is not supported?

System information

- MIOpen: miopen-2.3.0
- HIP: the version tagged with ROCm 3.3.0

Describe the current behavior

Every time I call the fit method, it finishes with the following error:

2020-04-29 01:07:16.372850: E tensorflow/stream_executor/stream.cc:5929] Internal: Failed to enqueue async memset operation: HIP_ERROR_InvalidValue
2020-04-29 01:07:16.372894: I tensorflow/stream_executor/stream.cc:317] did not allocate timer: 0x7f5fbfffdf20
2020-04-29 01:07:16.372902: I tensorflow/stream_executor/stream.cc:2186] [stream=0x55613341ada0,impl=0x55613341b2c0] did not enqueue 'start timer': 0x7f5fbfffdf20
2020-04-29 01:07:16.372946: I tensorflow/stream_executor/stream.cc:2198] [stream=0x55613341ada0,impl=0x55613341b2c0] did not enqueue 'stop timer': 0x7f5fbfffdf20
2020-04-29 01:07:16.372967: F tensorflow/stream_executor/gpu/gpu_timer.cc:65] Check failed: start_event_ != nullptr && stop_event_ != nullptr 

Describe the expected behavior

Run without crashing.

Standalone code to reproduce the issue

https://www.tensorflow.org/tutorials/quickstart/beginner?hl=en

ekuznetsov139 commented 4 years ago

Can you post the complete output?

lucianolorenti commented 4 years ago

Sure!

After creating the model:

2020-04-29 20:42:13.635622: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libhip_hcc.so
2020-04-29 20:42:13.736471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]     ROCm AMD GPU ISA: gfx803
coreClock: 1.23GHz coreCount: 32 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 98.35GiB/s

2020-04-29 20:42:13.796267: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library librocblas.so
2020-04-29 20:42:13.849208: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libMIOpen.so
2020-04-29 20:42:13.893434: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library librocfft.so
2020-04-29 20:42:13.910734: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library librocrand.so
2020-04-29 20:42:13.910845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1685] Adding visible gpu devices: 0
2020-04-29 20:42:13.943421: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3601000000 Hz
2020-04-29 20:42:13.943707: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56326ec9c890 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-29 20:42:13.943725: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-29 20:42:13.943869: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]     ROCm AMD GPU ISA: gfx803
coreClock: 1.23GHz coreCount: 32 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 98.35GiB/s
2020-04-29 20:42:13.943916: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library librocblas.so
2020-04-29 20:42:13.943934: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libMIOpen.so
2020-04-29 20:42:13.943949: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library librocfft.so
2020-04-29 20:42:13.943963: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library librocrand.so
2020-04-29 20:42:13.944013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1685] Adding visible gpu devices: 0
2020-04-29 20:42:13.944066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-29 20:42:13.944077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090]      0 
2020-04-29 20:42:13.944084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0:   N 
2020-04-29 20:42:13.944181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1229] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3540 MB memory) -> physical GPU (device: 0, name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590], pci bus id: 0000:01:00.0)
2020-04-29 20:42:14.176568: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56326fd7c4c0 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
2020-04-29 20:42:14.176601: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Ellesmere [Radeon RX 470/480/570/570X/580/580X/590], AMDGPU ISA version: gfx803

After calling fit:


Epoch 1/5
2020-04-29 20:42:49.779373: I tensorflow/core/graph/gpu_fusion_pass.cc:505] ROCm Fusion is enabled.
2020-04-29 20:42:49.794449: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library librocblas.so
2020-04-29 20:42:52.636045: E tensorflow/stream_executor/stream.cc:5929] Internal: Failed to enqueue async memset operation: HIP_ERROR_InvalidValue
2020-04-29 20:42:52.636263: I tensorflow/stream_executor/stream.cc:317] did not allocate timer: 0x7f98ca7faf20
2020-04-29 20:42:52.636348: I tensorflow/stream_executor/stream.cc:2186] [stream=0x56326fcc8000,impl=0x56326fcc8520] did not enqueue 'start timer': 0x7f98ca7faf20
2020-04-29 20:42:52.636479: I tensorflow/stream_executor/stream.cc:2198] [stream=0x56326fcc8000,impl=0x56326fcc8520] did not enqueue 'stop timer': 0x7f98ca7faf20
2020-04-29 20:42:52.636578: F tensorflow/stream_executor/gpu/gpu_timer.cc:65] Check failed: start_event_ != nullptr && stop_event_ != nullptr

ekuznetsov139 commented 4 years ago

Try to run /opt/rocm/bin/rocm-smi and /opt/rocm/bin/rocminfo and post their outputs.

lucianolorenti commented 4 years ago
ekuznetsov139 commented 4 years ago

The setup looks in order. I suspect that the GPU is the problem - it is theoretically supported, but, in practice, newer releases are not tested on gfx803 GPUs and therefore YMMV.

Let's try two more things:

lucianolorenti commented 4 years ago

:( HIP_TRACE_API_14.txt

I know that around July of last year I was able to make TensorFlow work with this GPU on my PC. Perhaps some change in the ROCm stack broke it for me.

jerryyin commented 4 years ago

@ekuznetsov139 Thanks for taking the first look at the issue, I'm assigning the issue to you now. In case you need the machine to reproduce, we can sync offline or talk with @sunway513

ekuznetsov139 commented 4 years ago

I have been able to run the tutorial script without errors on a Radeon RX 480 with ROCm 3.3. It's the same chip except with 36 CUs instead of 32.

You could try to load the docker image rocm/tensorflow:rocm3.3-tf2.1-dev and see if it works there.

Also, what's the exact commit hash of the version of miopen that you compiled?

lucianolorenti commented 4 years ago

Thank you very much for your help.

On my PC I was originally using the following commit for MIOpen: 869a484773f59a79eb53eac9fa9233e52d80b3c3. I have now tried 4c1b0ca4987eb5931d694b1be73137267861c369, the latest one so far, but I got the same HIP_ERROR_InvalidValue error.


These are my devices on the host:

crw-rw----+ 1 root video 238, 0 may 16 13:29 /dev/kfd
total 0
drwxr-xr-x  2 root root         80 may 16 13:29 by-path
crw-rw----+ 1 root video  226,   0 may 16 13:29 card0
crw-rw-rw-  1 root render 226, 128 may 16 13:29 renderD128

Docker image

To run rocminfo with as few errors as possible, I installed kmod, added the render group, and changed the group ownership of the devices. I attach the rocminfo output: rocminfo_docker.txt

rocm-smi

GPU  Temp   AvgPwr  SCLK    MCLK    Fan     Perf  PwrCap  VRAM%  GPU%  
0    47.0c  9.047W  466Mhz  300Mhz  31.76%  auto  105.0W    5%   0%

tensorflow in python

When I run the test code, the GPU is not recognized and TensorFlow falls back to the CPU:

2020-05-16 11:45:37.232898: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libhip_hcc.so
2020-05-16 11:45:37.249434: E tensorflow/stream_executor/rocm/rocm_driver.cc:975] could not retrieve ROCM device count: HIP_ERROR_NoDevice
2020-05-16 11:45:37.253050: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3601000000 Hz
2020-05-16 11:45:37.253289: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3fe5170 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-16 11:45:37.253317: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-05-16 11:45:37.254101: E tensorflow/stream_executor/rocm/rocm_driver.cc:975] could not retrieve ROCM device count: HIP_ERROR_NoDevice

As a non-root user, I got the following error from rocminfo:

ROCk module is loaded
Failed to get user name to check for video group membership
hsa api call failure at: /data/jenkins-workspace/compute-rocm-rel-3.3/rocminfo/rocminfo.cc:1102
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

Haha, I suppose my stack installation is more broken than I initially thought.

ekuznetsov139 commented 4 years ago

Over here, /dev/kfd is crw-rw-rw- 1 root root 241, 0 May 15 17:08 /dev/kfd

According to your rocminfo_docker.txt, rocminfo does not see any GPUs.

The typical command line to launch docker is

sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined

The HSA_STATUS_ERROR_OUT_OF_RESOURCES error often means that the user does not have the right access rights to /dev/kfd.
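
The access-rights point above can be checked from inside the container or on the host. The following is only a rough sketch, not part of the ROCm tooling: the helper name `describe_device_access` is hypothetical, and the device paths in the usage section are taken from the listing earlier in this thread. It reports the mode bits of a device node and whether the current process may read and write it.

```python
import os
import stat


def describe_device_access(path):
    """Report existence, mode string, and read/write access for `path`.

    Hypothetical helper for illustration; it only inspects permissions
    and never opens the device.
    """
    info = {"path": path, "exists": os.path.exists(path)}
    if not info["exists"]:
        info.update(mode=None, readable=False, writable=False)
        return info
    st = os.stat(path)
    # stat.filemode renders the mode like `ls -l`, e.g. "crw-rw----".
    info["mode"] = stat.filemode(st.st_mode)
    # os.access evaluates permissions for the current (real) user and groups.
    info["readable"] = os.access(path, os.R_OK)
    info["writable"] = os.access(path, os.W_OK)
    return info


if __name__ == "__main__":
    # Paths as they appear in the host listing above; adjust if yours differ.
    for dev in ("/dev/kfd", "/dev/dri/card0", "/dev/dri/renderD128"):
        print(describe_device_access(dev))
```

If `/dev/kfd` exists but is reported as not readable or not writable for the user running TensorFlow, group membership or the device ownership is a likely place to look.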

lucianolorenti commented 4 years ago

Hi! Apparently I had a problem with the kernel module. I recompiled it and now TensorFlow works in the docker image rocm/tensorflow:rocm3.3-tf2.1-dev. On the host, the problem still persists.

Is there a way to know which versions of the ROCm stack are compiled into the docker image, so that I can reproduce the same setup on my machine?
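
One possible way to approach this question, sketched under assumptions that are not confirmed anywhere in this thread: the image is Debian/Ubuntu-based so `dpkg-query` is available, and the ROCm install keeps a version file at `/opt/rocm/.info/version`. The function names and the keyword list are hypothetical. Running the same script inside the container and on the host would give two package lists to compare.

```python
import os
import subprocess


def read_rocm_version(rocm_root="/opt/rocm"):
    """Return the text of <rocm_root>/.info/version, or None if unreadable.

    The file location is an assumption about the install layout.
    """
    version_file = os.path.join(rocm_root, ".info", "version")
    try:
        with open(version_file) as fh:
            return fh.read().strip()
    except OSError:
        return None


def filter_packages(dpkg_output, keywords=("rocm", "hip", "miopen", "rocblas")):
    """Keep (package, version) pairs whose package name contains a keyword."""
    rows = []
    for line in dpkg_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and any(k in parts[0].lower() for k in keywords):
            rows.append((parts[0], parts[1]))
    return rows


def installed_rocm_packages():
    """List ROCm-related packages via dpkg-query; empty list if unavailable."""
    try:
        out = subprocess.run(
            ["dpkg-query", "-W", "-f=${Package} ${Version}\n"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return filter_packages(out)


if __name__ == "__main__":
    print("ROCm version file:", read_rocm_version())
    for name, version in installed_rocm_packages():
        print(name, version)
```

Components built from source, such as the MIOpen commits mentioned earlier, would not show up in the package list and would need to be checked separately.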