ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Memory Access Faults during training #414

Open twuebi opened 5 years ago

twuebi commented 5 years ago


System information

Describe the current behavior

I'm experiencing memory access faults like this:

Memory access fault by GPU node-1 (Agent handle: 0x14e4100) on address 0x7f8a068cb000. Reason: Page not present or supervisor privilege.

in various models. They usually happen during the first 20-45 minutes of training. I am seeing them across a variety of environments (upstream kernel 5.0 without rock-dkms, 4.15.0-47-generic with rock-dkms, the #316 docker, and the rocm2.3-tf1.13-python3 docker). Because of these errors I have not been able to finish training a network yet: runs either crash with a shape-related error (#325) or fail with the memory access fault.

Describe the expected behavior

No memory errors.

Code to reproduce the issue

This file contains the minimal code I could replicate the issue with. In my experience it takes up to 45 minutes until the error occurs.

Other info / logs

I am attempting to reproduce the issue with

export HCC_SERIALIZE_KERNEL=0x3
export HCC_SERIALIZE_COPY=0x3
export HIP_TRACE_API=0x2

as per #302, but I have not observed another instance yet, ~as the flags seem to be slowing things down quite a lot~. I am currently at step 10k with no crash; without the flags the crash happened at step 8k.
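For reference, a minimal sketch of setting these flags from inside a script instead of the shell (an assumption here is that they need to be in the environment before TensorFlow, and with it the HCC runtime, is loaded):

```python
import os

# Serialization/trace flags from #302; set them before TensorFlow (and the
# HCC runtime it links against) is imported, or they may not be picked up.
os.environ["HCC_SERIALIZE_KERNEL"] = "0x3"
os.environ["HCC_SERIALIZE_COPY"] = "0x3"
os.environ["HIP_TRACE_API"] = "0x2"

import tensorflow as tf  # imported only after the environment is prepared
```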

Following #394 I also ran the HIP tests, where all but the native fp16 test pass.

rocm-dev.txt rocm_info.txt

twuebi commented 5 years ago

With

export HCC_SERIALIZE_KERNEL=0x3
export HCC_SERIALIZE_COPY=0x3

I could not observe any crashes. Without them, however, the program failed in two ways:

  1. a memory access fault, the last op being:

<<hip-api pid:209 tid:4.1526 209 4.1526 hipLaunchKernel '_ZN10tensorflow14scatter_op_gpu21ScatterOpCustomKernelIfiLNS_10scatter_op8UpdateOpE2EEEvPT_PKS4_PKT0_S8_S8S8' gridDim:{18432,1,1} groupDim:{1024,1,1} sharedMem:+0 stream:0.1 @3416916724526

which is also the final op if the graph is only executed once by removing the while loop.

  2. The program stops producing output and does not react to Ctrl+C. Its last sign of life is the trace of the scatter op from above (a larger portion of a trace is in this file). After freezing, one thread keeps running at 100% CPU and rocm-smi shows 100% GPU utilization while the card draws only a fraction of the power (~50W vs. 60-250W while running).

I attached gdb to the main process and to the 100% CPU thread; the output of the bt command is in the following files:

main-process-gdb-bt thread-gdb-bt

The thread returns to 100% CPU after detaching; reattaching and running bt again produces the same output.


I've monitored the fan speed & temperature in rocm-smi for a few launches that all ended in a freeze:

The fan level stays at around 20% the whole time, leading to temperatures between 100 and 110°C after a while and ultimately to a freeze, which drops the temperature back to 60-70°C.

Fixing the fan level at 75% leads to a sustained temperature of around 70°C, and the program runs for more than an hour without a freeze. Resetting the fans to driver control with rocm-smi --resetfans drops the fan speed step by step down to 20.78%, raising the temperature to 100-110°C. After a few minutes at those temperatures the program becomes unresponsive.

Setting the SERIALIZE flags reduces GPU utilization (from a constant 100% to 70-100%, with drops down to 40%), which in turn leads to a much slower temperature increase and a lower peak (90-96°C).

@sebpuetz experiences the same fan behaviour (25% at 105°C). Fixing the fans to a higher speed, however, does not fix #325.


Could the freezes, at least, be caused by a driver issue where the fan speeds are not adjusted?

sunway513 commented 5 years ago

Hi @twuebi, could you try to lower the GPU sclk and see if the issue is reproducible?

/opt/rocm/bin/rocm-smi --setsclk 5

Besides, please provide the VBIOS version with the following command:

/opt/rocm/bin/rocm-smi -v

sunway513 commented 5 years ago

BTW, please help provide the dmesg log if you see the system hanging again. That can provide more hints on the issue you are facing right now.

twuebi commented 5 years ago

> Hi @twuebi, could you try to lower the GPU sclk and see if the issue is reproducible?

I'll give it a shot; I'd expect it to work, but I wouldn't see it as a satisfactory workaround.

So far, my observation has been that the fans do not adjust to the GPU temperature, leading to temperatures around 110°C, which is the range where the crashes occur. Fixing the fan speed at 75% or setting the serialize flags both lead to lower GPU temperatures, and in neither case could I observe crashes. From that experience, my guess is that lowering the GPU sclk will work as long as the fans at 20% can keep up with the heat produced at the lower sclk level.

Interestingly, the fans ramp up when running the benchmarks at https://github.com/tensorflow/benchmarks.

rocm-smi -v

GPU[0] : VBIOS version: 113-D3600200-105

> BTW, please help provide the dmesg log if you see the system hanging again. That can provide more hints on the issue you are facing right now.

Will do the next time I observe it. To prevent a misunderstanding: the system does not hang; the program becomes unresponsive and can only be terminated via kill.

twuebi commented 5 years ago

Setting sclk to 5 ran for >1h without issues at temperature ~80-90°C.

Resetting clocks to auto via rocm-smi -r led to an unresponsive program within 5 minutes while raising the temperature to 105-110°C.

The fan speed sat at 21.96% throughout the whole run.

dmesg

sunway513 commented 5 years ago

Thanks @twuebi, I agree the fan speed should increase as the temperature climbs into the 105-110°C range. Since you are on the 105 VBIOS, could you upgrade to the latest 106 version and confirm whether the issue is still there: https://www.amd.com/en/support/radeonvii-vbios-eula

sebpuetz commented 5 years ago

Fwiw, I'm getting the same fan behaviour on 106 VBIOS, so I wouldn't expect the BIOS upgrade to change that behaviour.

sunway513 commented 5 years ago

@sebpuetz , thanks for the heads up. @jlgreathouse , could you help comment on the fan issue and the path forward to resolve it?

twuebi commented 5 years ago

Upgraded the VBIOS to 106. The fan behavior persists.

jlgreathouse commented 5 years ago

Please see my post here for more details about what may be happening with your fans. To wit, my first guess is: your GPU card vendor set the acoustic limit of your fans too low, and possibly set the thermal throttling limit of your GPU too high. As the GPU heats up, that temperature bleeds over to the HBM that's situated next to the GPU chip. As the HBM memory starts to get too hot, it starts to get corruptions before the periodic refresh cycle comes around -- and you end up with corrupted memory causing a corrupted pointer and thus a crash.

This is only a hypothesis, though, and it will be a little difficult to test. Could you answer the following questions? They aren't meant to lay blame on you; I just want to be sure of the system setup before I start doing any deep dives to try to solve the problem. :)

jlgreathouse commented 5 years ago

Hi @sebpuetz -- since you say you're also observing the problem, could you also give us info about your GPU vendor and potentially answer those same questions?

twuebi commented 5 years ago

Thanks for the extensive answer!

A weird thing about the fans - for the benchmarks at https://github.com/tensorflow/benchmarks they actually ramp up.

> What GPU vendor is this from? You mentioned that it's a Radeon VII, but was this sold by e.g. PowerColor, Sapphire, ASRock, MSI, XFX, Gigabyte

Sapphire; the same is true for @sebpuetz.

> Are you overclocking the card in any way? This includes using the OverDrive functionality available in rocm-smi.
>
> Are you forcing frequencies in any way? This includes using rocm-smi --setsclk, or making any VBIOS modifications

No changes of the clock speeds.

> When you've tested rock-dkms, which version was this? 2.2? 2.3?

I ran both 2.2 and 2.3, and also a kernel 5 RC.

> Are you observing this problem directly after a reboot, or is there any pattern you follow to cause this problem to manifest?

No patterns besides the temperature thing.

> Would you be willing to make custom driver modifications to help print out some debug values? See the last section of my response above about thermal throttling for the type of work this would entail.

Sure!

> Could I ask you to attach a copy of your pptable to this ticket? To get this, please open a root prompt and run cat /sys/class/drm/card0/device/pp_table > ~/radeon_vii_pp_table.bin. You may need to zip this (I don't remember github's attachment policies)

radeon_vii_pp_table.zip

jlgreathouse commented 5 years ago

OK, this doesn't appear to be directly a pp_table issue. I see the maximum fan speed is 3850 RPM and the acoustic limit is 2900 RPM.

You say you have two situations (one benchmark where the fan doesn't ramp up, one where it does). For both of these situations, could you show me the values in the following files?

twuebi commented 5 years ago

> You say you have two situations (one benchmark where the fan doesn't ramp up, one where it does). For both of these situations, could you show me the values in the following files?

I have my own TensorFlow models and those contained in the benchmark repository. For my own models, like https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/files/3090116/transformer_model.py.txt, the fans don't ramp up. Sometimes they are at around 30% when the program starts; the temperatures then increase while the fan speed gradually decreases to a minimum of 20.xx-21.96%.

For the benchmarks the fan speeds do actually ramp up.

hwmon0 didn't have the files; I assume it's hwmon2 for me.

No ramp:

tobi@tobi-desktop:~$ for x in $(cat fans); do echo $x; cat $x; done;
/sys/class/hwmon/hwmon2/fan1_enable
0
/sys/class/hwmon/hwmon2/fan1_input
951
/sys/class/hwmon/hwmon2/fan1_max
3850
/sys/class/hwmon/hwmon2/fan1_min
0
/sys/class/hwmon/hwmon2/fan1_target
951
/sys/class/hwmon/hwmon2/pwm1
61
/sys/class/hwmon/hwmon2/pwm1_enable
2
/sys/class/hwmon/hwmon2/pwm1_max
255
/sys/class/hwmon/hwmon2/pwm1_min
0
tobi@tobi-desktop:~$ for x in $(cat fans); do echo $x; cat $x; done;
/sys/class/hwmon/hwmon2/fan1_enable
0
/sys/class/hwmon/hwmon2/fan1_input
931
/sys/class/hwmon/hwmon2/fan1_max
3850
/sys/class/hwmon/hwmon2/fan1_min
0
/sys/class/hwmon/hwmon2/fan1_target
931
/sys/class/hwmon/hwmon2/pwm1
61
/sys/class/hwmon/hwmon2/pwm1_enable
2
/sys/class/hwmon/hwmon2/pwm1_max
255
/sys/class/hwmon/hwmon2/pwm1_min
0

Ramp:

tobi@tobi-desktop:~$ for x in $(cat fans); do echo $x; cat $x; done;
/sys/class/hwmon/hwmon2/fan1_enable
0
/sys/class/hwmon/hwmon2/fan1_input
846
/sys/class/hwmon/hwmon2/fan1_max
3850
/sys/class/hwmon/hwmon2/fan1_min
0
/sys/class/hwmon/hwmon2/fan1_target
846
/sys/class/hwmon/hwmon2/pwm1
53
/sys/class/hwmon/hwmon2/pwm1_enable
2
/sys/class/hwmon/hwmon2/pwm1_max
255
/sys/class/hwmon/hwmon2/pwm1_min
0
tobi@tobi-desktop:~$ for x in $(cat fans); do echo $x; cat $x; done;
/sys/class/hwmon/hwmon2/fan1_enable
0
/sys/class/hwmon/hwmon2/fan1_input
1736
/sys/class/hwmon/hwmon2/fan1_max
3850
/sys/class/hwmon/hwmon2/fan1_min
0
/sys/class/hwmon/hwmon2/fan1_target
1736
/sys/class/hwmon/hwmon2/pwm1
114
/sys/class/hwmon/hwmon2/pwm1_enable
2
/sys/class/hwmon/hwmon2/pwm1_max
255
/sys/class/hwmon/hwmon2/pwm1_min
0
tobi@tobi-desktop:~$ for x in $(cat fans); do echo $x; cat $x; done;
/sys/class/hwmon/hwmon2/fan1_enable
0
/sys/class/hwmon/hwmon2/fan1_input
1960
/sys/class/hwmon/hwmon2/fan1_max
3850
/sys/class/hwmon/hwmon2/fan1_min
0
/sys/class/hwmon/hwmon2/fan1_target
1960
/sys/class/hwmon/hwmon2/pwm1
127
/sys/class/hwmon/hwmon2/pwm1_enable
2
/sys/class/hwmon/hwmon2/pwm1_max
255
/sys/class/hwmon/hwmon2/pwm1_min
0
tobi@tobi-desktop:~$ for x in $(cat fans); do echo $x; cat $x; done;
/sys/class/hwmon/hwmon2/fan1_enable
0
/sys/class/hwmon/hwmon2/fan1_input
2133
/sys/class/hwmon/hwmon2/fan1_max
3850
/sys/class/hwmon/hwmon2/fan1_min
0
/sys/class/hwmon/hwmon2/fan1_target
2133
/sys/class/hwmon/hwmon2/pwm1
140
/sys/class/hwmon/hwmon2/pwm1_enable
2
/sys/class/hwmon/hwmon2/pwm1_max
255
/sys/class/hwmon/hwmon2/pwm1_min
0

jlgreathouse commented 5 years ago

Yeah, the hwmon system changes around depending on what other things are in your system. Just to verify that hwmon2 is the right device, could you show the output of /sys/class/hwmon/hwmon2/device/device? I expect it will be 0x66af, which is the PCIe ID for the Radeon VII.

Could you also show me the following:

In addition, for those "no ramp" and "ramp" cases, can you also read /sys/class/hwmon/hwmon2/temp1_input along with the rest of the data? Sorry to ask you to grab that again.

sebpuetz commented 5 years ago

I had some time to log the values you requested over the course of about 10 minutes for the "no ramp" and "ramp" cases.

Both files contain tab-separated columns for: temp fan_enable fan1_input fan1_max fan1_min fan1_target pwm1 pwm1_enable pwm1_max pwm1_min

The rows were printed in 1-second intervals.
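A minimal sketch of such a logger (assuming the Radeon VII is hwmon2, as on the systems in this thread, and one row per second as in the attached files):

```python
#!/usr/bin/env python3
import time

# hwmon files requested above; the hwmon2 path is an assumption from this thread.
BASE = "/sys/class/hwmon/hwmon2/"
FILES = ["temp1_input", "fan1_enable", "fan1_input", "fan1_max", "fan1_min",
         "fan1_target", "pwm1", "pwm1_enable", "pwm1_max", "pwm1_min"]

print("\t".join(FILES))
while True:
    row = []
    for name in FILES:
        with open(BASE + name) as f:
            row.append(f.read().strip())
    print("\t".join(row), flush=True)
    time.sleep(1)  # one row per second, matching the attached logs
```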

> Yeah, the hwmon system changes around depending on what other things are in your system. Just to verify that hwmon2 is the right device, could you show the output of /sys/class/hwmon/hwmon2/device/device? I expect it will be 0x66af, which is the PCIe ID for the Radeon VII.

On my system that's the case, hwmon2/device/device shows 0x66af.

lspci.txt ppfeatures.txt

dmesg | grep "command line"
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.18.0-18-generic root=UUID=24308ee8-65de-4f1f-b8ae-65856197876b ro

twuebi commented 5 years ago

> Yeah, the hwmon system changes around depending on what other things are in your system. Just to verify that hwmon2 is the right device, could you show the output of /sys/class/hwmon/hwmon2/device/device? I expect it will be 0x66af, which is the PCIe ID for the Radeon VII.

 /sys/class/hwmon/hwmon2/device/device
0x66af

lspci ppfeatures

dmesg | grep "command line"
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-47-generic root=UUID=67797474-2609-4d32-bb98-ab804e619335 ro quiet splash vt.handoff=1

python transformer_model.py no_ramp.txt

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=inception4 ramp.txt

twuebi commented 5 years ago

I did some more experiments.

Besides the general model architecture, the difference between the benchmarks and my models is the way the graphs are executed. In my custom models, the session is run inside a Python loop using a feed dictionary. In the benchmarks, the inputs are fed using input queues. Using queues ensures that the next batch is always ready, minimizing idle GPU time (AFAIK in some cases the next batch is already prefetched to the GPU). With the feed dictionary, the GPU idles between each session call in Python.
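To illustrate the difference, a toy sketch of the two input styles (a hypothetical model using the TF 1.x API from this thread, not the attached transformer code):

```python
import numpy as np
import tensorflow as tf  # TF 1.x API, as used in this thread

w = tf.get_variable("w", [128, 10])

# Style 1: feed_dict -- the GPU idles while Python prepares each batch.
x = tf.placeholder(tf.float32, [None, 128])
loss_fd = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_fd = tf.train.GradientDescentOptimizer(0.01).minimize(loss_fd)

# Style 2: in-graph input pipeline -- batches are prefetched, keeping the GPU fed.
data = np.random.rand(10000, 128).astype(np.float32)
ds = tf.data.Dataset.from_tensor_slices(data).repeat().batch(64).prefetch(8)
x_q = ds.make_one_shot_iterator().get_next()
loss_q = tf.reduce_mean(tf.square(tf.matmul(x_q, w)))
train_q = tf.train.GradientDescentOptimizer(0.01).minimize(loss_q)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):  # one Python round trip per batch
        batch = np.random.rand(64, 128).astype(np.float32)
        sess.run(train_fd, feed_dict={x: batch})
    for _ in range(1000):  # data is pulled from the in-graph pipeline
        sess.run(train_q)
```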

My hypothesis is now that the idle time between the session calls somehow prevents the fans from being adjusted to the temperature.

To test this I altered my original model so that it only features a single session call with an infinite while loop in the graph (inf_transformer_model.py.txt). The result is a fan that adjusts to the temperature: ramp_transformer.txt.
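The in-graph loop follows roughly this pattern (a simplified sketch with a dummy workload, not the attached model; TF 1.x API):

```python
import tensorflow as tf  # TF 1.x API

x = tf.random_uniform([64, 128])
w = tf.get_variable("w", [128, 128])

def cond(i, acc):
    return i < 100000  # or tf.constant(True) for an effectively infinite loop

def body(i, acc):
    # One step of work entirely inside the graph; no Python round trip.
    return i + 1, acc + tf.reduce_sum(tf.matmul(x, w))

loop = tf.while_loop(cond, body, [tf.constant(0), tf.constant(0.0)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(loop)  # a single session call keeps the GPU continuously busy
```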

Another observation, which may be the reason for this behavior: the average power consumption of the feed-dictionary models fluctuates between 50 and 200W, while it seems to be much more consistent for the models without a feed dictionary.

edit 2:

I forgot to alter the header in the following files, the last column is the output of:

transformer_model.py.txt no_ramp_avg_power.txt

inf_transformer_model.py.txt: ramp_avg_power.txt

edit:

For the sake of completeness, tf_benchmark with power consumption:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=inception4 inception_ramp_avg_power.txt

twuebi commented 5 years ago

Any suggestions on how to proceed?

jlgreathouse commented 5 years ago

Hi @twuebi

If this is holding up your work, I would say that for now your best bet would be to manually set fan speeds using rocm-smi. We are working inside AMD to set up a reproducer for this and to target the bug report to the proper team. This won't be instantaneous, as this bug looks like it may cross multiple layers.
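For anyone needing the workaround in the meantime, a rough sketch of pinning the fan through the amdgpu hwmon interface used earlier in this thread (the hwmon2 path and the 75% level are taken from the reports above; rocm-smi's fan controls achieve the same thing, and this needs root):

```python
#!/usr/bin/env python3
# Sketch: pin the Radeon VII fan to ~75% duty cycle via the amdgpu hwmon
# interface. Assumes the card is hwmon2, as on the systems in this thread.
BASE = "/sys/class/hwmon/hwmon2/"

with open(BASE + "pwm1_enable", "w") as f:
    f.write("1")  # 1 = manual fan control, 2 = automatic (driver-controlled)

with open(BASE + "pwm1_max") as f:
    pwm_max = int(f.read())  # 255 on the cards reported here

with open(BASE + "pwm1", "w") as f:
    f.write(str(int(0.75 * pwm_max)))  # ~75% fan level
```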

twuebi commented 5 years ago

Following https://github.com/RadeonOpenCompute/ROCm/issues/564#issuecomment-428069668, I checked the FanTargetTemperature by uncommenting dump_pptable and its use in check_powerplay_tables in amd/powerplay/hwmgr/vega20_processpptables.c. Apparently it is set to 95°C. Setting it to 75 in the smum_send_msg_to_smc_with_parameter call in vega20_thermal_setup_fan_table in amd/powerplay/hwmgr/vega20_thermal.c solves the issue.

My theory is now that my vendor set the FanTargetTemperature far too high, and that the pauses between the session calls either drop the temperature enough to keep the fans from spinning up, or that the drops in power consumption trick the card into not spinning up the fans. During the calls the temperature may then spike into the critical range where the memory access faults occur.