ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

[Jenkins] Segmentation Fault in test_conv_extra config #226

Open daniellowell opened 4 years ago

daniellowell commented 4 years ago

bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 4, 64, 14, 14 --weights 24, 64, 5, 5 --pads_strides_dilations 2 2 1 1 1 1 --trans_output_pads 0 0

and

bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 1, 1, 1, 1 --weights 1, 1, 3, 3 --pads_strides_dilations 1 1 2 2 2 1 --trans_output_pads 0 0

These configs fail on our Jenkins CI (gfx908). Will temporarily disable them until resolved.

Disabled in: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/228

atamazov commented 4 years ago

Analysis of a failing case

For now, we save each binary in the same thread where it is built. If we build several kernels in parallel, the saves may happen concurrently. Assembling is very fast, so concurrent saves are more likely after assembly than after an OCL or HIP build.
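To make the suspected pattern concrete, here is a minimal Python sketch of "build in a worker thread, then save from that same thread", with the saves serialized by a lock. It is not MIOpen code: the kern_db table layout, the build_kernel and save_binary helpers, and the save_lock are all hypothetical, and Python's sqlite3 module only stands in for the SQLite binary cache. The delete-then-insert order follows the RemoveRecord entries that come after each SaveBinary line in the logs.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical lock: serializes every write to the shared cache connection.
save_lock = threading.Lock()


def build_kernel(name):
    """Stand-in for assembling a kernel; returns fake binary bytes."""
    return name.encode() * 4


def save_binary(conn, name, blob):
    """Replace the cached record for `name` while holding the lock."""
    with save_lock:
        conn.execute("DELETE FROM kern_db WHERE name = ?", (name,))
        conn.execute(
            "INSERT INTO kern_db (name, blob) VALUES (?, ?)", (name, blob)
        )
        conn.commit()


def build_and_save(conn, name):
    """Build, then save from the same worker thread, as described above."""
    save_binary(conn, name, build_kernel(name))


# One connection shared by all workers; check_same_thread=False allows that,
# but the caller is then responsible for serializing access (the lock above).
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE kern_db (name TEXT PRIMARY KEY, blob BLOB)")

names = [
    "conv_3x3_wheel_alpha_v9_0_15.s",
    "Conv_Winograd_v16_5_0_stride1.s",
    "Conv_Winograd_v20_5_23_M_stride1.s",
]

with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(lambda n: build_and_save(conn, n), names))

saved = sorted(row[0] for row in conn.execute("SELECT name FROM kern_db"))
print(saved)
```

In this sketch, all three kernels end up in the table because the lock guarantees that only one thread touches the shared connection at a time. Whether the real failure comes from missing serialization of this kind is exactly what is not yet established.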

In the attached logs, three asm kernels are built in parallel (download and open in diff tool for details):

The failure happens just after the build, prior to any [PrepareInvoker] and [EvaluateInvokers] calls, most likely during [SaveBinary]. Suspicious fragments are shown below. Good one:

MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906 - -o /tmp/miopen-tmp-1979-aa97-3310-4d0d/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906 - -o /tmp/miopen-tmp-a828-0239-1dce-3d85/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906 - -o /tmp/miopen-tmp-75ab-f316-ef3b-746b/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file 
MIOpen(HIP): Info [KernDb] database not present
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: conv_3x3_wheel_alpha_v9_0_15.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 7e-05 ms
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: Conv_Winograd_v16_5_0_stride1.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 0.000561 ms
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: Conv_Winograd_v20_5_23_M_stride1.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx906
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 0.00012 ms
MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: miopenSp3AsmConvRxSf3x2
MIOpen(HIP): Info2 [GetSolution]  N=8 C=128 H=28 W=28 K=128 n_groups=60 flags=7 R=1 S=1 pad_H=0 pad_W=0 out_H=28 out_W=28
MIOpen(HIP): Info2 [GetSolution] ...flags=519 d_N_stride=401408 d_C_stride=3136 f_K_stride=4 f_C_stride=512 o_N_stride=401408 o_K_stride=3136
MIOpen(HIP): Info [EvaluateInvokers] ConvBinWinogradRxSf3x2: miopenSp3AsmConvRxSf3x2: 0.092 < 3.40282e+38
MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: miopenSp3AsmConv_group_20_5_23_M_stride1

Bad:

MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908 - -o /tmp/miopen-tmp-58f3-cdab-b8aa-e17f/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908 - -o /tmp/miopen-tmp-c90c-4e51-5e89-1c82/amdgcn-asm-out-XXXXXX'
MIOpen(HIP): Info2 [AmdgcnAssemble] ' -x assembler -target amdgcn--amdhsa -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908 - -o /tmp/miopen-tmp-e7c4-5f56-e91f-1565/amdgcn-asm-out-XXXXXX'
UndefinedBehaviorSanitizer:DEADLYSIGNAL
==25716==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f5ac7d40501 bp 0x7f5a939fd9b0 sp 0x7f5a939fd8e0 T25728)
==25716==The signal is caused by a READ memory access.
==25716==Hint: address points to the zero page.
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file 
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file
MIOpen(HIP): Info [KernDb] database not present
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: conv_3x3_wheel_alpha_v9_0_15.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908
MIOpen(HIP): Info [KernDb] database not present
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 0.001303 ms
MIOpen(HIP): Info2 [SaveBinary] Saving binary for: Conv_Winograd_v16_5_0_stride1.s ;args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -mcpu=gfx908
MIOpen(HIP): Info2 [Measure] Db::RemoveRecord time: 0.000237 ms
    #0 0x7f5ac7d40501  (/opt/rocm/bin/../lib/libhip_hcc.so+0x19501)
    #1 0x7f5ac7d3a29b  (/opt/rocm/bin/../lib/libhip_hcc.so+0x1329b)
    #2 0x7f5ac7dc146d  (/opt/rocm/bin/../lib/libhip_hcc.so+0x9a46d)
    #3 0x7f5ac7dc2918  (/opt/rocm/bin/../lib/libhip_hcc.so+0x9b918)
    #4 0x7f5acc3aefc9  (/var/jenkins/workspace/en_wrw-igemm-v4r4xdlops-fp32-fix/build/lib/libMIOpen.so.1+0x3ec5fc9)
    #5 0x7f5acc3c112d  (/var/jenkins/workspace/en_wrw-igemm-v4r4xdlops-fp32-fix/build/lib/libMIOpen.so.1+0x3ed812d)
... 

Note the suspicious extra "MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file" and "MIOpen(HIP): Info [KernDb] database not present" lines in the failing log (however, these may relate to something else).

I tend to think that the reason is that the SQLite binary cache is not fully MT-safe yet. The issues disappear after switching to a file-based binary cache.

Perhaps it is enough to stop removing the binary from the cache when the cache is disabled, here: https://github.com/ROCmSoftwarePlatform/MIOpen/blob/3ed5e8569ec82ea2d5f8706d279e2074ceeb8a5a/src/binary_cache.cpp#L163-L164

However, even if this resolves the problem, it seems worth clearly identifying the root cause of the issue first.

/cc @JehandadKhan

JehandadKhan commented 4 years ago

@atamazov There is no harm in removing that line. However, please note that since the case at hand is an empty database, it would return from the call immediately here

aserio commented 4 years ago

This issue can be closed upon PR #240 being merged.

atamazov commented 4 years ago

This is no longer so. I just restored #226 in #240 because of this failure during "Full long tests / FP32 gfx908 Hip Release All subset":

../bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 4, 64, 14, 14 --weights 24, 64, 5, 5 --pads_strides_dilations 2 2 1 1 1 1 --trans_output_pads 0 0 
MIOpen(HIP): Warning [Prefetch] File is unreadable: /var/jenkins/workspace/Libs_MIOpen_fix-jenkins-failures/src/kernels/gfx90878.HIP.fdb.txt
error: 4.71152e-08
Max diff: 0.00146484
Mismatch at 0: 1309 != 1309
Forward convolution: fft
Input tensor: 4, 64, 14, 14
Weights tensor: 24, 64, 5, 5
Output tensor: 
Filter: conv2d, miopenConvolution, miopenPaddingDefault, {2, 2}, {1, 1}, {1, 1}, 
../bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 4, 64, 14, 14 --weights 24, 64, 5, 5 --pads_strides_dilations 2 2 1 1 1 1 --trans_output_pads 0 0 
error: 0
Max diff: 0
Backward convolution: ConvBinWinogradRxS
Input tensor: 4, 64, 14, 14
Weights tensor: 24, 64, 5, 5
Output tensor: 4, 24, 14, 14
Filter: conv2d, miopenConvolution, miopenPaddingDefault, {2, 2}, {1, 1}, {1, 1}, 
../bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 4, 64, 14, 14 --weights 24, 64, 5, 5 --pads_strides_dilations 2 2 1 1 1 1 --trans_output_pads 0 0 
Memory access fault by GPU node-4 (Agent handle: 0x1dbb0f0) on address 0x7f9f09444000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
test/CMakeFiles/test_conv_extra.dir/build.make:57: recipe for target 'test/CMakeFiles/test_conv_extra' failed

Needs investigation.
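A side note on reading the first failure in the log above: "Mismatch at 0: 1309 != 1309" next to "Max diff: 0.00146484" looks contradictory, but it could simply be a printing artifact. The Python sketch below shows two distinct float32 values near 1309 that both print as 1309 at 6 significant digits (the default precision of C++ streams). The values are illustrative, chosen only so that their difference matches the reported maximum difference. The actual mismatching values are not known from the log, and that the test prints with default precision is an assumption.

```python
import struct


def to_float32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Illustrative values only; the real tensor elements are not in the log.
expected = to_float32(1309.0)
ulp = 2.0 ** -13  # float32 spacing for values in [1024, 2048)
actual = to_float32(expected + 12 * ulp)

diff = actual - expected

# "%g" keeps 6 significant digits, like a C++ stream at default precision.
print("Mismatch: %g != %g" % (actual, expected))
print("diff: %g" % diff)
```

If this reading is right, the fft forward result is off by roughly a dozen float32 ULPs at that element rather than being grossly wrong, which would be a separate matter from the memory access fault in the third run.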

aserio commented 4 years ago

@atamazov, do you have a status update on this issue?

atamazov commented 4 years ago

No.

aserio commented 4 years ago

@daniellowell, please assign someone to this blocking issue.

aserio commented 4 years ago

@daniellowell, do you have an update on this issue?

aserio commented 4 years ago

Only reproducible in Jenkins afaik

atamazov commented 4 years ago

@aserio Why was the priority changed from blocker to unknown?

aserio commented 4 years ago

@atamazov, In both cases @daniellowell suggested the de-escalation. In the last change (from priority_high to priority_unknown) Daniel noted that we are unable to reproduce the issue outside of Jenkins. Feel free to re-prioritize the ticket if you have some insight here!

atamazov commented 4 years ago

I see. I conclude that @daniellowell thinks the reason for our inability to reproduce these errors is the "unusual" (or somewhat incorrect) Jenkins environment (which is not vanilla ROCm). This sounds reasonable. Let's assign low_priority. Eventually, after some Jenkins upgrade, we shall re-enable these tests and see what happens.

atamazov commented 4 years ago

If we take the above as the current hypothesis, then this is not a bug.

ppanchad-amd commented 5 months ago

Hi, is this still reproducible on Jenkins? If not, can we close this bug?

atamazov commented 5 months ago

We need to assign someone and investigate. Do not close.