ROCm / torch_migraphx

Libraries integrating migraphx with pytorch
BSD 3-Clause "New" or "Revised" License

[Issue]: New converter test causes core dump #149

Closed bpickrel closed 1 month ago

bpickrel commented 1 month ago

Problem Description

A new op converter and its test lead to a crashing bug in pytest. The symptom is a core dump before any of the new debug output appears. Replication code is in branch nll_loss_converter_crash_bug.

Commit 65ed2666e2ac does not show the failure; it produces only the expected test errors.

I suspect memory corruption not directly related to my code change, because adding debug code caused the error to come and go erratically. At one point a stack trace pointed to the function MGXModule.__initialize in the file torch_migraphx/py/torch_migraphx/fx/mgx_module.py, but I can't replicate this now.

Operating System

Ubuntu 20.04.6 LTS

CPU

AMD Ryzen Threadripper PRO 3955WX 16-Cores

GPU

AMD Radeon Pro W7900

ROCm Version

ROCm 6.1.0

ROCm Component

No response

Steps to Reproduce

Checkout branch nll_loss_converter_crash_bug
cd torch_migraphx/tests
pytest -k test_nll_loss_fx

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

This occurred in a docker container.

Output:

root@XXXXX:/workspace/torch_migraphx/tests# pytest -k test_nll_loss_fx
======================================================== test session starts =========================================================
platform linux -- Python 3.8.19, pytest-7.3.2, pluggy-1.5.0 -- /opt/conda/envs/py_3.8/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/workspace/torch_migraphx/tests/.hypothesis/examples')
rootdir: /workspace/torch_migraphx/tests
configfile: pytest.ini
plugins: flakefinder-1.1.0, xdist-3.3.1, rerunfailures-14.0, hypothesis-5.35.1, xdoctest-1.1.0, cpp-2.3.0
collected 653 items / 650 deselected / 3 selected                                                                                    

fx/converters/test_activations_fx.py::test_nll_loss_fx[inp_size=[3]-weight_size=0-reduction=mean] Fatal Python error: Aborted

Thread 0x00007feb537f6700 (most recent call first):
  File "/opt/conda/envs/py_3.8/lib/python3.8/threading.py", line 306 in wait
  File "/opt/conda/envs/py_3.8/lib/pytAborted (core dumped)

shivadbhavsar commented 1 month ago

There is a bug in test_nll_loss_fx. It's hitting L30 (target_size = inp_size[:1] + inp_size[2:]) even when your input has only one dimension, which I don't think is intended. The resulting target vector has length 3 when it should be 1.
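For reference, here is a minimal sketch (not the repository's test code) of how the nll_loss target shape relates to the input shape; the helper name make_target and the example sizes are hypothetical. The slice inp_size[:1] + inp_size[2:] only applies to inputs that have a batch dimension; for a 1-D (C,) input the target is a single class index.

import torch

def make_target(inp_size):
    # Hypothetical helper: build a class-index target whose shape is valid
    # for torch.nn.functional.nll_loss given an input of shape inp_size.
    if len(inp_size) == 1:
        # Unbatched (C,) input: the target is a single class index (0-dim).
        return torch.randint(inp_size[0], ())
    # Batched (N, C, d1, ...) input: the target has shape (N, d1, ...).
    target_size = inp_size[:1] + inp_size[2:]
    return torch.randint(inp_size[1], tuple(target_size))

print(make_target([3]).shape)        # torch.Size([]) -- a scalar class index
print(make_target([4, 3, 5]).shape)  # torch.Size([4, 5])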

As an aside, there is a lot of conditional logic based on size. pytest.mark.parametrize works best when the test setup is essentially the same for all inputs. In this scenario I highly recommend splitting the 1-dim case into a separate test to avoid this kind of logic-flow error.
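A minimal sketch of that suggestion, with hypothetical test names, sizes, and bodies (the real test would also run the converter and compare against the PyTorch result): keep the batched cases in one parametrized test and give the 1-dim case its own test, so no size-dependent branching is needed inside the test body.

import pytest
import torch

@pytest.mark.parametrize("inp_size", [[4, 3], [4, 3, 5]])
@pytest.mark.parametrize("reduction", ["mean", "sum"])
def test_nll_loss_batched(inp_size, reduction):
    inp = torch.randn(*inp_size)
    # Target shape (N, d1, ...) is only valid because a batch dim exists.
    target = torch.randint(inp_size[1], tuple(inp_size[:1] + inp_size[2:]))
    # ... build the module, convert it, and compare outputs here ...

@pytest.mark.parametrize("reduction", ["mean", "sum"])
def test_nll_loss_1d(reduction):
    inp = torch.randn(3)
    target = torch.randint(3, ())  # single class index for a (C,) input
    # ... build the module, convert it, and compare outputs here ...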

bpickrel commented 1 month ago

Resolved. The line described above isn't an error, but I found other closely related errors in the test.