artyom-beilis / pytorch_dlprim

DLPrimitives/OpenCL out of tree backend for pytorch
http://blog.dlprimitives.org/
MIT License

Crash when trying to use pytorch/glow, which was built on pytorch/opencl #7

Open teena3 opened 2 years ago

teena3 commented 2 years ago

Hi,

I am trying to use pytorch/glow with the OpenCL backend enabled. I want to compare GPU inference time for PyTorch with Glow enabled and disabled, so I built PyTorch with OpenCL support as instructed in this repo.

The crash does not occur when the model and data are not copied to the GPU in infer_glow() via something.to('opencl:0'), i.e. when the lines below are commented out:

  lowered_model = lowered_model.to(device=device)
  inputs = inputs.to(device=device)

Could you please help me understand the issue? The gdb backtrace follows:

  [New Thread 0x7fff74ffd700 (LWP 6990)]
  malloc_consolidate(): invalid chunk size

  Thread 1 "python3.7" received signal SIGABRT, Aborted.
  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
  51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
  (gdb) bt
  #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
  #1  0x00007ffff7a227f1 in __GI_abort () at abort.c:79
  #2  0x00007ffff7a6b837 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7b98a7b "%s\n") at ../sysdeps/posix/libc_fatal.c:181
  #3  0x00007ffff7a728ba in malloc_printerr (str=str@entry=0x7ffff7b9a2d8 "malloc_consolidate(): invalid chunk size") at malloc.c:5342
  #4  0x00007ffff7a72b5e in malloc_consolidate (av=av@entry=0x7ffff7dcdc40 <main_arena>) at malloc.c:4471
  #5  0x00007ffff7a76848 in _int_malloc (av=av@entry=0x7ffff7dcdc40 <main_arena>, bytes=bytes@entry=4096) at malloc.c:3713
  #6  0x00007ffff7a792ad in __GI___libc_malloc (bytes=4096) at malloc.c:3075
  #7  0x00007fffe09fc150 in traced_realloc () from /usr/local/lib/python3.7/dist-packages/pandas/_libs/hashtable.cpython-37m-x86_64-linux-gnu.so
  #8  0x00007fffe09fc44b in ?? () from /usr/local/lib/python3.7/dist-packages/pandas/_libs/hashtable.cpython-37m-x86_64-linux-gnu.so
  #9  0x00007fffe09fefbc in ?? () from /usr/local/lib/python3.7/dist-packages/pandas/_libs/hashtable.cpython-37m-x86_64-linux-gnu.so
  #10 0x0000000000588f15 in ?? ()

gdb_bt.txt

uname -a:

  Linux ip-192-168-1-210 5.4.0-1071-aws #76~18.04.1-Ubuntu SMP Mon Mar 28 17:49:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Commit IDs of the setup:

  dlprimitives: 6eb5794aec7b48fe2e2b8d1fa7b1eab712d72d87
  pytorch-dlprim: 7ec2e47cd56fdad86e08d3aff65f7c35fc89b575
  pytorch: eb74af18af6e90ae47f24997af8468bf7b9deb72
  glow: cda5383b1609ebad1a3631ca77b41b8a863443d4

I built glow with a few adaptations, because the pytorch commit above is somewhat older: git_diff.txt

clinfo: clinfo.txt

Python code: opencl_pytorch_glow.txt (I was not able to upload a .py file here, so I converted it to .txt.)

I also tried to use:

  traced_m = torch.jit.trace(resnet.to('opencl:0'), (x.to('opencl:0')))

This fails with the error below:

  torch.jit._trace.TracingCheckError: Tracing failed sanity checks!
  encountered an exception while running the trace with test inputs.
  Exception:
        Unknown device for graph fuser

Please let me know if you need any more information.
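For the timing comparison mentioned at the top of this report, a small harness along the following lines may help keep the Glow and non-Glow measurements comparable. It is only a sketch: `time_inference`, `summarize`, and `fake_inference` are hypothetical names, not part of pytorch, glow, or this repo. It also assumes that the callable passed in blocks until the result is actually available (for example by copying the output back to the CPU), since a device backend may otherwise return before the work has finished.

```python
import statistics
import time


def time_inference(run_once, warmup=3, repeats=10):
    """Return a list of wall-clock timings (seconds), one per timed call.

    run_once is any zero-argument callable that performs one inference
    and returns only after the result is ready (assumption, see above).
    """
    # Warm-up calls are executed but not timed, so that one-off costs
    # (lazy initialization, kernel compilation, caches) are excluded.
    for _ in range(warmup):
        run_once()

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings to a few robust summary numbers."""
    return {
        "runs": len(timings),
        "min_s": min(timings),
        "median_s": statistics.median(timings),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a call into the real model.
    def fake_inference():
        return sum(i * i for i in range(10_000))

    print(summarize(time_inference(fake_inference)))
```

The same `time_inference` call can then be made once with the Glow-lowered model and once with the plain model, and the two summaries compared.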

artyom-beilis commented 2 years ago

A few things:

  1. I have never used glow and don't really know how it works or what its role is, so it is quite hard for me to understand the example.
  2. Can you create the simplest possible example, probably with the simplest ops (like 1-2 fully connected layers), so I can reproduce it?
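A reduced test case of the kind requested in point 2 could look roughly like the sketch below. It is not the reporter's code: `TinyMLP` and `run_repro` are hypothetical names, the layer sizes are arbitrary, and glow is left out entirely so that the device copy and `torch.jit.trace` can be checked on their own. The device defaults to `"cpu"` so the script runs anywhere; passing `"opencl:0"` assumes the backend has already been loaded as described in this repository.

```python
import torch
import torch.nn as nn


class TinyMLP(nn.Module):
    """Two fully connected layers with a ReLU in between."""

    def __init__(self, in_features=8, hidden=16, out_features=4):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, out_features)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


def run_repro(device="cpu", use_trace=False):
    """Run one forward pass on `device` and return the output on the CPU."""
    torch.manual_seed(0)  # same weights and input on every call
    model = TinyMLP().eval().to(device)
    x = torch.randn(2, 8).to(device)

    if use_trace:
        # Same call shape as in the report, but on a tiny model.
        model = torch.jit.trace(model, (x,))

    with torch.no_grad():
        y = model(x)
    return y.to("cpu")


if __name__ == "__main__":
    print(run_repro("cpu"))
    print(run_repro("cpu", use_trace=True))
```

If `run_repro("opencl:0")` works but `run_repro("opencl:0", use_trace=True)` fails, that would suggest the problem is in the tracing path rather than in the device copy itself.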

traced_m = torch.jit.trace(resnet.to('opencl:0'), (x.to('opencl:0'))) ... Unknown device for graph fuser

This is probably another place that needs to know the device, or something else. I must say the OpenCL backend is really in its early stages, so there are many things that are likely not going to work and need to be fixed.

Also, I get a different error when resnet is resnet50 or resnet18:

NotImplementedError: Could not run 'aten::isnan' with arguments from the 'PrivateUse1' backend

aten::isnan is not implemented in this backend yet.