Error in python tests/validate_network.py --device privateuseone:1

sukamenev commented 6 months ago

Tested on your original code

Testing  resnet18
Accessing device #1:AMD Radeon R9 Fury Series (radeonsi, fiji, LLVM 17.0.6, DRM 3.54, 6.6.12-calculate) on rusticl
LLVM ERROR: Cannot select: 0x7f3c70430b30: f32 = and 0x7f3c70424cc0, Constant:i32<2147483647>
  0x7f3c70424cc0: f32 = bitcast 0x7f3c7042ae70
    0x7f3c7042ae70: i32 = llvm.amdgcn.wwm TargetConstant:i64<2662>, 0x7f3c70424b00
      0x7f3c70430970: i64 = TargetConstant<2662>
      0x7f3c70424b00: i32 = llvm.amdgcn.readlane TargetConstant:i64<2528>, 0x7f3c7042bc00, Constant:i32<63>
        0x7f3c704254a0: i64 = TargetConstant<2528>
        0x7f3c7042bc00: i32,ch,glue = CopyFromReg # D:1 0x7f3c70425350, Register:i32 %367, 0x7f3c70425350:1
          0x7f3c70424da0: i32 = Register %367
          0x7f3c70425350: ch,glue = inlineasm # D:1 0x7f3c70424e10, TargetExternalSymbol:i64'; 4', MDNode:ch<null>, TargetConstant:i64<1>, TargetConstant:i32<1769482>, Register:i32 %367, TargetConstant:i32<-2147483639>, Register:i32 %368, 0x7f3c70424e10:1
            0x7f3c70424f60: i64 = TargetExternalSymbol'; 4'
            0x7f3c704303c0: i64 = TargetConstant<1>
            0x7f3c70424a20: i32 = TargetConstant<1769482>
            0x7f3c70424da0: i32 = Register %367
            0x7f3c704252e0: i32 = TargetConstant<-2147483639>
            0x7f3c7042b500: i32 = Register %368
            0x7f3c70424e10: ch,glue = CopyToReg # D:1 0x7f3c70430a50:1, Register:i32 %368, 0x7f3c7042b5e0
              0x7f3c7042b500: i32 = Register %368
              0x7f3c7042b5e0: i32 = bitcast # D:1 0x7f3c70424b70
                0x7f3c70424b70: f32 = fadd # D:1 0x7f3c704309e0, 0x7f3c7042b730
                  0x7f3c704309e0: f32 = fadd # D:1 0x7f3c70430200, 0x7f3c70425040

                  0x7f3c7042b730: f32 = bitcast # D:1 0x7f3c7042b0a0

        0x7f3c7042bb20: i32 = Constant<63>
  0x7f3c70430ac0: i32 = Constant<2147483647>
In function: main
Emergency stop
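
A note on reading the failing node, offered as an interpretation rather than a confirmed diagnosis: the instruction selector gives up on an `and` whose result type is `f32` and whose second operand is `Constant:i32<2147483647>`. That constant is `0x7FFFFFFF`, every bit of a 32-bit word except the sign bit, so the node looks like an absolute value written as a bit mask. Its input comes from `llvm.amdgcn.readlane` on lane 63 wrapped in `llvm.amdgcn.wwm`, which presumably belongs to a wavefront-wide reduction. The sketch below only illustrates what the mask does to a 32-bit float; the helper names are made up for this example and are not part of dlprimitives or Mesa.

```python
import struct

# The Constant:i32 operand of the `and` node that LLVM could not select.
ABS_MASK = 2147483647


def float_bits(x: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_float(b: int) -> float:
    """Reinterpret a 32-bit unsigned integer as an IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def masked_abs(x: float) -> float:
    """Clear the sign bit, which is what `and <f32 bits>, 0x7FFFFFFF` computes."""
    return bits_float(float_bits(x) & ABS_MASK)


print(hex(ABS_MASK))      # 0x7fffffff
print(masked_abs(-1.5))   # 1.5
```

If this reading is right, the kernels worth checking first are the ones that take an absolute value of a reduced result.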

artyom-beilis commented 6 months ago

Is it a 32- or 64-bit architecture? We need to track down which kernel fails.
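
One possible way to narrow this down, given that the LLVM error aborts the whole process: run each candidate operation in its own Python interpreter and see which ones exit abnormally. This is only a sketch. `find_crashing_snippets` and the demo snippets are hypothetical names for this example; real snippets would load the backend and run a single operator on the failing device, which depends on the local build and is left out here.

```python
import subprocess
import sys


def find_crashing_snippets(snippets, timeout=600):
    """Run each code snippet in a separate interpreter.

    Returns the names of the snippets whose process ended with a
    non-zero exit code (a hard abort also shows up as non-zero).
    """
    failed = []
    for name, code in snippets.items():
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if proc.returncode != 0:
            failed.append(name)
            print(f"{name}: exit code {proc.returncode}")
            # Show the tail of stderr, where the compiler error would appear.
            print(proc.stderr[-500:])
    return failed


if __name__ == "__main__":
    demo = {
        "ok": "x = 1 + 1",
        # Stands in for a snippet whose process is killed by the driver.
        "crash": "import sys; sys.exit(3)",
    }
    print(find_crashing_snippets(demo))
```

Running every snippet in a child process keeps one crash from hiding the results of the others, at the cost of reloading the backend each time.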

artyom-beilis commented 6 months ago

I also suggest trying the official AMD drivers, not only Mesa.

I recall that for the AMD 560 the closed-source drivers worked far better than the Mesa ones. Also check whether the ROCm drivers still work on Fiji; they are also better.

sukamenev commented 6 months ago

Is it a 32- or 64-bit architecture? We need to track down which kernel fails.

My CPU has a 64-bit architecture. As for the GPU, GCN 3 (Fiji), I don't know how many bits its architecture is.

Quote from AMD docs:

Every instruction is described with either 32 bits or 64 bits of microcode.
• Vector Memory instructions are 64 bits.
• Exports are 64 bits.
• LDS and GDS are 64 bits.
• Scalar ALU instructions are 32 bits but can have an additional 32 bits of literal constant data.
• Vector ALU instructions can be 32 bits or 64 bits. The 32-bit versions can have an additional 32 bits of literal constant data.

sukamenev commented 6 months ago

With AMD OpenCL from amdgpu-pro there is also an error:

python tests/validate_network.py --device privateuseone:3
Testing  resnet18
Accessing device #3:Fiji on AMD Accelerated Parallel Processing
Traceback (most recent call last):
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/validate_network.py", line 280, in <module>
    main(r)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/validate_network.py", line 221, in main
    train_on_images(m,batch,args.device,args.eval,iter_size = args.iter_size,opt_steps = args.opt,fwd=args.fwd)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/validate_network.py", line 105, in train_on_images
    ref = step(model,data,labels,opt_steps,iter_size,fwd=fwd,test=test)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/validate_network.py", line 85, in step
    loss.backward()
  File "/home/inetstar/Kamenev/programming/ZenDnn/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/inetstar/Kamenev/programming/ZenDnn/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: could not create a primitive descriptor iterator

sukamenev commented 6 months ago

I also suggest trying the official AMD drivers, not only Mesa.

I recall that for the AMD 560 the closed-source drivers worked far better than the Mesa ones. Also check whether the ROCm drivers still work on Fiji; they are also better.

Thank you! I got an 8-9% speed improvement with the amdgpu-pro OpenCL drivers.

artyom-beilis / pytorch_dlprim

Error in python tests/validate_network.py --device privateuseone:1 #70