artyom-beilis / pytorch_dlprim

DLPrimitives/OpenCL out of tree backend for pytorch
http://blog.dlprimitives.org/
MIT License

Updating documents and PyTorch compatibility for all systems? #77

Closed tangjinchuan closed 1 month ago

tangjinchuan commented 3 months ago

Hi Artyom, I tried with Apple Silicon M1, python 3.12, pytorch 2.3.1 with the following setting code:

    torch.ops.load_library("/Users/tjc/Documents/libpt_ocl.dylib")
    torch.utils.rename_privateuse1_backend('ocl')
    torch._register_device_module('ocl','opencl') # as required by Pytorch 2.0 ?

See "Register new backend module to Pytorch": https://pytorch.org/tutorials//advanced/privateuseone.html#register-new-backend-module-to-pytorch

Could you please update the documentation for all platforms? I am happy to help. Meanwhile, it reports the following RuntimeError: Please register PrivateUse1HooksInterface by RegisterPrivateUse1HooksInterface first. I guess this is a new requirement in PyTorch 2.0?

Finally, my first test case, run on my own simple code, hits the following bug. Is there any plan to implement the following?

    /Users/tjc/PycharmProjects/pythonProject8/.venv/bin/python /Users/tjc/PycharmProjects/pythonProject8/main(2).py
    The True MI is 0.658541
    Accessing device #0:Apple M1 on Apple
    /Users/tjc/PycharmProjects/pythonProject8/main(2).py:101: UserWarning: The operator 'aten::index.Tensor_out' is not currently supported on the ocl backend. Please open an issue at for requesting support https://github.com/artyom-beilis/pytorch_dlprim/issues (Triggered internally at /Users/tjc/Documents/pytorch_dlprim/src/tensor_ops.cpp:313.)
      Y_SHUFFLE = Y[torch.randperm(Y.size(0))]
      0%| | 0/5000 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "/Users/tjc/PycharmProjects/pythonProject8/main(2).py", line 135, in <module>
        loss.backward() # run backpropagation backward through the network from the loss variable
        ^^^^^^^^^^^^^^^
      File "/Users/tjc/PycharmProjects/pythonProject8/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 522, in backward
        torch.autograd.backward(
      File "/Users/tjc/PycharmProjects/pythonProject8/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 288, in backward
        _engine_run_backward(
      File "/Users/tjc/PycharmProjects/pythonProject8/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
        return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    RuntimeError: Please register PrivateUse1HooksInterface by RegisterPrivateUse1HooksInterface first.

    Process finished with exit code 1

Best wishes, Jinchuan

tangjinchuan commented 3 months ago

The test.py is fine. Therefore, the hook problem above is quite similar to https://github.com/artyom-beilis/pytorch_dlprim/issues/58

    /Users/tjc/PycharmProjects/pythonProject8/.venv/bin/python /Users/tjc/Documents/pytorch_dlprim/test.py
    Accessing device #0:Apple M1 on Apple
    REF
    [[[  0 137]
      [  0   0]
      [255 255]
      [  0 175]]

     [[  0   0]
      [  0 255]
      [  0 247]
      [  0 128]]

     [[ 19   0]
      [  0 255]
      [ 88   0]
      [  0   0]]]
    DEV
    [[[  0 137]
      [  0   0]
      [255 255]
      [  0 175]]

     [[  0   0]
      [  0 255]
      [  0 247]
      [  0 128]]

     [[ 19   0]
      [  0 255]
      [ 88   0]
      [  0   0]]]
    0.0

    Process finished with exit code 0

artyom-beilis commented 3 months ago

Can you please try with pytorch 1.13... I know that there are some issues with newer ones. I think I checked 2.0

I need to do some serious testing on multiple versions of PyTorch; it is just too hard to keep up with them :-)

tangjinchuan commented 3 months ago

Yes, 1.13 passes. The speed of an MLP is only half that of Apple's 'mps'. I guess I need to report the device info to you, as I did last time with the Intel Arc A770 16G, so that you can tune it? It seems like the Apple silicon Max 2 on gemm.cpp did not work properly or the same? https://github.com/artyom-beilis/pytorch_dlprim/issues/10#issuecomment-1892229711

    /Users/tjc/PycharmProjects/pythonProject9/.venv/bin/python /Users/tjc/PycharmProjects/pythonProject9/aaa.py
    The True MI is 0.658629

    A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.0 as it may crash. To support both 1.x and 2.x versions of NumPy, modules must be compiled with NumPy 2.0. Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

    If you are a user of the module, the easiest solution will be to downgrade to 'numpy<2' or try to upgrade the affected module. We expect that some modules will need time to support NumPy 2.

    Traceback (most recent call last):
      File "/Users/tjc/PycharmProjects/pythonProject9/aaa.py", line 94, in <module>
        model = Net().to(device)
      File "/Users/tjc/PycharmProjects/pythonProject9/aaa.py", line 83, in __init__
        self.fc1 = nn.Linear(1, H) # fc:fully connected
      File "/Users/tjc/PycharmProjects/pythonProject9/.venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 96, in __init__
        self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
    /Users/tjc/PycharmProjects/pythonProject9/.venv/lib/python3.10/site-packages/torch/nn/modules/linear.py:96: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:77.)
      self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
    Accessing device #0:Apple M1 on Apple
    /Users/tjc/PycharmProjects/pythonProject9/aaa.py:102: UserWarning: The operator 'aten::index.Tensor_out' is not currently supported on the ocl backend. Please open an issue at for requesting support https://github.com/artyom-beilis/pytorch_dlprim/issues (Triggered internally at /Users/tjc/Documents/pytorch_dlprim310/src/tensor_ops.cpp:313.)
      Y_SHUFFLE = Y[torch.randperm(Y.size(0))]
    100%|██████████| 5000/5000 [00:39<00:00, 127.34it/s]
    execution_time= 39.29358887672424
    MINE= [-0.00492738 0.01242018 0.03254801 ... 0.68931293 0.71951771 0.72044754]
    True MI= [0.65862906 0.65862906 0.65862906 ... 0.65862906 0.65862906 0.65862906]
    execution_time= 39.29358887672424

For anyone who needs a working library, here is the attachment: libpt_ocl python310 pip install torch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 .zip

artyom-beilis commented 3 months ago

I started testing new PyTorch versions once again, and there are some basic build gaps...

I'm checking with dev-discuss to see the issues, for example:

find_package(Torch REQUIRED)

Fails under 2.3.1

tangjinchuan commented 3 months ago

I saw your post: https://discuss.pytorch.org/t/find-package-torch-required-fails-2-3-1-and-nightly/205248 I guess it is due to a self-compiled version of PyTorch. Installing stable PyTorch via pip3 on Ubuntu 24.04, Windows 11, and the latest macOS on M1 caused no problems for me or my student.

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

More about the problem as well as the PR might be here: https://github.com/pytorch/pytorch/issues/118862

artyom-beilis commented 3 months ago

Actually, that is exactly how I install it: I use the official version. Since 1.13, out-of-tree backend support has improved to the point where I don't need to touch any PyTorch code.

tangjinchuan commented 3 months ago

Let me double-check. I am confused: why is it called pt_3.12_nightly in your file location?

tangjinchuan commented 3 months ago

For nightly, do you need to change it from:

    -DCMAKE_PREFIX_PATH=$VIRTUAL_ENV/lib/python3.12/site-packages/torch/share/cmake/Torch

to

    -DCMAKE_PREFIX_PATH=$VIRTUAL_ENV/pt_3.12_nightly/lib/python3.12/site-packages/torch/share/cmake/Torch

OK, I have seen you updated cmake and solved it.
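
For readers who hit the same path confusion, here is a minimal sketch of deriving the flag from the active Python environment instead of typing the path by hand. The helper names `torch_cmake_dir` and `cmake_prefix_flag` are hypothetical and not part of pytorch_dlprim or PyTorch; the sketch only assumes the `torch/share/cmake/Torch` layout shown in the paths above.

```python
import sysconfig
from pathlib import Path


def torch_cmake_dir(site_packages=None) -> Path:
    """Return <site-packages>/torch/share/cmake/Torch.

    Uses the given site-packages directory, or the one belonging to the
    interpreter that runs this script (for example, the active virtualenv).
    """
    base = Path(site_packages) if site_packages else Path(sysconfig.get_paths()["platlib"])
    return base / "torch" / "share" / "cmake" / "Torch"


def cmake_prefix_flag(site_packages=None) -> str:
    """Build the -DCMAKE_PREFIX_PATH=... argument for the cmake command line."""
    return f"-DCMAKE_PREFIX_PATH={torch_cmake_dir(site_packages).as_posix()}"


if __name__ == "__main__":
    # Prints the flag for whichever environment this interpreter lives in.
    print(cmake_prefix_flag())
```

Running it with the interpreter from the nightly environment would print the path for that environment, so the flag follows whichever virtualenv is active.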

artyom-beilis commented 3 months ago

Ok it looks like most critical issues. I went over the docs and a lot has changed.

Need to do lots of fixes to make it all work (backend registration etc)

artyom-beilis commented 3 months ago

Currently I'm stuck on PyTorch 2.3 and nightly with these issues: https://dev-discuss.pytorch.org/t/pytorch-out-of-tree-backend-updates-changes-question/2189/1

I hope I'll understand how to fix it with the developer's support

tangjinchuan commented 3 months ago

Hi, sorry for not seeing the message sooner. I have been busy, including preparing a group's official visa applications to visit Germany (Lower Saxony) for the whole of next month. I can buy you a beer if you are nearby.

I did mention "#torch._register_device_module('ocl','opencl') # as required by Pytorch 2.0 ?", but I was not able to figure out the correct usage. It is always better when the PyTorch community gives the correct answer.

artyom-beilis commented 3 months ago

Following this: https://dev-discuss.pytorch.org/t/find-package-torch-required-fails-2-3-1-and-nightly/2176/7

I got the module registration working, but from 2.3 there is a new interface that out-of-tree backends need to implement. I am studying it right now. To be on the safe side, stay on 1.13 for now, since 2.2 has other issues that were only fixed in 2.4 (like foreach operators not falling back to running one by one).
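
To illustrate what "falling back to running one by one" means, here is a toy sketch on plain Python floats. It is not PyTorch dispatcher code; `foreach_apply` and its parameters are hypothetical names used only for this illustration.

```python
from typing import Callable, List, Optional, Sequence


def foreach_apply(
    values: Sequence[float],
    op: Callable[[float], float],
    fused: Optional[Callable[[Sequence[float]], List[float]]] = None,
) -> List[float]:
    """Apply op to every element, preferring a fused batch kernel when one exists."""
    if fused is not None:
        return fused(values)        # fast path: one call handles the whole list
    return [op(v) for v in values]  # fallback: run the per-element op one by one


# No fused kernel is supplied, so the fallback path is taken.
doubled = foreach_apply([1.0, 2.0, 3.0], lambda v: v * 2.0)
```

A backend that provides no fused implementation still gets correct results through the second branch, only more slowly.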

Once I implement the new interface and test it I hope pytorch OpenCL backend will work with 2.4 onwards.

I asked in the discussion to make sure such changes are published so backend developers can prepare in advance.

artyom-beilis commented 2 months ago

Now 2.4 works. PyTorch below 2.4 will fail due to lack of __foreach__ functions; their support was fixed in 2.4.

In 2.4 you need to call

        torch.utils.rename_privateuse1_backend('ocl')
        torch._register_device_module("ocl", object())

There are still more improvements needed, but 2.4 now works. All networks validated.
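
A minimal sketch of how a script could guard the two calls above with a version check, assuming the 2.4 minimum stated in this comment. The helper names (`parse_major_minor`, `meets_minimum`, `register_ocl`) and the `library_path` parameter are hypothetical; only the three `torch` calls come from this thread.

```python
import re


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as '2.4.0+cpu' or '1.13.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: str, minimum: tuple[int, int] = (2, 4)) -> bool:
    """True when the version is at least the minimum (2.4, per this thread)."""
    return parse_major_minor(version) >= minimum


def register_ocl(library_path: str) -> None:
    """Load the backend library and register it under the name 'ocl'."""
    import torch  # imported lazily so the helpers above work without PyTorch

    if not meets_minimum(torch.__version__):
        raise RuntimeError(f"PyTorch {torch.__version__} is older than 2.4")
    torch.ops.load_library(library_path)            # path to the built libpt_ocl library
    torch.utils.rename_privateuse1_backend("ocl")
    torch._register_device_module("ocl", object())
```

Comparing `(major, minor)` tuples rather than raw strings avoids treating "2.10" as older than "2.4".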

artyom-beilis commented 2 months ago

Closing

tangjinchuan commented 2 months ago

Thanks! I arrived in Frankfurt the day before yesterday and will be visiting TU Clausthal for a month.

Best wishes, Jinchuan


artyom-beilis commented 1 month ago

Ok documents are updated. Closing the issue