IntelLabs / matsciml

Open MatSci ML Toolkit is a framework for prototyping and scaling out deep learning models for materials discovery supporting widely used materials science datasets, and built on top of PyTorch Lightning, the Deep Graph Library, and PyTorch Geometric.
MIT License

[Feature request]: PyG installation instructions (esp. for XPUs) #166

Closed chaitjo closed 3 months ago

chaitjo commented 5 months ago

Feature/behavior summary

I'm trying to get PyG to install and work well with Intel XPUs, and was hoping to use this repository as a reference. At present, I see that PyG is not installed by default, and there are no instructions for setting it up with XPUs.

Request attributes

Related issues

No response

Solution description

Unknown.

Additional notes

At present, working with a different repository (https://github.com/a-r-j/ProteinWorkshop), I've been trying to integrate your code for the XPU as a new accelerator in PyTorch Lightning: https://github.com/IntelLabs/matsciml/blob/main/matsciml/lightning/xpu.py.

So far, I'm able to get my trainer to identify the XPU as a device, but it seems like some torch_cluster operations are not compatible with tensors stored on XPUs. I would like to perform torch_cluster operations such as knn graph creation on XPU tensors so that I can do data processing in a batched manner or on-the-fly, as opposed to on the CPU.

Here is a minimal example which fails:

import torch
import intel_extension_for_pytorch as ipex
from torch_geometric.nn import knn_graph

device = torch.device('xpu:0' if torch.xpu.is_available() else 'cpu')

x = torch.tensor([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]).to(device)
batch = torch.tensor([0, 0, 0, 0]).to(device)
edge_index = knn_graph(x, k=2, batch=batch, loop=False)

The resulting error is RuntimeError: x.device().is_cpu() INTERNAL ASSERT FAILED at "csrc/cpu/knn_cpu.cpp":12, please report a bug to PyTorch. x must be CPU tensor.

And here's a longer trace from the ProteinWorkshop codebase, which probably won't make any sense to MatSciML maintainers.

File "/home/ckj24/rds/hpc-work/envs/proteinworkshop/lib/python3.10/site-packages/torch_geometric/nn/pool/__init__.py", line 171, in knn_graph
    return torch_cluster.knn_graph(x, k, batch, loop, flow, cosine,
  File "/home/ckj24/rds/hpc-work/envs/proteinworkshop/lib/python3.10/site-packages/torch_cluster/knn.py", line 132, in knn_graph
    edge_index = knn(x, x, k if loop else k + 1, batch, batch, cosine,
  File "/home/ckj24/rds/hpc-work/envs/proteinworkshop/lib/python3.10/site-packages/torch_cluster/knn.py", line 81, in knn
    return torch.ops.torch_cluster.knn(x, y, ptr_x, ptr_y, k, cosine,
  File "/home/ckj24/rds/hpc-work/envs/proteinworkshop/lib/python3.10/site-packages/torch/_ops.py", line 692, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: x.device().is_cpu() INTERNAL ASSERT FAILED at "csrc/cpu/knn_cpu.cpp":12, please report a bug to PyTorch. x must be CPU tensor

laserkelvin commented 5 months ago

Thanks for bringing this up! That's a good point, I think we've been taking a lot of the dependencies for granted and we'll update the documentation.

Nominally, as of a few versions ago, a lot of the PyG core functionality has been upstreamed into PyTorch (e.g. torch_scatter stuff), but not everything. That means for the most part PyG by itself should work out of the box on XPUs; however, functionality that lives outside of it (torch_scatter, torch_cluster, torch_sparse) isn't supported yet. So the error you're seeing is basically that the low-level implementation of knn_graph only exists for CUDA and for CPUs, and it's expecting a tensor that resides on the latter.

I'm not 100% sure what our plans are for supporting those supplementary libraries, and so they might need to be treated on a case-by-case basis. Please reach out to me via email or Slack and we can discuss this further (even if it's not matsciml related). I'll keep this issue up still, since I agree we do need to update our PyG + XPU instructions.
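For anyone landing here with the same error, one possible stopgap for the knn_graph case is a brute-force kNN built only from core PyTorch ops (torch.cdist and topk), so it never touches the torch_cluster kernels. This is a minimal sketch, not something from PyG or matsciml: the helper name dense_knn_graph is made up for illustration, it assumes every graph in the batch has more than k nodes, it needs O(N^2) memory, and whether cdist and topk are available on a given XPU build is an assumption you would need to verify.

```python
from typing import Optional

import torch


def dense_knn_graph(
    x: torch.Tensor, k: int, batch: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Brute-force kNN graph without self-loops, using core PyTorch ops only.

    Hypothetical helper (not part of PyG or matsciml). Returns a [2, N * k]
    edge_index with row 0 = neighbor and row 1 = center node, intended to
    mirror knn_graph's default 'source_to_target' flow.
    Assumes each graph in `batch` has more than k nodes.
    """
    n = x.size(0)
    dist = torch.cdist(x, x)  # pairwise Euclidean distances, shape [N, N]
    dist.fill_diagonal_(float("inf"))  # never pick a node as its own neighbor
    if batch is not None:
        # Forbid edges between nodes that belong to different graphs.
        cross_graph = batch.unsqueeze(0) != batch.unsqueeze(1)
        dist = dist.masked_fill(cross_graph, float("inf"))
    # Indices of the k smallest distances in each row, shape [N, k].
    neighbors = dist.topk(k, dim=1, largest=False).indices
    centers = torch.arange(n, device=x.device).repeat_interleave(k)
    return torch.stack([neighbors.reshape(-1), centers], dim=0)


# Same toy input as the failing example above, left on the default device here;
# add .to(device) to try it on an XPU.
x = torch.tensor([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
batch = torch.tensor([0, 0, 0, 0])
edge_index = dense_knn_graph(x, k=2, batch=batch)
```

The quadratic distance matrix makes this practical only for small or moderately sized batches; for large point clouds the CPU kernels in torch_cluster are likely the better choice.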

chaitjo commented 5 months ago

Thanks!

What's the current recommended way to install PyG?

I'm currently using:

pip install torch_geometric
pip install torch-scatter torch-cluster

...and this seems fine unless I need some of the functions from torch-cluster to be run on tensors which are located on XPUs. PyG's docs also state, regarding torch-scatter and torch-cluster, that these packages 'come with their own CPU and GPU kernel implementations based on the PyTorch C++/CUDA/hip(ROCm) extension interface.' So I suppose there's no real fix yet for my particular use case apart from shifting my computation to the CPU.
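Shifting the computation to the CPU can be wrapped up so the rest of the pipeline keeps working with device tensors. Below is a minimal sketch of that idea: the decorator name cpu_fallback is hypothetical (it is not part of PyG, torch_cluster, or matsciml), and it simply moves tensor arguments to the CPU, calls the wrapped function, and moves a tensor result back to the device of the first tensor argument.

```python
import functools

import torch


def cpu_fallback(fn):
    """Run `fn` on CPU copies of its tensor arguments.

    Hypothetical helper for ops that only ship CPU/CUDA kernels. A tensor
    result is moved back to the device of the first tensor argument;
    non-tensor arguments and results are passed through unchanged.
    """

    def to_cpu(value):
        return value.cpu() if isinstance(value, torch.Tensor) else value

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        tensors = [a for a in (*args, *kwargs.values()) if isinstance(a, torch.Tensor)]
        device = tensors[0].device if tensors else torch.device("cpu")
        out = fn(*[to_cpu(a) for a in args], **{k: to_cpu(v) for k, v in kwargs.items()})
        return out.to(device) if isinstance(out, torch.Tensor) else out

    return wrapper


# Intended usage with the failing example (requires torch_geometric and torch_cluster):
# from torch_geometric.nn import knn_graph
# knn_graph_cpu = cpu_fallback(knn_graph)
# edge_index = knn_graph_cpu(x, k=2, batch=batch, loop=False)
```

Each call pays for a device-to-host copy and back, so this fits one-off graph construction better than anything inside a tight training loop.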

laserkelvin commented 5 months ago

Those pip commands should work. If you are super paranoid, you can tack on --no-cache-dir to make sure you're not using a cached version, and also --no-binary :all: to make sure it's built from source. If you have issues, I'd suggest you step through those :)

I've brought up torch_cluster support internally, along with some things we can potentially do, but it will require some time. I'll send you an email separately.
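After running the pip commands, it can help to confirm what actually ended up in the environment. A small sketch using only the standard library's importlib.metadata; the function name installed_versions is made up for illustration, and the package names are the ones from the pip commands above plus torch.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    names = ["torch", "torch_geometric", "torch-scatter", "torch-cluster"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

This only reports what is installed; it does not tell you whether a given package ships kernels for XPUs.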

laserkelvin commented 3 months ago

@chaitjo do you think I can close this issue?

#198 updated the README, and I think it should be pretty complete - within the bounds of the current status of broader framework support

chaitjo commented 3 months ago

Yes please.

On Wed, 29 May 2024 at 4:30 PM, Kelvin Lee wrote:

@chaitjo do you think I can close this issue?

#198 (https://github.com/IntelLabs/matsciml/pull/198) updated the README, and I think it should be pretty complete - within the bounds of the current status of broader framework support