dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

Structure S3-Hosted Wheels as PyPI Repository #7494

Closed reesehyde closed 2 months ago

reesehyde commented 3 months ago

🐛 Bug

When trying to construct a dgl.graphbolt.DataLoader in an environment that supports CUDA, the call to torch.ops.graphbolt.set_max_uva_threads() fails with an AttributeError.

To Reproduce

From the environment described below, attempt to create a Graphbolt datapipe per the Node Classification with Minibatch Sampling tutorial. Note that while the environment supports CUDA, the error is produced even when the CPU is used:

from dgl import graphbolt as gb
import torch

device = torch.device("cpu")
dataset = gb.BuiltinDataset("ogbn-arxiv-seeds").load()
datapipe = gb.ItemSampler(dataset.tasks[0].train_set, batch_size=1024, shuffle=True)
datapipe = datapipe.sample_neighbor(dataset.graph, [4, 4])
datapipe = datapipe.copy_to(device)
datapipe = datapipe.fetch_feature(dataset.feature, node_feature_keys=["feat"])
dataloader = gb.DataLoader(datapipe)

This results in:

Traceback (most recent call last):
  File "/mnt/host_home/cash-identity-offline-graph-ml/hackweek/datapipe_bug.py", line 10, in <module>
    dataloader = gb.DataLoader(datapipe)
  File "/mnt/host_home/cash-identity-offline-graph-ml/hackweek/.venv/lib/python3.10/site-packages/dgl/graphbolt/dataloader.py", line 167, in __init__
    torch.ops.graphbolt.set_max_uva_threads(max_uva_threads)
  File "/mnt/host_home/cash-identity-offline-graph-ml/hackweek/.venv/lib/python3.10/site-packages/torch/_ops.py", line 822, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'graphbolt' object has no attribute 'set_max_uva_threads'

Expected behavior

DataLoader to be created successfully

Environment

Additional context

I can confirm the graphbolt shared library is present for my PyTorch version:

$ ls .venv/lib/python3.10/site-packages/dgl/graphbolt | grep $(python -c "from torch import __version__ as torchver; print(torchver[:torchver.rfind('+')])")
libgraphbolt_pytorch_2.2.1.so

I'm not sure how to check whether PyTorch is loading it correctly or at all.

Other Versions

Relatedly, my first reaction was to try a different version of DGL and/or PyTorch. But I found that when installing from PyPI on an x86-64 Linux machine, 2.1.0 is the only v2 release available: the 2.0.0 wheel on PyPI is only published for Linux aarch64, and no Linux wheels are available at all for 2.2.0 or 2.2.1. Could the CI/CD be updated to build more Linux wheels? I'd love to contribute there if someone could point me in the right direction!

mfbalin commented 3 months ago

Since you are using the CPU as the device, you can pass overlap_feature_fetch=False to the DataLoader as a workaround.
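Applied to the repro snippet at the top of this issue, that is a one-line change. A minimal sketch, assuming the same gb import and datapipe object, and assuming that disabling the overlap is what skips the code path that calls torch.ops.graphbolt.set_max_uva_threads():

from dgl import graphbolt as gb  # same import as in the repro above

# ... build `datapipe` exactly as in the repro snippet ...

# Disable feature-fetch overlap; per the workaround above, this should
# avoid the UVA setup that currently fails.
dataloader = gb.DataLoader(datapipe, overlap_feature_fetch=False)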

mfbalin commented 3 months ago

I think the main issue is that you have probably installed the CPU version of DGL instead of the CUDA build. Can you tell us which DGL version you have installed? You can check the version reported by pip.
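For example, a quick way to check which build you have (a sketch; it relies on the CUDA wheels from data.dgl.ai carrying a +cuXXX local version suffix, while the wheels on PyPI are CPU-only):

# Print the installed dgl version as recorded in the package metadata
# (the same version that `pip show dgl` reports).
from importlib.metadata import version

print(version("dgl"))  # a bare version such as "2.1.0" suggests a CPU-only wheel;
                       # a CUDA build would look like "2.1.0+cu118"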

Rhett-Ying commented 3 months ago

@reesehyde please refer to this page for DGL installation; it is the only official page you should follow. As for pip packages, we host them ourselves on AWS S3. Only CPU versions were ever uploaded to PyPI, and we stopped uploading there starting with DGL 2.2.0. So please always fetch pip packages from AWS S3.

reesehyde commented 3 months ago

Ah apologies, the problem was indeed using the CPU version! I just had plain old 2.1.0. Thank you @mfbalin and @Rhett-Ying for the help!

I managed to install the CUDA build by downloading the correct wheel manually, but I normally have to fetch packages through a PyPI proxy. Would the team consider setting up the S3 bucket to be indexable by pip? I don't know exactly what that entails, but looking through torch's bucket setup and testing some index URLs, just hosting the existing repo.html file under a path called dgl might be sufficient? Then a pip fetch for dgl 2.3.0 with index URL https://data.dgl.ai/wheels/torch-2.3/cu118 would look for the version list (the repo.html content) at https://data.dgl.ai/wheels/torch-2.3/cu118/dgl/.

mfbalin commented 3 months ago

Maybe you can update the issue title now that we know what is going wrong.

reesehyde commented 3 months ago

Thanks @mfbalin, updated the title to reflect the new request. I read up a bit more on hosting a simple PyPI repository and it does look like simply hosting an index file at the /dgl path should do the trick!

I'd be happy to create a PR for the update if someone could point me towards the S3-publish logic. I searched around in the repo for "repo.html" and "s3" but only found the CI/CD report and log uploads.
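Concretely, the lookup I have in mind is the PEP 503 "Simple Repository" convention that pip's -i/--index-url option follows: pip requests <index-url>/<project>/ and expects an HTML page of links to the distribution files, which is essentially what repo.html already contains. A small sketch (the /dgl/ path doesn't exist yet; the URL and wheel name are just the examples from this thread):

# Sketch of the request pip would make for `pip install dgl -i <index_url>`
# under the proposed layout.
from urllib.parse import urljoin

index_url = "https://data.dgl.ai/wheels/torch-2.3/cu118/"
project_page = urljoin(index_url, "dgl/")
print(project_page)
# -> https://data.dgl.ai/wheels/torch-2.3/cu118/dgl/
# pip expects that page to contain anchors pointing at the wheels, e.g.
# <a href="dgl-2.3.0%2Bcu118-cp310-cp310-manylinux1_x86_64.whl">...</a>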

mfbalin commented 3 months ago

@Rhett-Ying What do you think? I don't know much about PyPI or pip.

Rhett-Ying commented 2 months ago

@reesehyde could you show me the use case you have in mind and what the blocker is? Why does the current install command, pip install dgl -f https://data.dgl.ai/wheels/torch-2.1/repo.html, not work for you? How would you like to install DGL? Specify it in a YAML file?

reesehyde commented 2 months ago

Thanks @Rhett-Ying, I hadn't tried that command, but you're right that it does the trick with pip. I wasn't aware of pip's -f (find-links HTML page) option as an alternative to -i (package index)! The case I had in mind was essentially using -i rather than -f, which requires a proper PyPI-style index. That could be set up by hosting the contents of /repo.html under /dgl, and we could then use pip install dgl -i https://data.dgl.ai/wheels/torch-2.3/cu118 instead of pip install dgl -f https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html.

But I'm using Poetry rather than pip, and it seems my issue is simply due to a bug in Poetry. When specifying the /repo.html page as a source URL, the result is e.g.:

403 Client Error: Forbidden for url: https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html/dgl-2.3.0%2Bcu118-cp310-cp310-manylinux1_x86_64.whl

Poetry's single page link source handling treats /repo.html as a folder and then builds the relative link from it as /repo.html/file.whl. It is supposed to support a single HTML index page, so I'll just fix the bug there. Thank you both for pointing me in the right direction!
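For reference, the expected resolution can be sketched with the standard library (the base URL and wheel name are taken from the 403 above; this only illustrates how a relative link next to repo.html should resolve, not Poetry's internals):

# A relative link on the repo.html page should replace the last path segment
# (repo.html), so the wheel resolves into the cu118/ directory.
from urllib.parse import urljoin

base = "https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html"
wheel = "dgl-2.3.0%2Bcu118-cp310-cp310-manylinux1_x86_64.whl"
print(urljoin(base, wheel))
# -> https://data.dgl.ai/wheels/torch-2.3/cu118/dgl-2.3.0%2Bcu118-cp310-cp310-manylinux1_x86_64.whl
# The 403 shows the link instead being built as .../repo.html/<wheel>,
# i.e. repo.html being treated as a directory.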