dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[GraphBolt][CUDA] Incremental GPU graph caching. #7470

Closed by mfbalin 1 week ago

mfbalin commented 2 weeks ago

Description

A follow-up PR will add the dataloader logic so that the cache can be used for faster GPU sampling.

dgl-bot commented 2 weeks ago

Commit ID: 0c0f1153d1794629431f7ef2bf83e079b88a2ce9

Build ID: 1

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

dgl-bot commented 2 weeks ago

Commit ID: 2a74efcf4471423a1dbd2fb047491697918a3631

Build ID: 2

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot commented 2 weeks ago

Commit ID: 8f170b925dbece270f353dc736263d4b3014b910

Build ID: 3

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

dgl-bot commented 2 weeks ago

Commit ID: 66c93363537009a0dbaad4106fc8ac68d18ef873

Build ID: 4

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

mfbalin commented 2 weeks ago

@frozenbugs let's land this PR and the follow-up PR before we release DGL 2.3. That way, we can say the new release has this major feature.

dgl-bot commented 2 weeks ago

Commit ID: bba8975405bf2c403818b91460b932d40e74e832

Build ID: 5

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

dgl-bot commented 2 weeks ago

Commit ID: 58302d4c3e8e666b16b02b75d33b4799a7db467f

Build ID: 6

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 2 weeks ago

Commit ID: daa9d3625907125fe07e88b17c9149d6a21980e6

Build ID: 7

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

dgl-bot commented 2 weeks ago

Commit ID: 3366909f672a832044030c7bf651d37a407ae49b

Build ID: 8

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 2 weeks ago

Commit ID: 58bfe3c52f83cd1cfbb92d95e0ff63877f646051

Build ID: 9

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 2 weeks ago

Commit ID: bade092543d4f6569e1670cce88bf88634bd2b8b

Build ID: 10

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

mfbalin commented 2 weeks ago

@dgl-bot

dgl-bot commented 2 weeks ago

Commit ID: bade092543d4f6569e1670cce88bf88634bd2b8b

Build ID: 11

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: d8fa4f01cab167e5d44989aed6a86270761617f4

Build ID: 12

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: d25147364b07945567ddb142bcb045d35775d347

Build ID: 13

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: bae7bcc9ee5520cfb7d6974a6a899692024b7368

Build ID: 14

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

mfbalin commented 1 week ago

@dgl-bot

dgl-bot commented 1 week ago

Commit ID: bae7bcc9ee5520cfb7d6974a6a899692024b7368

Build ID: 15

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

mfbalin commented 1 week ago

@dgl-bot

dgl-bot commented 1 week ago

Commit ID: bae7bcc9ee5520cfb7d6974a6a899692024b7368

Build ID: 16

Status: ❌ CI test failed in Stage [Torch CPU Example test].

Report path: link

Full logs path: link

mfbalin commented 1 week ago

@dgl-bot

dgl-bot commented 1 week ago

Commit ID: bae7bcc9ee5520cfb7d6974a6a899692024b7368

Build ID: 17

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

mfbalin commented 1 week ago

@dgl-bot

dgl-bot commented 1 week ago

Commit ID: bae7bcc9ee5520cfb7d6974a6a899692024b7368

Build ID: 18

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

mfbalin commented 1 week ago

@Rhett-Ying some test always ends up failing; the CI is too unstable.
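
To put a number on the instability, the short tally below counts the outcomes of builds 14 to 18, all of which ran against the same commit (bae7bcc9ee5520cfb7d6974a6a899692024b7368). The outcomes are transcribed by hand from the dgl-bot comments above; the variable names are illustrative only.

```python
from collections import Counter

# (build ID, outcome) pairs for commit bae7bcc..., copied from the
# dgl-bot status comments in this thread.
builds = [
    (14, "failed: Distributed Torch CPU Unit test"),
    (15, "failed: Distributed Torch CPU Unit test"),
    (16, "failed: Torch CPU Example test"),
    (17, "failed: Distributed Torch CPU Unit test"),
    (18, "succeeded"),
]

# Count how often each outcome occurred for the identical code.
tally = Counter(outcome for _, outcome in builds)
pass_rate = tally["succeeded"] / len(builds)

for outcome, count in tally.most_common():
    print(f"{count} x {outcome}")
print(f"pass rate on identical code: {pass_rate:.0%}")
```

Since the code did not change between these runs, the differing outcomes point at the tests or the CI environment rather than at this PR.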

mfbalin commented 1 week ago

@dgl-bot

dgl-bot commented 1 week ago

Commit ID: bae7bcc9ee5520cfb7d6974a6a899692024b7368

Build ID: 19

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: 48221e2f9a40c4ab57059b8246db44f417112710

Build ID: 20

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: aeeb65f7f6887790e8403e8bd151c38bc6ae4914

Build ID: 21

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: 2c4aef3f6421f7dd877b3452c777ba06e4645439

Build ID: 22

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: a81ca24b4d789c28a9587959edc698ca8bae7692

Build ID: 23

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: ed8d6b1d3c69e16503e1a8ed379649034ef536be

Build ID: 24

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: 8b58ed5ba6bb7727f3da01414b87512a2ea784ad

Build ID: 25

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: fba294d64051e40584ba8440572e95f26e100b39

Build ID: 26

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: 78eea0c056cee45c674e1e95a0a404f85236b890

Build ID: 27

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: 7ff8213343bb3c1f5633bd3cfaee0b0b43e1380e

Build ID: 28

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: b2e10fc02afddf6364f4e89e2c396db47396a645

Build ID: 29

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: bb18427377fdc00e00f21895586008591e7294bc

Build ID: 30

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: bb18427377fdc00e00f21895586008591e7294bc

Build ID: 31

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

mfbalin commented 1 week ago

@Rhett-Ying CI failure output:

tests/distributed/test_mp_dataloader.py::test_edge_dataloader_homograph[False-self-False-0] Fatal Python error: Segmentation fault

Thread 0x00007fb4b8fdf700 (most recent call first):
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 324 in wait
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 607 in wait
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007fb4e5ce6800 (most recent call first):
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/codecs.py", line 322 in decode
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7470@2/python/dgl/partition.py", line 271 in get_peak_mem
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7470@2/python/dgl/distributed/partition.py", line 921 in partition_graph
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7470@2/tests/distributed/test_mp_dataloader.py", line 770 in check_dataloader
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7470@2/tests/distributed/test_mp_dataloader.py", line 914 in test_edge_dataloader_homograph
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/python.py", line 162 in pytest_pyfunc_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/python.py", line 1627 in runtest
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 173 in pytest_runtest_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 241 in <lambda>
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 341 in from_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 240 in call_and_report
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 135 in runtestprotocol
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 116 in pytest_runtest_protocol
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 364 in pytest_runtestloop
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 339 in _main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 285 in wrap_session
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 332 in pytest_cmdline_main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/config/__init__.py", line 178 in main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/config/__init__.py", line 206 in console_main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pytest/__main__.py", line 7 in <module>
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/runpy.py", line 86 in _run_code
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/runpy.py", line 196 in _run_module_as_main

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, dgl._ffi._cy3.core, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, psutil._psutil_linux, psutil._psutil_posix, pyarrow.lib, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, 
pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._cdflib, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, yaml._yaml (total: 115)
tests/scripts/task_distributed_test.sh: line 37:   638 Segmentation fault      (core dumped) python3 -m pytest -v --capture=tee-sys --junitxml=pytest_distributed.xml --durations=100 tests/distributed/*.py
FAIL: distributed
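
For readers unfamiliar with this trace format: the "Fatal Python error: Segmentation fault" header followed by per-thread "most recent call first" stacks looks like the output of Python's standard `faulthandler` module, which pytest appears to enable by default. The sketch below is not taken from the DGL test suite; it only shows how the same kind of per-thread dump can be produced on demand. The helper name `dump_all_threads` is hypothetical.

```python
import faulthandler
import tempfile


def dump_all_threads() -> str:
    """Return a faulthandler-style stack dump of every running thread."""
    # faulthandler writes directly to the file descriptor, so a real
    # file (not an in-memory StringIO) is required.
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()


trace = dump_all_threads()
print(trace)
```

Calling `faulthandler.enable()` at startup installs handlers that emit a dump like this automatically when the process receives a fatal signal such as SIGSEGV, which is how a trace of this shape ends up in a CI log.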

Rhett-Ying commented 1 week ago

@mfbalin pls skip it. It's a known issue.

mfbalin commented 1 week ago

> @mfbalin pls skip it. It's a known issue.

I cannot skip it. I don't have permission to merge into master unless the CI passes.

dgl-bot commented 1 week ago

Commit ID: 5ed6d0cf62c826aef5316ce7bafe830c292e4b1c

Build ID: 32

Status: ❌ CI test failed in Stage [Torch CPU Example test].

Report path: link

Full logs path: link

mfbalin commented 1 week ago

Another test failure:

+ bash tests/scripts/task_example_test.sh cpu
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.2.0, pluggy-1.5.0 -- /opt/conda/envs/pytorch-ci/bin/python3
cachedir: .pytest_cache
rootdir: /home/ubuntu/jenkins/workspace/dgl_PR-7470
configfile: pyproject.toml
collecting ... collected 12 items

tests/examples/test_sampling_examples.py::test_node_classification PASSED [  8%]
tests/examples/test_sampling_examples.py::test_link_prediction PASSED    [ 16%]
tests/examples/test_sparse_examples.py::test_gcn PASSED                  [ 25%]
tests/examples/test_sparse_examples.py::test_gcnii PASSED                [ 33%]
tests/examples/test_sparse_examples.py::test_appnp PASSED                [ 41%]
tests/examples/test_sparse_examples.py::test_c_and_s PASSED              [ 50%]
tests/examples/test_sparse_examples.py::test_gat PASSED                  [ 58%]
tests/examples/test_sparse_examples.py::test_hgnn PASSED                 [ 66%]
tests/examples/test_sparse_examples.py::test_hypergraphatt PASSED        [ 75%]
tests/examples/test_sparse_examples.py::test_sgc PASSED                  [ 83%]
tests/examples/test_sparse_examples.py::test_sign FAILED                 [ 91%]
tests/examples/test_sparse_examples.py::test_twirls PASSED               [100%]

=================================== FAILURES ===================================
__________________________________ test_sign ___________________________________

    def test_sign():
        script = os.path.join(EXAMPLE_ROOT, "sign.py")
        out = subprocess.run(["python", str(script)], capture_output=True)
        assert (
            out.returncode == 0
        ), f"stdout: {out.stdout.decode('utf-8')}\nstderr: {out.stderr.decode('utf-8')}"
        stdout = out.stdout.decode("utf-8")
>       assert float(stdout[-5:]) > 0.7
E       AssertionError: assert 0.697 > 0.7
E        +  where 0.697 = float('.697\n')

tests/examples/test_sparse_examples.py:101: AssertionError
- generated xml file: /home/ubuntu/jenkins/workspace/dgl_PR-7470/pytest_backend.xml -
============================ slowest 100 durations =============================
34.74s call     tests/examples/test_sparse_examples.py::test_twirls
21.53s call     tests/examples/test_sparse_examples.py::test_gcnii
9.20s call     tests/examples/test_sparse_examples.py::test_hypergraphatt
8.06s call     tests/examples/test_sampling_examples.py::test_node_classification
7.80s call     tests/examples/test_sampling_examples.py::test_link_prediction
6.16s call     tests/examples/test_sparse_examples.py::test_hgnn
3.27s call     tests/examples/test_sparse_examples.py::test_appnp
3.22s call     tests/examples/test_sparse_examples.py::test_gat
3.18s call     tests/examples/test_sparse_examples.py::test_gcn
2.82s call     tests/examples/test_sparse_examples.py::test_sign
2.61s call     tests/examples/test_sparse_examples.py::test_sgc
2.55s call     tests/examples/test_sparse_examples.py::test_c_and_s

(24 durations < 0.005s hidden.  Use -vv to show these durations.)
=========================== short test summary info ============================
FAILED tests/examples/test_sparse_examples.py::test_sign - AssertionError: assert 0.697 > 0.7
 +  where 0.697 = float('.697\n')
=================== 1 failed, 11 passed in 105.21s (0:01:45) ===================
FAIL: sparse examples on cpu
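
To make the failing assertion concrete: the test reads the accuracy from the last five characters of the script's output and compares it with 0.7. The sketch below replays that step. Only the trailing `.697\n` is confirmed by the log above; the rest of the sample output lines, and the helper `parse_final_accuracy`, are hypothetical and are not part of the DGL test suite.

```python
import re


def parse_final_accuracy(stdout: str) -> float:
    """Return the last decimal number that appears in stdout."""
    matches = re.findall(r"\d*\.\d+", stdout)
    if not matches:
        raise ValueError("no decimal value found in output")
    return float(matches[-1])


# Hypothetical final output line; only the trailing '.697\n' is in the log.
sample_stdout = "Test acc: 0.697\n"

# What the test does: slice the last five characters and convert them.
sliced = float(sample_stdout[-5:])  # parses '.697\n'
parsed = parse_final_accuracy(sample_stdout)
print(sliced, parsed, parsed > 0.7)

# The fixed-width slice depends on exactly three decimals being printed.
# With a hypothetical four-decimal output it drops the decimal point.
four_decimals = "Test acc: 0.6971\n"
misread = float(four_decimals[-5:])  # parses '6971\n'
print(misread, parse_final_accuracy(four_decimals))
```

In this run the value was read correctly and missed the threshold by 0.003, so the failure looks like run-to-run variation in the example's accuracy rather than a parsing problem, though the log alone cannot confirm the cause.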

Rhett-Ying commented 1 week ago

I've just made a change on CI and rebased this PR. Let's see if it works well now. If not, I will merge it if other CI tests pass.

mfbalin commented 1 week ago

> I've just made a change on CI and rebased this PR. Let's see if it works well now. If not, I will merge it if other CI tests pass.

The CI already passed (https://github.com/dmlc/dgl/pull/7470#issuecomment-2190496941). The only change I made after that was to improve a comment in the code. I would appreciate it if you merged it now.