dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0
[GraphBolt] Add option to use `DiskBasedFeature` in `OnDiskDataset.load()` #7541

Closed: mfbalin closed this 1 month ago

mfbalin commented 1 month ago

Description

This PR also updates the example so that the newly added feature can be run.

Tested with

python examples/graphbolt/pyg/labor/node_classification.py --num-gpu-cached-features=1000000 --num-cpu-cached-features=40000000 --batch-dependency=128 --cpu-feature-cache-policy=s3-fifo --dataset=ogbn-papers100M

Currently also testing with:

python examples/graphbolt/pyg/labor/node_classification.py --num-gpu-cached-features=1000000 --num-cpu-cached-features=40000000 --sample-mode=sample_neighbor --cpu-feature-cache-policy=s3-fifo --dataset=ogbn-papers100M

root@a100cse:/localscratch/dgl-3/examples/graphbolt/pyg/labor# python node_classification.py --num-gpu-cached-features=1000000 --num-cpu-cached-features=40000000 --sample-mode=sample_neighbor --cpu-feature-cache-policy=s3-fifo --dataset=ogbn-papers100M
Training in pinned-pinned-cuda mode.
Loading data...
The dataset is already preprocessed.
/localscratch/dgl-3/python/dgl/graphbolt/impl/torch_based_feature_store.py:523: GBWarning: `DiskBasedFeature.pin_memory_()` is not supported. Leaving unmodified.
  gb_warning(
Training: 1178it [00:51, 22.88it/s, num_nodes=664303, gpu_cache_miss=0.74, cpu_cache_miss=0.0601]
Evaluating: 123it [00:06, 19.13it/s, num_nodes=247768, gpu_cache_miss=0.742, cpu_cache_miss=0.0566]
Epoch 00, Loss: 1.8027, Approx. Train: 0.4881, Approx. Val: 0.5606, Time: 51.47872185707092s
Training: 1178it [00:27, 43.31it/s, num_nodes=671640, gpu_cache_miss=0.741, cpu_cache_miss=0.0352]
Evaluating: 123it [00:02, 46.13it/s, num_nodes=247890, gpu_cache_miss=0.742, cpu_cache_miss=0.0342]
Epoch 01, Loss: 1.5694, Approx. Train: 0.5406, Approx. Val: 0.6130, Time: 27.1996328830719s
Training: 1178it [00:27, 43.13it/s, num_nodes=650851, gpu_cache_miss=0.741, cpu_cache_miss=0.0265]
Evaluating: 123it [00:02, 46.58it/s, num_nodes=247941, gpu_cache_miss=0.742, cpu_cache_miss=0.0261]
Epoch 02, Loss: 1.4664, Approx. Train: 0.5640, Approx. Val: 0.6248, Time: 27.315482139587402s
Training: 1178it [00:28, 42.04it/s, num_nodes=672185, gpu_cache_miss=0.741, cpu_cache_miss=0.0222]
Evaluating: 123it [00:02, 46.28it/s, num_nodes=245190, gpu_cache_miss=0.742, cpu_cache_miss=0.022]
Epoch 03, Loss: 1.4041, Approx. Train: 0.5785, Approx. Val: 0.6396, Time: 28.020687580108643s
Training: 1178it [00:26, 44.20it/s, num_nodes=662923, gpu_cache_miss=0.741, cpu_cache_miss=0.0196]
Evaluating: 123it [00:02, 47.08it/s, num_nodes=246961, gpu_cache_miss=0.742, cpu_cache_miss=0.0195]
Epoch 04, Loss: 1.3608, Approx. Train: 0.5889, Approx. Val: 0.6508, Time: 26.64904808998108s
Training: 1178it [00:27, 42.87it/s, num_nodes=660661, gpu_cache_miss=0.741, cpu_cache_miss=0.0178]
Evaluating: 123it [00:02, 45.50it/s, num_nodes=248101, gpu_cache_miss=0.742, cpu_cache_miss=0.0178]
Epoch 05, Loss: 1.3287, Approx. Train: 0.5967, Approx. Val: 0.6551, Time: 27.479353427886963s
Training: 1178it [00:27, 42.36it/s, num_nodes=682160, gpu_cache_miss=0.741, cpu_cache_miss=0.0166]
Evaluating: 123it [00:02, 47.29it/s, num_nodes=248931, gpu_cache_miss=0.742, cpu_cache_miss=0.0165]
Epoch 06, Loss: 1.3030, Approx. Train: 0.6031, Approx. Val: 0.6461, Time: 27.811164379119873s
Training: 1178it [00:26, 43.87it/s, num_nodes=672394, gpu_cache_miss=0.741, cpu_cache_miss=0.0156]
Evaluating: 123it [00:03, 38.69it/s, num_nodes=247806, gpu_cache_miss=0.742, cpu_cache_miss=0.0156]
Epoch 07, Loss: 1.2823, Approx. Train: 0.6083, Approx. Val: 0.6531, Time: 26.850144624710083s
Training: 1178it [00:27, 42.81it/s, num_nodes=657184, gpu_cache_miss=0.741, cpu_cache_miss=0.0149]
Evaluating: 123it [00:02, 46.64it/s, num_nodes=248337, gpu_cache_miss=0.742, cpu_cache_miss=0.0148]
Epoch 08, Loss: 1.2648, Approx. Train: 0.6127, Approx. Val: 0.6615, Time: 27.518917083740234s
Training: 1178it [00:26, 44.40it/s, num_nodes=664141, gpu_cache_miss=0.741, cpu_cache_miss=0.0143]
Evaluating: 123it [00:02, 46.06it/s, num_nodes=246160, gpu_cache_miss=0.742, cpu_cache_miss=0.0143]
Epoch 09, Loss: 1.2497, Approx. Train: 0.6165, Approx. Val: 0.6743, Time: 26.533616542816162s
Training: 1178it [00:27, 42.55it/s, num_nodes=664595, gpu_cache_miss=0.741, cpu_cache_miss=0.0138]
Evaluating: 123it [00:02, 47.20it/s, num_nodes=245216, gpu_cache_miss=0.742, cpu_cache_miss=0.0138]
Epoch 10, Loss: 1.2368, Approx. Train: 0.6198, Approx. Val: 0.6746, Time: 27.68705940246582s
Training: 1178it [00:27, 42.68it/s, num_nodes=673480, gpu_cache_miss=0.742, cpu_cache_miss=0.0134]
Evaluating: 123it [00:02, 47.17it/s, num_nodes=247561, gpu_cache_miss=0.742, cpu_cache_miss=0.0134]
Epoch 11, Loss: 1.2255, Approx. Train: 0.6227, Approx. Val: 0.6753, Time: 27.599354028701782s
Training: 1178it [00:26, 44.55it/s, num_nodes=669096, gpu_cache_miss=0.742, cpu_cache_miss=0.013]
Evaluating: 123it [00:02, 47.01it/s, num_nodes=247495, gpu_cache_miss=0.742, cpu_cache_miss=0.013]
Epoch 12, Loss: 1.2154, Approx. Train: 0.6252, Approx. Val: 0.6706, Time: 26.443865537643433s
Training: 1178it [00:27, 42.60it/s, num_nodes=679177, gpu_cache_miss=0.742, cpu_cache_miss=0.0127]
Evaluating: 123it [00:02, 45.81it/s, num_nodes=246314, gpu_cache_miss=0.742, cpu_cache_miss=0.0127]
Epoch 13, Loss: 1.2063, Approx. Train: 0.6276, Approx. Val: 0.6777, Time: 27.651339054107666s
Training: 1178it [00:27, 43.35it/s, num_nodes=653335, gpu_cache_miss=0.742, cpu_cache_miss=0.0125]
Evaluating: 123it [00:02, 47.32it/s, num_nodes=247131, gpu_cache_miss=0.742, cpu_cache_miss=0.0125]
Epoch 14, Loss: 1.1980, Approx. Train: 0.6297, Approx. Val: 0.6775, Time: 27.172462224960327s
Training: 1178it [00:28, 40.85it/s, num_nodes=649114, gpu_cache_miss=0.742, cpu_cache_miss=0.0123]
Evaluating: 123it [00:02, 47.36it/s, num_nodes=246728, gpu_cache_miss=0.742, cpu_cache_miss=0.0123]
Epoch 15, Loss: 1.1906, Approx. Train: 0.6316, Approx. Val: 0.6740, Time: 28.84046173095703s
Training: 1178it [00:27, 42.88it/s, num_nodes=663281, gpu_cache_miss=0.742, cpu_cache_miss=0.0121]
Evaluating: 123it [00:02, 47.38it/s, num_nodes=247175, gpu_cache_miss=0.742, cpu_cache_miss=0.0121]
Epoch 16, Loss: 1.1838, Approx. Train: 0.6333, Approx. Val: 0.6712, Time: 27.473856687545776s
Training: 1178it [00:27, 43.50it/s, num_nodes=668569, gpu_cache_miss=0.742, cpu_cache_miss=0.0119]
Evaluating: 123it [00:02, 47.28it/s, num_nodes=246266, gpu_cache_miss=0.742, cpu_cache_miss=0.0119]
Epoch 17, Loss: 1.1774, Approx. Train: 0.6350, Approx. Val: 0.6774, Time: 27.08340549468994s
Training: 1178it [00:26, 44.40it/s, num_nodes=661165, gpu_cache_miss=0.742, cpu_cache_miss=0.0117]
Evaluating: 123it [00:02, 47.12it/s, num_nodes=247866, gpu_cache_miss=0.742, cpu_cache_miss=0.0117]
Epoch 18, Loss: 1.1716, Approx. Train: 0.6365, Approx. Val: 0.6777, Time: 26.530193567276s
Training: 1178it [00:27, 42.76it/s, num_nodes=661945, gpu_cache_miss=0.742, cpu_cache_miss=0.0116]
Evaluating: 123it [00:02, 47.55it/s, num_nodes=244619, gpu_cache_miss=0.742, cpu_cache_miss=0.0116]
Epoch 19, Loss: 1.1662, Approx. Train: 0.6379, Approx. Val: 0.6811, Time: 27.5500590801239s
Training: 1178it [00:26, 43.81it/s, num_nodes=668503, gpu_cache_miss=0.742, cpu_cache_miss=0.0115]
Evaluating: 123it [00:02, 47.48it/s, num_nodes=246921, gpu_cache_miss=0.742, cpu_cache_miss=0.0115]
Epoch 20, Loss: 1.1611, Approx. Train: 0.6392, Approx. Val: 0.6821, Time: 26.8914053440094s
Training: 1178it [00:27, 42.78it/s, num_nodes=652926, gpu_cache_miss=0.742, cpu_cache_miss=0.0113]
Evaluating: 123it [00:02, 46.06it/s, num_nodes=246958, gpu_cache_miss=0.742, cpu_cache_miss=0.0114]
Epoch 21, Loss: 1.1563, Approx. Train: 0.6405, Approx. Val: 0.6799, Time: 27.536324739456177s
Training: 1178it [00:27, 43.24it/s, num_nodes=678499, gpu_cache_miss=0.742, cpu_cache_miss=0.0112]
Evaluating: 123it [00:02, 47.46it/s, num_nodes=246747, gpu_cache_miss=0.742, cpu_cache_miss=0.0112]
Epoch 22, Loss: 1.1519, Approx. Train: 0.6416, Approx. Val: 0.6856, Time: 27.241077184677124s
Training: 1178it [00:27, 43.56it/s, num_nodes=672612, gpu_cache_miss=0.742, cpu_cache_miss=0.0111]
Evaluating: 123it [00:02, 46.31it/s, num_nodes=246751, gpu_cache_miss=0.742, cpu_cache_miss=0.0111]
Epoch 23, Loss: 1.1477, Approx. Train: 0.6427, Approx. Val: 0.6832, Time: 27.046754598617554s
Training: 1178it [00:26, 44.11it/s, num_nodes=639682, gpu_cache_miss=0.742, cpu_cache_miss=0.011]
Evaluating: 123it [00:03, 39.93it/s, num_nodes=247373, gpu_cache_miss=0.742, cpu_cache_miss=0.0111]
Epoch 24, Loss: 1.1438, Approx. Train: 0.6437, Approx. Val: 0.6813, Time: 26.70728063583374s
Training: 1178it [00:27, 42.98it/s, num_nodes=667361, gpu_cache_miss=0.742, cpu_cache_miss=0.011]
Evaluating: 123it [00:02, 47.60it/s, num_nodes=247172, gpu_cache_miss=0.742, cpu_cache_miss=0.011]
Epoch 25, Loss: 1.1401, Approx. Train: 0.6447, Approx. Val: 0.6780, Time: 27.40790557861328s
Training: 1178it [00:26, 44.15it/s, num_nodes=657206, gpu_cache_miss=0.742, cpu_cache_miss=0.0109]
Evaluating: 123it [00:02, 46.82it/s, num_nodes=246593, gpu_cache_miss=0.742, cpu_cache_miss=0.0109]
Epoch 26, Loss: 1.1365, Approx. Train: 0.6456, Approx. Val: 0.6858, Time: 26.682600498199463s
Training: 1178it [00:27, 42.98it/s, num_nodes=662702, gpu_cache_miss=0.742, cpu_cache_miss=0.0108]
Evaluating: 123it [00:02, 47.51it/s, num_nodes=247157, gpu_cache_miss=0.742, cpu_cache_miss=0.0108]
Epoch 27, Loss: 1.1332, Approx. Train: 0.6464, Approx. Val: 0.6811, Time: 27.411181926727295s
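
Per-epoch numbers from a log like the one above are easier to compare across runs once the summary lines are extracted. The following is a minimal sketch, assuming the `Epoch NN, Loss: ..., Approx. Train: ..., Approx. Val: ..., Time: ...s` layout printed above. The name `parse_epochs` and the dictionary keys are hypothetical and are not part of the example script.

```python
import re

# Matches one per-epoch summary line. Whitespace between fields is matched
# loosely so that a line wrapped in the middle (e.g. after "Approx.") still parses.
EPOCH_RE = re.compile(
    r"Epoch (\d+),\s+Loss: ([\d.]+),\s+Approx\.\s+Train: ([\d.]+),"
    r"\s+Approx\.\s+Val: ([\d.]+),\s+Time: ([\d.]+)s"
)


def parse_epochs(log_text):
    """Return one dict per epoch summary found in log_text (hypothetical helper)."""
    rows = []
    for match in EPOCH_RE.finditer(log_text):
        epoch, loss, train_acc, val_acc, seconds = match.groups()
        rows.append(
            {
                "epoch": int(epoch),
                "loss": float(loss),
                "train_acc": float(train_acc),
                "val_acc": float(val_acc),
                "seconds": float(seconds),
            }
        )
    return rows


# Two summary lines copied from the log above.
sample = (
    "Epoch 00, Loss: 1.8027, Approx. Train: 0.4881, Approx. Val: 0.5606, "
    "Time: 51.47872185707092s "
    "Epoch 01, Loss: 1.5694, Approx. Train: 0.5406, Approx. Val: 0.6130, "
    "Time: 27.1996328830719s"
)
rows = parse_epochs(sample)
best = max(rows, key=lambda row: row["val_acc"])
print(best["epoch"], best["val_acc"])
```

The progress-bar lines (`Training: ...`, `Evaluating: ...`) are ignored by this pattern; a second regex would be needed to pull out `gpu_cache_miss` and `cpu_cache_miss`.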

dgl-bot commented 1 month ago

Commit ID: b290b749d3c04a41836bafcbd7b0aa5baf6106df

Build ID: 1

Status: ❌ CI test failed in Stage [Lint Check].

dgl-bot commented 1 month ago

Commit ID: a74b4ac8c83f04df5be13263f3b17abdd42c946c

Build ID: 2

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: 29e65bd40ab16c80bd790bc9dc99f90e060f2a70

Build ID: 3

Status: ❌ CI test failed in Stage [Lint Check].

dgl-bot commented 1 month ago

Commit ID: 7afffbc837c0d954e3995f9771176bff76f5cdaf

Build ID: 4

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: 197f11d8bc576df2d60d713e7f8d4bbb2909549a

Build ID: 5

Status: ✅ CI test succeeded.

dgl-bot commented 1 month ago

Commit ID: 6ad89b525b4f941ad8dd14713188c5a53e984a03

Build ID: 6

Status: ✅ CI test succeeded.

dgl-bot commented 1 month ago

Commit ID: 2bb427d4e7cdfeb3c5099d10bed597c18ae2e577

Build ID: 7

Status: ✅ CI test succeeded.

mfbalin commented 1 month ago

@frozenbugs We need to let users use DiskBasedFeature if we want them to test the MAG240M example with the GPUCachedFeature -> CPUCachedFeature -> DiskBasedFeature hierarchy. It would be nice to see how it performs in our R-GCN example. @Rhett-Ying
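
The GPUCachedFeature -> CPUCachedFeature -> DiskBasedFeature chain mentioned above can be pictured as a read-through cache hierarchy, where each tier serves what it holds and forwards misses to the slower tier below. The following is a minimal pure-Python sketch of that idea and is not the GraphBolt implementation: the class names `DictStore` and `FifoCachedStore` are hypothetical, plain dictionaries stand in for GPU memory, CPU memory, and disk, and eviction is simple FIFO rather than the `s3-fifo` policy used in the commands above.

```python
from collections import OrderedDict


class DictStore:
    """Bottom tier: stands in for features stored on disk (hypothetical)."""

    def __init__(self, data):
        self.data = data
        self.reads = 0  # number of rows fetched from this tier

    def read(self, ids):
        self.reads += len(ids)
        return [self.data[i] for i in ids]


class FifoCachedStore:
    """Read-through cache in front of a slower store (hypothetical).

    Uses plain FIFO eviction for brevity; this is not S3-FIFO.
    """

    def __init__(self, fallback, capacity):
        self.fallback = fallback
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, ids):
        # Collect cached rows first so later evictions cannot drop a row we need.
        found = {i: self.cache[i] for i in ids if i in self.cache}
        hits = sum(1 for i in ids if i in found)
        self.hits += hits
        self.misses += len(ids) - hits
        # Fetch each missing id once from the slower tier, preserving order.
        missing = [i for i in dict.fromkeys(ids) if i not in found]
        if missing:
            for i, row in zip(missing, self.fallback.read(missing)):
                found[i] = row
                self._insert(i, row)
        return [found[i] for i in ids]

    def _insert(self, key, row):
        if self.capacity <= 0:
            return
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict the oldest inserted entry
        self.cache[key] = row

    @property
    def miss_rate(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


# Three tiers: a small "GPU" cache over a larger "CPU" cache over "disk".
disk = DictStore({i: [float(i)] * 2 for i in range(10)})
cpu = FifoCachedStore(disk, capacity=4)
gpu = FifoCachedStore(cpu, capacity=2)

first = gpu.read([0, 1, 2, 3])
second = gpu.read([2, 3, 0, 1])
print(gpu.miss_rate, cpu.miss_rate, disk.reads)
```

In this toy run the second request is served partly by the small top tier and partly by the middle tier, so the bottom tier is not read again. That matches the pattern in the log above, where the `cpu_cache_miss` rate is much lower than the `gpu_cache_miss` rate.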

dgl-bot commented 1 month ago

Commit ID: d916c5f05ffff507a54ba353e62036d75e2af522

Build ID: 8

Status: ✅ CI test succeeded.

dgl-bot commented 1 month ago

Commit ID: cd476d98465a8aba895906abc58bcfb107e68a27

Build ID: 9

Status: ✅ CI test succeeded.

dgl-bot commented 1 month ago

Commit ID: 6cbc85f5a6e97a2dcbd71c224422ab0cb8612843

Build ID: 10

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: 14a34c2ef69d3d35ff43f1d3ff25841e4653abbb

Build ID: 11

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: c715a724555533b6462269110f5b367b557a8823

Build ID: 12

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: 4dc5706e89587f452b0cd4daa8dae7bc16671035

Build ID: 13

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: 24e9247819c872736dfdb7edd55952eb1d4085aa

Build ID: 14

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: a5c2b842f0c3156d51a3c9dc5c3f9d3f3c02f64c

Build ID: 15

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: edbad67c1da456b82090a5b8509694e7c4035b53

Build ID: 16

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: ca7d89966513d7a099f08de7d8df7c72f0975c87

Build ID: 17

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: 3ccacbb93bbfb04005d7d3fd1f7b452743ae3fe2

Build ID: 18

Status: ✅ CI test succeeded.

dgl-bot commented 1 month ago

Commit ID: 3ecb2e491696ec13b75ef24bf52133626d2d7974

Build ID: 19

Status: ✅ CI test succeeded.

dgl-bot commented 1 month ago

Commit ID: d98ce2bfe305f5326502f48e0c72fb744654f720

Build ID: 20

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: 0f965ac93a6c230768b9971e2be18e5e6eab21d0

Build ID: 21

Status: ✅ CI test succeeded.

dgl-bot commented 1 month ago

Commit ID: a55be90a72d5fca228c77e5c352f699fab167944

Build ID: 22

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: 8e724d01a531a1d38aafd0319bcdd344651f56e4

Build ID: 23

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

dgl-bot commented 1 month ago

Commit ID: a4ac4d264b6f4fe3afae9cff9e3e4caad2b38066

Build ID: 24

Status: ✅ CI test succeeded.

dgl-bot commented 1 month ago

Commit ID: 64a234e7b925562b47a308b8b996ad5b567d69b3

Build ID: 25

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: d839f9d13fd7dadd19650a7a8c252173dc6cc360

Build ID: 26

Status: ⚪️ CI test cancelled due to overrun.

dgl-bot commented 1 month ago

Commit ID: 5120493dbe6f53a84a547ceda6222f94f84f9d55

Build ID: 27

Status: ✅ CI test succeeded.
