Closed: mfbalin closed this 1 month ago
To trigger regression tests:
@dgl-bot run [instance-type] [which tests] [compare-with-branch]
For example: @dgl-bot run g4dn.4xlarge all dmlc/master
or @dgl-bot run c5.9xlarge kernel,api dmlc/master
@frozenbugs We need to let users access DiskBasedFeature if we want them to test the MAG240M example with the GPUCachedFeature->CPUCachedFeature->DiskBasedFeature hierarchy. It would also be nice to see how it performs in our R-GCN example. @Rhett-Ying
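For context, the hierarchy above chains three feature stores so that each cache level serves hits locally and falls back to the next tier on a miss, caching the result on the way back up. The pure-Python sketch below shows only that lookup pattern; it is not the GraphBolt API, and the class name, `read` method, and naive FIFO eviction are illustrative stand-ins.

```python
class TieredFeature:
    """Illustrative fallback lookup, loosely mirroring the
    GPUCachedFeature -> CPUCachedFeature -> DiskBasedFeature chain.
    NOT the dgl.graphbolt API; names here are made up for the sketch."""

    def __init__(self, cache_size, fallback):
        self.cache = {}           # stands in for this tier's cache storage
        self.cache_size = cache_size
        self.fallback = fallback  # next tier down (TieredFeature or dict)

    def read(self, key):
        if key in self.cache:
            return self.cache[key]              # hit at this tier
        value = (self.fallback.read(key)
                 if isinstance(self.fallback, TieredFeature)
                 else self.fallback[key])       # miss: go one tier down
        if len(self.cache) >= self.cache_size:
            self.cache.pop(next(iter(self.cache)))  # naive FIFO eviction
        self.cache[key] = value                 # cache on the way back up
        return value


# disk tier as a plain dict, CPU cache on top, GPU cache on top of that
disk = {i: i * 10 for i in range(100)}
cpu = TieredFeature(cache_size=8, fallback=disk)
gpu = TieredFeature(cache_size=2, fallback=cpu)
print(gpu.read(3))  # 30: served from disk, now cached at both tiers
print(gpu.read(3))  # 30: served from the top-level cache
```

The point of the chain is that the hot working set stays in the small fast tier, the warm set in the larger CPU tier, and only cold reads hit disk, which matches the miss rates shown in the logs below.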
Description
This PR also updates the example so that the newly added feature can be exercised.
Tested with
Also testing with right now
```
root@a100cse:/localscratch/dgl-3/examples/graphbolt/pyg/labor# python node_classification.py --num-gpu-cached-features=1000000 --num-cpu-cached-features=40000000 --sample-mode=sample_neighbor --cpu-feature-cache-policy=s3-fifo --dataset=ogbn-papers100M
Training in pinned-pinned-cuda mode.
Loading data...
The dataset is already preprocessed.
/localscratch/dgl-3/python/dgl/graphbolt/impl/torch_based_feature_store.py:523: GBWarning: `DiskBasedFeature.pin_memory_()` is not supported. Leaving unmodified.
  gb_warning(
Training: 1178it [00:51, 22.88it/s, num_nodes=664303, gpu_cache_miss=0.74, cpu_cache_miss=0.0601]
Evaluating: 123it [00:06, 19.13it/s, num_nodes=247768, gpu_cache_miss=0.742, cpu_cache_miss=0.0566]
Epoch 00, Loss: 1.8027, Approx. Train: 0.4881, Approx. Val: 0.5606, Time: 51.47872185707092s
Training: 1178it [00:27, 43.31it/s, num_nodes=671640, gpu_cache_miss=0.741, cpu_cache_miss=0.0352]
Evaluating: 123it [00:02, 46.13it/s, num_nodes=247890, gpu_cache_miss=0.742, cpu_cache_miss=0.0342]
Epoch 01, Loss: 1.5694, Approx. Train: 0.5406, Approx. Val: 0.6130, Time: 27.1996328830719s
Training: 1178it [00:27, 43.13it/s, num_nodes=650851, gpu_cache_miss=0.741, cpu_cache_miss=0.0265]
Evaluating: 123it [00:02, 46.58it/s, num_nodes=247941, gpu_cache_miss=0.742, cpu_cache_miss=0.0261]
Epoch 02, Loss: 1.4664, Approx. Train: 0.5640, Approx. Val: 0.6248, Time: 27.315482139587402s
Training: 1178it [00:28, 42.04it/s, num_nodes=672185, gpu_cache_miss=0.741, cpu_cache_miss=0.0222]
Evaluating: 123it [00:02, 46.28it/s, num_nodes=245190, gpu_cache_miss=0.742, cpu_cache_miss=0.022]
Epoch 03, Loss: 1.4041, Approx. Train: 0.5785, Approx. Val: 0.6396, Time: 28.020687580108643s
Training: 1178it [00:26, 44.20it/s, num_nodes=662923, gpu_cache_miss=0.741, cpu_cache_miss=0.0196]
Evaluating: 123it [00:02, 47.08it/s, num_nodes=246961, gpu_cache_miss=0.742, cpu_cache_miss=0.0195]
Epoch 04, Loss: 1.3608, Approx. Train: 0.5889, Approx. Val: 0.6508, Time: 26.64904808998108s
Training: 1178it [00:27, 42.87it/s, num_nodes=660661, gpu_cache_miss=0.741, cpu_cache_miss=0.0178]
Evaluating: 123it [00:02, 45.50it/s, num_nodes=248101, gpu_cache_miss=0.742, cpu_cache_miss=0.0178]
Epoch 05, Loss: 1.3287, Approx. Train: 0.5967, Approx. Val: 0.6551, Time: 27.479353427886963s
Training: 1178it [00:27, 42.36it/s, num_nodes=682160, gpu_cache_miss=0.741, cpu_cache_miss=0.0166]
Evaluating: 123it [00:02, 47.29it/s, num_nodes=248931, gpu_cache_miss=0.742, cpu_cache_miss=0.0165]
Epoch 06, Loss: 1.3030, Approx. Train: 0.6031, Approx. Val: 0.6461, Time: 27.811164379119873s
Training: 1178it [00:26, 43.87it/s, num_nodes=672394, gpu_cache_miss=0.741, cpu_cache_miss=0.0156]
Evaluating: 123it [00:03, 38.69it/s, num_nodes=247806, gpu_cache_miss=0.742, cpu_cache_miss=0.0156]
Epoch 07, Loss: 1.2823, Approx. Train: 0.6083, Approx. Val: 0.6531, Time: 26.850144624710083s
Training: 1178it [00:27, 42.81it/s, num_nodes=657184, gpu_cache_miss=0.741, cpu_cache_miss=0.0149]
Evaluating: 123it [00:02, 46.64it/s, num_nodes=248337, gpu_cache_miss=0.742, cpu_cache_miss=0.0148]
Epoch 08, Loss: 1.2648, Approx. Train: 0.6127, Approx. Val: 0.6615, Time: 27.518917083740234s
Training: 1178it [00:26, 44.40it/s, num_nodes=664141, gpu_cache_miss=0.741, cpu_cache_miss=0.0143]
Evaluating: 123it [00:02, 46.06it/s, num_nodes=246160, gpu_cache_miss=0.742, cpu_cache_miss=0.0143]
Epoch 09, Loss: 1.2497, Approx. Train: 0.6165, Approx. Val: 0.6743, Time: 26.533616542816162s
Training: 1178it [00:27, 42.55it/s, num_nodes=664595, gpu_cache_miss=0.741, cpu_cache_miss=0.0138]
Evaluating: 123it [00:02, 47.20it/s, num_nodes=245216, gpu_cache_miss=0.742, cpu_cache_miss=0.0138]
Epoch 10, Loss: 1.2368, Approx. Train: 0.6198, Approx. Val: 0.6746, Time: 27.68705940246582s
Training: 1178it [00:27, 42.68it/s, num_nodes=673480, gpu_cache_miss=0.742, cpu_cache_miss=0.0134]
Evaluating: 123it [00:02, 47.17it/s, num_nodes=247561, gpu_cache_miss=0.742, cpu_cache_miss=0.0134]
Epoch 11, Loss: 1.2255, Approx. Train: 0.6227, Approx. Val: 0.6753, Time: 27.599354028701782s
Training: 1178it [00:26, 44.55it/s, num_nodes=669096, gpu_cache_miss=0.742, cpu_cache_miss=0.013]
Evaluating: 123it [00:02, 47.01it/s, num_nodes=247495, gpu_cache_miss=0.742, cpu_cache_miss=0.013]
Epoch 12, Loss: 1.2154, Approx. Train: 0.6252, Approx. Val: 0.6706, Time: 26.443865537643433s
Training: 1178it [00:27, 42.60it/s, num_nodes=679177, gpu_cache_miss=0.742, cpu_cache_miss=0.0127]
Evaluating: 123it [00:02, 45.81it/s, num_nodes=246314, gpu_cache_miss=0.742, cpu_cache_miss=0.0127]
Epoch 13, Loss: 1.2063, Approx. Train: 0.6276, Approx. Val: 0.6777, Time: 27.651339054107666s
Training: 1178it [00:27, 43.35it/s, num_nodes=653335, gpu_cache_miss=0.742, cpu_cache_miss=0.0125]
Evaluating: 123it [00:02, 47.32it/s, num_nodes=247131, gpu_cache_miss=0.742, cpu_cache_miss=0.0125]
Epoch 14, Loss: 1.1980, Approx. Train: 0.6297, Approx. Val: 0.6775, Time: 27.172462224960327s
Training: 1178it [00:28, 40.85it/s, num_nodes=649114, gpu_cache_miss=0.742, cpu_cache_miss=0.0123]
Evaluating: 123it [00:02, 47.36it/s, num_nodes=246728, gpu_cache_miss=0.742, cpu_cache_miss=0.0123]
Epoch 15, Loss: 1.1906, Approx. Train: 0.6316, Approx. Val: 0.6740, Time: 28.84046173095703s
Training: 1178it [00:27, 42.88it/s, num_nodes=663281, gpu_cache_miss=0.742, cpu_cache_miss=0.0121]
Evaluating: 123it [00:02, 47.38it/s, num_nodes=247175, gpu_cache_miss=0.742, cpu_cache_miss=0.0121]
Epoch 16, Loss: 1.1838, Approx. Train: 0.6333, Approx. Val: 0.6712, Time: 27.473856687545776s
Training: 1178it [00:27, 43.50it/s, num_nodes=668569, gpu_cache_miss=0.742, cpu_cache_miss=0.0119]
Evaluating: 123it [00:02, 47.28it/s, num_nodes=246266, gpu_cache_miss=0.742, cpu_cache_miss=0.0119]
Epoch 17, Loss: 1.1774, Approx. Train: 0.6350, Approx. Val: 0.6774, Time: 27.08340549468994s
Training: 1178it [00:26, 44.40it/s, num_nodes=661165, gpu_cache_miss=0.742, cpu_cache_miss=0.0117]
Evaluating: 123it [00:02, 47.12it/s, num_nodes=247866, gpu_cache_miss=0.742, cpu_cache_miss=0.0117]
Epoch 18, Loss: 1.1716, Approx. Train: 0.6365, Approx. Val: 0.6777, Time: 26.530193567276s
Training: 1178it [00:27, 42.76it/s, num_nodes=661945, gpu_cache_miss=0.742, cpu_cache_miss=0.0116]
Evaluating: 123it [00:02, 47.55it/s, num_nodes=244619, gpu_cache_miss=0.742, cpu_cache_miss=0.0116]
Epoch 19, Loss: 1.1662, Approx. Train: 0.6379, Approx. Val: 0.6811, Time: 27.5500590801239s
Training: 1178it [00:26, 43.81it/s, num_nodes=668503, gpu_cache_miss=0.742, cpu_cache_miss=0.0115]
Evaluating: 123it [00:02, 47.48it/s, num_nodes=246921, gpu_cache_miss=0.742, cpu_cache_miss=0.0115]
Epoch 20, Loss: 1.1611, Approx. Train: 0.6392, Approx. Val: 0.6821, Time: 26.8914053440094s
Training: 1178it [00:27, 42.78it/s, num_nodes=652926, gpu_cache_miss=0.742, cpu_cache_miss=0.0113]
Evaluating: 123it [00:02, 46.06it/s, num_nodes=246958, gpu_cache_miss=0.742, cpu_cache_miss=0.0114]
Epoch 21, Loss: 1.1563, Approx. Train: 0.6405, Approx. Val: 0.6799, Time: 27.536324739456177s
Training: 1178it [00:27, 43.24it/s, num_nodes=678499, gpu_cache_miss=0.742, cpu_cache_miss=0.0112]
Evaluating: 123it [00:02, 47.46it/s, num_nodes=246747, gpu_cache_miss=0.742, cpu_cache_miss=0.0112]
Epoch 22, Loss: 1.1519, Approx. Train: 0.6416, Approx. Val: 0.6856, Time: 27.241077184677124s
Training: 1178it [00:27, 43.56it/s, num_nodes=672612, gpu_cache_miss=0.742, cpu_cache_miss=0.0111]
Evaluating: 123it [00:02, 46.31it/s, num_nodes=246751, gpu_cache_miss=0.742, cpu_cache_miss=0.0111]
Epoch 23, Loss: 1.1477, Approx. Train: 0.6427, Approx. Val: 0.6832, Time: 27.046754598617554s
Training: 1178it [00:26, 44.11it/s, num_nodes=639682, gpu_cache_miss=0.742, cpu_cache_miss=0.011]
Evaluating: 123it [00:03, 39.93it/s, num_nodes=247373, gpu_cache_miss=0.742, cpu_cache_miss=0.0111]
Epoch 24, Loss: 1.1438, Approx. Train: 0.6437, Approx. Val: 0.6813, Time: 26.70728063583374s
Training: 1178it [00:27, 42.98it/s, num_nodes=667361, gpu_cache_miss=0.742, cpu_cache_miss=0.011]
Evaluating: 123it [00:02, 47.60it/s, num_nodes=247172, gpu_cache_miss=0.742, cpu_cache_miss=0.011]
Epoch 25, Loss: 1.1401, Approx. Train: 0.6447, Approx. Val: 0.6780, Time: 27.40790557861328s
Training: 1178it [00:26, 44.15it/s, num_nodes=657206, gpu_cache_miss=0.742, cpu_cache_miss=0.0109]
Evaluating: 123it [00:02, 46.82it/s, num_nodes=246593, gpu_cache_miss=0.742, cpu_cache_miss=0.0109]
Epoch 26, Loss: 1.1365, Approx. Train: 0.6456, Approx. Val: 0.6858, Time: 26.682600498199463s
Training: 1178it [00:27, 42.98it/s, num_nodes=662702, gpu_cache_miss=0.742, cpu_cache_miss=0.0108]
Evaluating: 123it [00:02, 47.51it/s, num_nodes=247157, gpu_cache_miss=0.742, cpu_cache_miss=0.0108]
Epoch 27, Loss: 1.1332, Approx. Train: 0.6464, Approx. Val: 0.6811, Time: 27.411181926727295s
```
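The run above passes `--cpu-feature-cache-policy=s3-fifo`, and the steadily falling `cpu_cache_miss` column shows the cache warming up. S3-FIFO keeps a small probationary FIFO, a larger main FIFO, and a ghost queue of recently evicted keys: one-hit wonders are dropped from the small queue quickly, while keys that are touched again get promoted to the main queue. A minimal, simplified sketch of that idea (not GraphBolt's actual implementation; queue sizing and frequency capping here are assumptions for illustration):

```python
from collections import OrderedDict, deque


class S3FIFO:
    """Simplified S3-FIFO cache sketch: a small FIFO (~10% of capacity)
    filters one-hit wonders, the main FIFO holds promoted entries, and a
    ghost queue remembers keys recently evicted from the small queue."""

    def __init__(self, capacity):
        self.small_cap = max(1, capacity // 10)
        self.main_cap = capacity - self.small_cap
        self.small = OrderedDict()           # key -> (value, freq)
        self.main = OrderedDict()
        self.ghost = deque(maxlen=capacity)  # evicted keys only, no values

    def get(self, key):
        for q in (self.small, self.main):
            if key in q:
                value, freq = q[key]
                q[key] = (value, min(freq + 1, 3))  # cap frequency at 3
                return value
        return None  # miss

    def put(self, key, value):
        if self.get(key) is not None:
            return
        if key in self.ghost:   # seen recently: insert directly into main
            self._insert_main(key, value)
        else:                   # brand-new key: probation in small queue
            while len(self.small) >= self.small_cap:
                k, (v, f) = self.small.popitem(last=False)
                if f > 0:
                    self._insert_main(k, v)  # re-referenced: promote
                else:
                    self.ghost.append(k)     # one-hit wonder: remember key

            self.small[key] = (value, 0)

    def _insert_main(self, key, value):
        while len(self.main) >= self.main_cap:
            k, (v, f) = self.main.popitem(last=False)
            if f > 0:
                self.main[k] = (v, f - 1)    # second chance, decayed
        self.main[key] = (value, 0)


# tiny demo: "a" is a one-hit wonder at first, promoted on its return
cache = S3FIFO(capacity=10)   # small queue holds 1 entry, main holds 9
cache.put("a", 1)
cache.put("b", 2)             # "a" leaves the small queue into the ghost
print(cache.get("a"))         # None: a miss, but the key is remembered
cache.put("a", 1)             # ghost hit: inserted straight into main
print(cache.get("a"))         # 1
```

This ghost-queue behavior is what lets the policy keep repeatedly accessed feature rows resident while streaming one-off rows through cheaply, which fits the skewed access pattern of node features under neighbor sampling.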
Checklist
Please feel free to remove inapplicable items for your PR.
Changes