SJTU-IPADS / ugache

Apache License 2.0

F /ugache/coll_cache_lib/cache_context.cu:2831] Check failed: large_exclude % 2 == 0 #1

Closed LukeLIN-web closed 10 months ago

LukeLIN-web commented 10 months ago

I am trying to reproduce the results from the paper. Preprocessing went well. Environment: 4 x A100-SXM4-40GB, Docker container built from the repo.

But when I run

cd /ugache/example/gnn
# use DGL for graph sampling
python dgl_sample.py

it prints the following and fails:

[2023-11-09 08:43:13.581903: F /ugache/coll_cache_lib/cache_context.cu:2831] Check failed: large_exclude % 2 == 0 
[2023-11-09 08:43:13.582194: E /ugache/coll_cache_lib/cache_context.cu:2591] Asymm Coll cache (policy: coll_cache_asymm_link) | local 244870 / 2449029 nodes ( 10.00 %~10.00 %) | remote 729121 / 2449029 nodes ( 29.77 %) | cpu 1475038 / 2449029 nodes ( 60.23 %) | 94.36 MB | 0.0344182 secs 
test_result:init:feat_nbytes=979611600
[test_result:init:cache_nbytes=98940400
2023-11-09 08:43:13.582231: E /ugache/coll_cache_lib/common.cc:107] too many mem allocated for forcescale?247351->4453
 mask is 1111111111111111111111111111111111111111111111111111111111100000
[2023-11-09 08:43:13.582352: F /ugache/coll_cache_lib/cache_context.cu:2831] Check failed: large_exclude % 2 == 0 
[2023-11-09 08:43:13.583119: E /ugache/coll_cache_lib/cache_context.cu:2591] Asymm Coll cache (policy: coll_cache_asymm_link) | local 243091 / 2449029 nodes ( 9.93 %~10.00 %) | remote 730900 / 2449029 nodes ( 29.84 %) | cpu 1475038 / 2449029 nodes ( 60.23 %) | 94.36 MB | 0.0344854 secs 
test_result:init:feat_nbytes=979611600
test_result:init:cache_nbytes=98940400
[2023-11-09 08:43:13.583247: F /ugache/coll_cache_lib/cache_context.cu:2831] Check failed: large_exclude % 2 == 0 
[2023-11-09 08:43:13.583479: E /ugache/coll_cache_lib/cache_context.cu:2591] Asymm Coll cache (policy: coll_cache_asymm_link) | local 242898 / 2449029 nodes ( 9.92 %~10.00 %) | remote 731093 / 2449029 nodes ( 29.85 %) | cpu 1475038 / 2449029 nodes ( 60.23 %) | 94.36 MB | 0.0339713 secs 
test_result:init:feat_nbytes=979611600
test_result:init:cache_nbytes=98940400
[2023-11-09 08:43:13.583611: F /ugache/coll_cache_lib/cache_context.cu:2831] Check failed: large_exclude % 2 == 0 
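
For anyone reading this log later: the placement summary lines themselves appear internally consistent, so they are probably not where the problem lies. The sketch below, which is only an illustration and not part of ugache, parses the first summary line and checks that the local, remote, and cpu counts add up to the total node count. The helper name `parse_placement` is hypothetical, and the bytes-per-node figure is inferred from the logged `feat_nbytes`, not stated anywhere in the log.

```python
import re

# First placement summary line, copied from the log above.
LOG_LINE = (
    "Asymm Coll cache (policy: coll_cache_asymm_link) | "
    "local 244870 / 2449029 nodes ( 10.00 %~10.00 %) | "
    "remote 729121 / 2449029 nodes ( 29.77 %) | "
    "cpu 1475038 / 2449029 nodes ( 60.23 %) | 94.36 MB | 0.0344182 secs"
)
FEAT_NBYTES = 979611600   # test_result:init:feat_nbytes
CACHE_NBYTES = 98940400   # test_result:init:cache_nbytes


def parse_placement(line):
    """Hypothetical helper: return ({location: node_count}, total_nodes)."""
    matches = re.findall(r"(local|remote|cpu)\s+(\d+)\s*/\s*(\d+)\s+nodes", line)
    counts = {name: int(count) for name, count, _ in matches}
    totals = {int(total) for _, _, total in matches}
    if len(counts) != 3 or len(totals) != 1:
        raise ValueError("unexpected log format")
    return counts, totals.pop()


counts, total = parse_placement(LOG_LINE)

# Inferred, not logged: feature bytes per node and the per-GPU cache capacity.
bytes_per_node = FEAT_NBYTES // total
cache_nodes = CACHE_NBYTES // bytes_per_node

print("sum of placements:", sum(counts.values()), "of", total)
for name, count in counts.items():
    print(f"{name:>6}: {100 * count / total:.2f} %")
print("bytes per node:", bytes_per_node)
print("cache capacity in nodes:", cache_nodes)
print(f"cache size: {CACHE_NBYTES / 2**20:.2f} MiB")
```

The computed capacity matches the `247351` that appears in the `forcescale?247351->4453` message. This check says nothing about why `large_exclude % 2 == 0` fails; it only rules out a malformed placement summary.
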
molamooo commented 10 months ago

Sorry for the trouble. We didn't handle platforms other than 4xV100 and 8xA100 properly. A quick fix is to change the value 56 to 60 on L212 and L264 of https://github.com/SJTU-IPADS/ugache/blob/main/coll_cache_lib/coll_cache/asymm_link_desc.cc. We'll fix this in the future.

LukeLIN-web commented 10 months ago

Thank you for your reply. I changed L212 and L264 from 56 to 60, but the same output still occurs.

molamooo commented 10 months ago

I've tested the modification on the first 4 GPUs on an 8xA100 platform. The issue indeed exists and the modification fixes it.

Did you modify ugache's codebase inside the container at /ugache, and then recompile and reinstall ugache?
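
One way to confirm that the edit landed in the copy the container actually builds from is to print the two lines in question. This is a minimal sketch, assuming the file sits at `/ugache/coll_cache_lib/coll_cache/asymm_link_desc.cc` inside the container (the path is inferred from the repository layout and the `/ugache` prefix in the log) and that the line numbers still match the version linked above. The helper name `read_lines` is made up for illustration.

```python
import os

# Assumed path inside the container; adjust if the checkout lives elsewhere.
SOURCE = "/ugache/coll_cache_lib/coll_cache/asymm_link_desc.cc"
EDITED_LINES = (212, 264)  # line numbers named in the quick fix


def read_lines(path, line_numbers):
    """Hypothetical helper: return {line_number: text} for 1-indexed lines."""
    wanted = set(line_numbers)
    found = {}
    with open(path, encoding="utf-8", errors="replace") as source_file:
        for number, text in enumerate(source_file, start=1):
            if number in wanted:
                found[number] = text.rstrip("\n")
    return found


if os.path.exists(SOURCE):
    # Both printed lines should now contain 60 instead of 56.
    for number, text in sorted(read_lines(SOURCE, EDITED_LINES).items()):
        print(f"{SOURCE}:{number}: {text}")
else:
    print(f"{SOURCE} not found; is this running inside the container?")
```

This only checks the source file. The installed library still has to be rebuilt and reinstalled before `dgl_sample.py` picks up the change.
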

LukeLIN-web commented 10 months ago

Thank you! It works.