Open syamajala opened 4 weeks ago
This error only seems to appear when I run with profiling.
Which branch are you using? There is no line 2xxx in runtime_impl.h.
commit c032dab254f423ccab36d05c47fed42b94f0b3f5 (HEAD)
Merge: da0427fad 78d10af37
Author: Elliott Slaughter <elliottslaughter@gmail.com>
Date: Thu Sep 12 23:37:18 2024 +0000
Merge branch 'ci-fix-gasnet-stable' into 'master'
ci: Fix CI for GASNet stable branch
See merge request StanfordLegion/legion!1463
I thought you were using the new barrier branch. The error means the reduction value used by the barrier is too big. Currently, the active message upper limit of UCX is 8K; I am surprised that Legion uses such a big value with barriers. You can try increasing the size to 16K for now: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/ucx/ucp_internal.cc?ref_type=heads#L67
It needs to be 32K. I also have to run with -ucx:pb_max_size 32768.
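For context, the failure mode boils down to a simple size check: an active message only fits if its header plus payload stays under the configured limit. Below is a minimal standalone C++ sketch of that check, not Realm's actual code; the function name `am_fits` and the 8 KB default are illustrative (the default is taken from the discussion above, and it is what -ucx:pb_max_size overrides).

```cpp
#include <cstddef>

// Hypothetical check mirroring the UCX module's payload limit.
// Default is 8 KB per the discussion; tunable via -ucx:pb_max_size.
constexpr std::size_t kDefaultPbMaxSize = 8 * 1024;

bool am_fits(std::size_t header_bytes, std::size_t payload_bytes,
             std::size_t pb_max_size = kDefaultPbMaxSize) {
  return header_bytes + payload_bytes <= pb_max_size;
}
```

With a reduction payload larger than 16K, neither the 8K default nor a 16K bump passes this check, which is why bumping all the way to 32768 was needed here.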
@eddy16112 do you expect this will go away with the new barrier implementation, or should I leave this issue open for now?
In the new barrier branch, we need to send the tree to child nodes, so we have seen cases where header + tree + payload > max_size. Our solution is to divide the tree into fragments, but in your case header + payload is already larger than max_size, so I do not think we fixed this bug. Let's keep this bug open for now.
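The fragmentation approach described above can be sketched as splitting the oversized payload into chunks such that each fragment, together with its header, fits under the limit. This is a purely illustrative C++ sketch under that assumption, not the actual code from the barrier branch:

```cpp
#include <cstddef>
#include <vector>

// Sketch: split `total_bytes` of payload into fragments so that
// header_bytes + fragment <= max_size holds for every fragment.
std::vector<std::size_t> fragment_sizes(std::size_t total_bytes,
                                        std::size_t header_bytes,
                                        std::size_t max_size) {
  std::vector<std::size_t> frags;
  const std::size_t chunk = max_size - header_bytes;  // usable bytes/message
  while (total_bytes > 0) {
    const std::size_t n = total_bytes < chunk ? total_bytes : chunk;
    frags.push_back(n);
    total_bytes -= n;
  }
  return frags;
}
```

For example, 20000 bytes of tree data with a 64-byte header and an 8192-byte limit would go out as three fragments (8128 + 8128 + 3744 bytes). Note this only helps when the non-fragmentable part (header + payload) already fits, which is exactly why it does not cover the case reported here.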
Yeah, I am surprised we are hitting this on the legacy branch. Likely that's just been there for a while and never tested.
I think outside of Legate no one really uses UCX. I tried to use UCX on Summit a year ago, but hit #1396 and #1059. Most of the big machines these days are on Slingshot-11 now, but the S3DF cluster at SLAC is 100Gb Ethernet, and even with GASNet we would need to use the UCX conduit in that case.
The new critical path profiling infrastructure will bang on it in a way that it didn't used to get used very often, which is very likely what is happening here.
@lightsighter I am surprised that the critical path profiling uses such a big reduction data, almost 36K.
Legion is not using that large of a reduction. It's reducing this data structure which is not 36K:
I suspect there is a performance bug in Realm where it always sends the reduced value for all generations which continues to grow larger and larger rather than sending the reduced values for the difference in subscribed generations.
I think it's close but not exactly: the owner can collapse subscribe generations into a single notify active message, depending on the latest subscribe generation observed. In case we are running 1000 gens, that would result in a single notification for M subscribers, each of which will carry 1000 * reduce_value_size. That is where I think we are hitting this. The solution, in my opinion, should be much like what we do in the "scalable" barrier branch: a staged broadcast where the number of stages equals max_payload_size / payload_size.
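The arithmetic behind that staged broadcast can be sketched as a ceiling division: the owner must deliver gens * reduce_value_size bytes of reduced values per subscriber, split across messages of at most max_payload bytes each. A minimal C++ sketch, with illustrative names (not Realm's API) and an assumed 36-byte reduction value chosen so that 1000 generations lands near the ~36K figure mentioned above:

```cpp
#include <cstddef>

// Sketch: number of notification messages a staged broadcast would need
// per subscriber, if the owner must deliver `gens * value_size` bytes of
// reduced values and each message carries at most `max_payload` bytes.
std::size_t notify_messages(std::size_t gens, std::size_t value_size,
                            std::size_t max_payload) {
  const std::size_t total = gens * value_size;
  return (total + max_payload - 1) / max_payload;  // ceiling division
}
```

Under these assumptions, 1000 generations of a 36-byte value against an 8 KB payload limit would take 5 messages per subscriber instead of one 36000-byte message that trips the UCX limit.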
@syamajala Is this blocking? I can consider doing a separate fix if we need it immediately; otherwise we can wait until we merge the scalable barrier branch, which I believe should address it.
I have a workaround for right now, so it's not blocking.
I am seeing the following assertion when running cunumeric with ucx:
Here is a stack trace: