codes-org / codes

The Co-Design of Exascale Storage Architectures (CODES) simulation framework builds upon the ROSS parallel discrete event simulation engine to provide high-performance simulation utilities and models for building scalable distributed systems simulations.

node: 0: error: ...ross-inline.h:106: Maximum zero-offset tie chain reached (100), increase #define in ross-types.h #231

Open lzk23 opened 1 year ago

lzk23 commented 1 year ago

Hello, I am testing running multiple jobs with contiguous allocation, as in Exercise 3 of https://github.com/codes-org/codes/wiki/quick-start-interconnects. However, this error occurs: node: 0: error: /home/codes-dev/build-ross/include/ross-inline.h:106: Maximum zero-offset tie chain reached (100), increase #define in ross-types.h. I tried to increase the value of MAX_TIE_CHAIN in ross-types.h, but as the value increases, the simulation eats much more memory and runs extremely slowly. In this case I increased MAX_TIE_CHAIN from 100 to 20000 and the error disappeared, but the required memory was more than 300 GB, which caused the program to crash. How can I fix this problem? Thanks a lot.

nmcglo commented 1 year ago

So this is one downside to the ROSS unbiased tiebreaker. The unbiased tiebreaker feature of ROSS will fairly and consistently choose an ordering of events that are tied temporally with other events.

Things get complicated, however, when zero-offset events are also present. To clarify: zero-offset events are events created with zero tw_stime delay from the event that created them. Since zero-offset events naturally tie, temporally, with their causal event (and also with any events that tie with it), consistently ordering those events in a fair way requires an array of tie-breaking values (automatically generated by ROSS) whose cardinality equals the number of zero-offset "generations".

For example: if an event A creates another event A' with zero offset, A' creates another zero-offset event A'', and so on until you reach A''''', you'd need a tie-breaking value array of size 6 to fairly break ties in a way that doesn't violate causality. That size is the maximum tie chain length, and because it is encoded into messages transmitted across PEs, it has to be statically allocated into each event. Thus, the longer that chain needs to be, the heavier the impact on memory. Setting that value to 20,000 means that each event has an array of 20,000 64-bit floats encoded into it. That's a very heavy structure.

Solutions:

  1. Disable the tiebreaker in ROSS during cmake configuration. USE_RAND_TIEBREAKER is the flag name, I believe. You'll have to re-make ROSS and CODES after this. This will, however, result in your simulation possibly being non-deterministic if there are a significant number of tied events (particularly if they tie at the same time on the same LP).

  2. Determine where all of the zero-offset events are coming from and add some positive offset to them; even a tiny amount will make things significantly easier on the tiebreaking feature.

If you want some more context on this tie breaking feature, here's a paper I wrote on it:

https://nmcglo.com/public-files/papers/2021_wsc_tiebreaker.pdf

lzk23 commented 1 year ago

Thanks for your reply. Actually, I don't know the principles behind CODES and ROSS. I have tried adding the flag USE_RAND_TIEBREAKER (-DUSE_RAND_TIEBREAKER=on, is this right?) during cmake configuration for ROSS. However, the problem still exists. As for the second solution, I really don't know how to determine where the zero-offset events are.

nmcglo commented 1 year ago

Apologies for the delay in responding; I've been starting a new job and traveling a lot in November.

The quick solution is actually to set -DUSE_RAND_TIEBREAKER=off when configuring ROSS (then rebuild ROSS and CODES). This disables the deterministic tiebreaker feature of ROSS, reverting its handling of event processing order to the state it was in a year or so ago. For the most part, that is "good enough". The tiebreaker's purpose is to guarantee deterministic ordering of event processing when simultaneous events exist in the simulation. Without the tiebreaker there is a mild probability of non-deterministic output, and the tiebreaking of simultaneous events is not "unbiased", meaning some ruleset will break ties in a way that does not assign equal probability to every possible ordering of those simultaneous events.
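A hedged sketch of that rebuild sequence (the directory names and source path are assumptions based on the build-ross path in the error above; adjust to your own layout):

```shell
# Reconfigure ROSS with the tiebreaker disabled (note: off, not on),
# then rebuild and reinstall ROSS before rebuilding CODES against it.
cd build-ross
cmake -DUSE_RAND_TIEBREAKER=off /path/to/ROSS
make -j4 && make install

# Rebuild CODES so it picks up the reconfigured ROSS.
cd ../build-codes
make clean && make -j4
```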

It should not make a significant difference semantically unless you're doing very formal and strict statistical analysis on the output of many runs.