Open songweijia opened 3 years ago
The test case can be found in the dairy_farm_exp branch of cascade, commit c8e97a671832d1329bdbdd12ee8e8aebdbb25c2c
I'm wondering if this could be an instance of #215
As discussed in that thread, it is no longer good enough to just declare the SST as volatile. We are supposed to shift to using std::atomics, with "relaxed" consistency for guarded objects in the SST, but acquire-release consistency for the guards themselves, such as the "SMC message is ready" bit, the "how many messages I have sent" guard, etc.
Specifically, this suggests that at a minimum our main predicate loop should have an acquire-release atomic at the top of the loop, like a counter that can be incremented each time we loop. This would signal to the hardware that we will be accessing data that may have changed due to I/O. It is interesting to realize that we don't need an actual lock and that any variable of the proper std::atomic type (acquire-release, specifically) will suffice. That would create a memory fence, causing a very brief pause while asynchronous updates get a chance to reach DRAM and to invalidate stale cache entries, and then the next iteration of the predicate loop will see fresher data if anything has recently arrived...
In one of the Cascade evaluation experiments, I need to issue an ordered send to ask all nodes in a shard to dump their timestamp logs. This mechanism was working properly in most cases. But in a special case where the containing subgroup is the second subgroup of the same type, the ordered send never finishes. Tracing into the execution using gdb, I found
MulticastGroup::delivery_trigger()
is not properly triggered in the predicated thread on the sender node. For the very first ordered send, testing shows thatlocally_stable_sst_messages[subgroup_num]
is not empty and the first sequence read inleast_undelivered_sst_seq_num
is 0. However, themin_stable_num
variable gets -1, unexpectedly excluding the code guarded by this condition. The message never gets delivered.I'm not sure if this always happens for the subgroups which are not the first one of a subgroup type. It needs more investigation.
I've talked to Sagar about this since he knows the best of this part. He will investigate it later. I can find a workaround for my tests but it would be great if we fix this thoroughly.