Closed akpatnam25 closed 1 month ago
cc @mridulm @waitinfuture @FMX @SteNicholas @AngersZhuuuu @pan3793
@akpatnam25, no any document update?
@akpatnam25, no any document update?
No I don't think there is any document to update @SteNicholas
can we merge this PR if it is good to go? :) @SteNicholas
Just hold on a second. The field is skipped for a reason. Let me find out why the field is not included in the proto.
org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager#restoreMetaFromFile
Abstract meta manager will try to resolve workerinfos who are "DEFAULT_RACK".
org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager#restoreMetaFromFile Abstract meta manager will try to resolve workerinfos who are "DEFAULT_RACK".
@FMX yes that is correct. However, this can be something different based on what is passed in to the workerInfo from the workers (see this PR). If that resolution on the worker side does not take place, master can try to resolve when it restores the snapshot.
@FMX essentially we need this whenever the networkLocation != DEFAULT_RACK. In this case, master should restore the snapshot with that custom network location. hope this makes sense
This is essentially something we missed in this PR: https://github.com/apache/celeborn/pull/2367
Otherwise network location is never persisted into ratis.
If the rack configs are changed and the masters are restarted but workers didn't. This pr will get the wrong network locations until the workers are lost and register again.
In stress testing, did the master fail to resolve network locations?
Thanks @FMX!! Yes, I understand the concern. This is basically only since our rack configs will never change. I described more below :slightly_smiling_face:
In our environment, there are 2 ways for the master to know the networkLocation
:
networkLocation
-- this is an expensive call for 400+ workersnetworkLocation
as a part of the registration processGiven this, we chose the 2nd approach above. Essentially, when a worker sends its networkLocation
, that networkLocation
will never change for us. The current logic in the code is that if the networkLocation
upon reading the snapshot is default, then it will again try to query the external API at the Master again once more (this is not a desired code path, but is basically the backup in case the networkLocation
is null in the snapshot). This means that networkLocation
must be persisted into Ratis which is basically the cause of this PR. cc @mridulm
Even if rack configs changed, all masters would never be restarted all at the same time anyways in a production setting, so this would cause split brain anyways for a period of time, which would not be good for the cluster. Let me know what you think @FMX
@FMX I updated this PR to have a config to persist the networkLocation
as we spoke offline. I think the failing test is unrelated to this PR, let me know if you think otherwise.
thanks @FMX and everyone else for the reviews!! :)
What changes were proposed in this pull request?
Fixing a bug where the
networkLocation
is not persisted in Ratis, and the master defaults toDEFAULT_RACK
when it loads the snapshot. This was missed in https://github.com/apache/celeborn/pull/2367 unfortunately, and it came up during our stress testing internally.Why are the changes needed?
Needed for custom network aware replication, so that networkLocation state is kept in snapshot file.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Updated unit test to ensure serde is correct.