[CELEBORN-1549] Fix networkLocation persistence into Ratis

apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.

https://celeborn.apache.org/

Apache License 2.0

862 stars 351 forks source link

[CELEBORN-1549] Fix networkLocation persistence into Ratis #2669

Closed akpatnam25 closed 1 month ago

akpatnam25 commented 1 month ago

What changes were proposed in this pull request?

Fixing a bug where the networkLocation is not persisted in Ratis, and the master defaults to DEFAULT_RACK when it loads the snapshot. This was missed in https://github.com/apache/celeborn/pull/2367 unfortunately, and it came up during our stress testing internally.

Why are the changes needed?

Needed for custom network aware replication, so that networkLocation state is kept in snapshot file.

Does this PR introduce any user-facing change?

How was this patch tested?

Updated unit test to ensure serde is correct.

akpatnam25 commented 1 month ago

cc @mridulm @waitinfuture @FMX @SteNicholas @AngersZhuuuu @pan3793

SteNicholas commented 1 month ago

@akpatnam25, no any document update?

akpatnam25 commented 1 month ago

@akpatnam25, no any document update?

No I don't think there is any document to update @SteNicholas

akpatnam25 commented 1 month ago

can we merge this PR if it is good to go? :) @SteNicholas

FMX commented 1 month ago

Just hold on a second. The field is skipped for a reason. Let me find out why the field is not included in the proto.

FMX commented 1 month ago

org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager#restoreMetaFromFile

Abstract meta manager will try to resolve workerinfos who are "DEFAULT_RACK".

akpatnam25 commented 1 month ago

org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager#restoreMetaFromFile Abstract meta manager will try to resolve workerinfos who are "DEFAULT_RACK".

@FMX yes that is correct. However, this can be something different based on what is passed in to the workerInfo from the workers (see this PR). If that resolution on the worker side does not take place, master can try to resolve when it restores the snapshot.

akpatnam25 commented 1 month ago

@FMX essentially we need this whenever the networkLocation != DEFAULT_RACK. In this case, master should restore the snapshot with that custom network location. hope this makes sense

akpatnam25 commented 1 month ago

This is essentially something we missed in this PR: https://github.com/apache/celeborn/pull/2367

Otherwise network location is never persisted into ratis.

FMX commented 1 month ago

If the rack configs are changed and the masters are restarted but workers didn't. This pr will get the wrong network locations until the workers are lost and register again.

FMX commented 1 month ago

In stress testing, did the master fail to resolve network locations?

akpatnam25 commented 1 month ago

Thanks @FMX!! Yes, I understand the concern. This is basically only since our rack configs will never change. I described more below :slightly_smiling_face: In our environment, there are 2 ways for the master to know the networkLocation:

Query external REST API for each worker to get the networkLocation -- this is an expensive call for 400+ workers
Worker sends its networkLocation as a part of the registration process

Given this, we chose the 2nd approach above. Essentially, when a worker sends its networkLocation, that networkLocation will never change for us. The current logic in the code is that if the networkLocation upon reading the snapshot is default, then it will again try to query the external API at the Master again once more (this is not a desired code path, but is basically the backup in case the networkLocation is null in the snapshot). This means that networkLocation must be persisted into Ratis which is basically the cause of this PR. cc @mridulm

akpatnam25 commented 1 month ago

Even if rack configs changed, all masters would never be restarted all at the same time anyways in a production setting, so this would cause split brain anyways for a period of time, which would not be good for the cluster. Let me know what you think @FMX

akpatnam25 commented 1 month ago

@FMX I updated this PR to have a config to persist the networkLocation as we spoke offline. I think the failing test is unrelated to this PR, let me know if you think otherwise.

akpatnam25 commented 1 month ago

thanks @FMX and everyone else for the reviews!! :)