apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
https://celeborn.apache.org/
Apache License 2.0

[CELEBORN-1700] Flink supports fallback to vanilla Flink built-in shuffle implementation #2932

Closed SteNicholas closed 1 day ago

SteNicholas commented 1 week ago

What changes were proposed in this pull request?

Add support for Flink jobs to fall back to the vanilla Flink built-in shuffle implementation.

Why are the changes needed?

At present, when quota is insufficient or workers are unavailable, RemoteShuffleMaster cannot fall back to NettyShuffleMaster, and RemoteShuffleEnvironment cannot fall back to NettyShuffleEnvironment. Flink should be able to fall back to the vanilla Flink built-in shuffle implementation in both cases.
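
The fallback decision described above can be sketched as a routing step made once per job. This is a minimal illustration under stated assumptions, not the code in this pull request: `ShuffleBackend`, `FallbackRouter`, and the two boolean inputs are hypothetical stand-ins for the quota and worker-availability checks, and the "decide once, then remember" behavior is an assumption. Only the names RemoteShuffleMaster, NettyShuffleMaster, RemoteShuffleEnvironment, and NettyShuffleEnvironment come from the description.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical label for the two shuffle implementations a job can be routed to. */
enum ShuffleBackend {
  CELEBORN_REMOTE, // would be served by RemoteShuffleMaster / RemoteShuffleEnvironment
  FLINK_NETTY // would be served by NettyShuffleMaster / NettyShuffleEnvironment
}

/** Hypothetical router: picks a backend when a job registers and keeps that choice. */
final class FallbackRouter {
  private final Map<String, ShuffleBackend> backendByJob = new ConcurrentHashMap<>();

  /**
   * Chooses Celeborn only when quota is sufficient and workers are available;
   * otherwise falls back to Flink's built-in shuffle. A job that is already
   * registered keeps its earlier choice (an assumption of this sketch).
   */
  ShuffleBackend registerJob(String jobId, boolean quotaSufficient, boolean workersAvailable) {
    ShuffleBackend chosen =
        (quotaSufficient && workersAvailable)
            ? ShuffleBackend.CELEBORN_REMOTE
            : ShuffleBackend.FLINK_NETTY;
    ShuffleBackend previous = backendByJob.putIfAbsent(jobId, chosen);
    return previous != null ? previous : chosen;
  }

  /** Returns the backend recorded for the job, or null if the job is unknown. */
  ShuffleBackend backendOf(String jobId) {
    return backendByJob.get(jobId);
  }

  /** Forgets the job so that a later registration is evaluated afresh. */
  void unregisterJob(String jobId) {
    backendByJob.remove(jobId);
  }
}
```

Keeping the choice fixed per job means all shuffle operations of one job go through the same implementation, which is the reason this sketch records the decision instead of re-evaluating it on every call.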

[Figure: Flink Shuffle Fallback]

Does this PR introduce any user-facing change?

/**
 * The shuffle fallback policy determines whether to fall back to the vanilla Flink built-in
 * shuffle implementation.
 */
public interface ShuffleFallbackPolicy {

  /**
   * Returns whether to fall back to the vanilla Flink built-in shuffle implementation.
   *
   * @param shuffleContext The job shuffle context of Flink.
   * @param celebornConf The configuration of Celeborn.
   * @param lifecycleManager The {@link LifecycleManager} of Celeborn.
   * @return Whether to fall back to the vanilla Flink built-in shuffle implementation.
   */
  boolean needFallback(
      JobShuffleContext shuffleContext,
      CelebornConf celebornConf,
      LifecycleManager lifecycleManager);
}
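
As an illustration of how the interface above might be implemented, the following self-contained sketch defines two policies and a chain that falls back as soon as any policy asks for it. Everything apart from the `needFallback` signature is an assumption: `JobShuffleContext`, `CelebornConf`, and `LifecycleManager` are redeclared here as empty or simplified stand-ins so the example compiles without Flink or Celeborn on the classpath, and `hasQuota()`, `hasAvailableWorkers()`, `InsufficientQuotaPolicy`, `UnavailableWorkersPolicy`, and `FallbackPolicyChain` are hypothetical names, not Celeborn APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for Flink's job shuffle context so the sketch compiles on its own. */
interface JobShuffleContext {}

/** Stand-in for Celeborn's configuration class. */
interface CelebornConf {}

/** Stand-in for Celeborn's LifecycleManager; both methods are hypothetical. */
interface LifecycleManager {
  boolean hasQuota();

  boolean hasAvailableWorkers();
}

/** Same shape as the interface shown above, redeclared against the stand-ins. */
interface ShuffleFallbackPolicy {
  boolean needFallback(
      JobShuffleContext shuffleContext,
      CelebornConf celebornConf,
      LifecycleManager lifecycleManager);
}

/** Hypothetical policy: fall back when the quota check fails. */
final class InsufficientQuotaPolicy implements ShuffleFallbackPolicy {
  @Override
  public boolean needFallback(
      JobShuffleContext shuffleContext,
      CelebornConf celebornConf,
      LifecycleManager lifecycleManager) {
    return !lifecycleManager.hasQuota();
  }
}

/** Hypothetical policy: fall back when no workers are available. */
final class UnavailableWorkersPolicy implements ShuffleFallbackPolicy {
  @Override
  public boolean needFallback(
      JobShuffleContext shuffleContext,
      CelebornConf celebornConf,
      LifecycleManager lifecycleManager) {
    return !lifecycleManager.hasAvailableWorkers();
  }
}

/** Hypothetical combinator: falls back if any policy in the list requests it. */
final class FallbackPolicyChain implements ShuffleFallbackPolicy {
  private final List<ShuffleFallbackPolicy> policies;

  FallbackPolicyChain(List<ShuffleFallbackPolicy> policies) {
    this.policies = new ArrayList<>(policies); // defensive copy
  }

  @Override
  public boolean needFallback(
      JobShuffleContext shuffleContext,
      CelebornConf celebornConf,
      LifecycleManager lifecycleManager) {
    for (ShuffleFallbackPolicy policy : policies) {
      if (policy.needFallback(shuffleContext, celebornConf, lifecycleManager)) {
        return true;
      }
    }
    return false;
  }
}
```

A chain built from `InsufficientQuotaPolicy` and `UnavailableWorkersPolicy` would cover the two conditions named in the motivation; an empty chain never requests fallback.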

How was this patch tested?

SteNicholas commented 1 week ago

Ping @reswqa, @codenohup, @RexXiong.

SteNicholas commented 3 days ago

Ping @RexXiong, @FMX.

RexXiong commented 1 day ago

Thanks, merged to main (v0.6.0).