Lightning-Universe / DiffusionWithAutoscaler

Apache License 2.0

feat: Add `IntervalReplacement` strategy #15

Closed akihironitta closed 1 year ago

akihironitta commented 1 year ago

Description of this PR

Introduces an interval replacement strategy that keeps the endpoint available when running on spot instances.

To enable this feature, set interruptible=True:

component = AutoScaler(
    DiffusionServer,
-   cloud_compute=L.CloudCompute("gpu-rtx", disk_size=80),
+   cloud_compute=L.CloudCompute("gpu-rtx", interruptible=True, disk_size=80),
    ...
)

Or, to change the default replacement interval, pass IntervalReplacement(...) to AutoScaler:

+ from diffusion_with_autoscaler import IntervalReplacement
 component = AutoScaler(
     DiffusionServer,
-    cloud_compute=L.CloudCompute("gpu-rtx", disk_size=80),
+    cloud_compute=L.CloudCompute("gpu-rtx", interruptible=True, disk_size=80),
+    strategy=IntervalReplacement(interval=30 * 60),  # 30min
     ...
 )
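
To illustrate the idea behind interval-based replacement, here is a minimal, self-contained sketch. It is not the `IntervalReplacement` class added by this PR: the class name `IntervalReplacementSketch`, its methods, and the injectable `clock` are all hypothetical, and the only detail taken from the snippet above is that `interval` is expressed in seconds. The sketch only tracks when each work started and reports which works have been running long enough to be swapped for a fresh one.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class IntervalReplacementSketch:
    """Hypothetical illustration only; not the class introduced in this PR."""

    interval: float = 30 * 60  # seconds between replacements
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    _started_at: Dict[str, float] = field(default_factory=dict)

    def register(self, work_id: str) -> None:
        """Record the moment a work started serving."""
        self._started_at[work_id] = self.clock()

    def due_for_replacement(self) -> List[str]:
        """Return ids of works that have been running for at least `interval`."""
        now = self.clock()
        return [w for w, t in self._started_at.items() if now - t >= self.interval]

    def replace(self, old_id: str, new_id: str) -> None:
        """Forget the old work and start the timer for its replacement."""
        del self._started_at[old_id]
        self.register(new_id)


if __name__ == "__main__":
    fake_now = [0.0]  # mutable cell so the lambda sees updates
    strategy = IntervalReplacementSketch(interval=30 * 60, clock=lambda: fake_now[0])
    strategy.register("work-0")
    fake_now[0] = 30 * 60  # advance the fake clock by 30 minutes
    print(strategy.due_for_replacement())
```

Injecting the clock keeps the sketch testable without sleeping. A real strategy would also need to start the replacement work and wait until it is serving before stopping the old one, which this sketch leaves out.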

For benchmark results, see here (internal-only at this time): https://www.notion.so/60aca667b72c4aa79e496f5b61c8182a

Known limitations

  1. The endpoint becomes unavailable when an interruptible work is terminated by the cloud service provider. It becomes available again only after the interval has passed AND preemptible instances are available again on the provider side.
    • Future work: When spot instances are unavailable from the cloud service, fall back to the proxy or to on-demand instances.
  2. The app dies when any interruptible work is terminated by the cloud service provider while it is in the pending state. If it is terminated while in the running state, the app keeps running and tries to spin up interruptible works, as in 1. above.

akihironitta commented 1 year ago

@tchaton @hhsecond PTAL