dstackai / dstack

dstack is a lightweight, open-source alternative to Kubernetes & Slurm, simplifying AI container orchestration with multi-cloud & on-prem support. It natively supports NVIDIA, AMD, & TPU.
https://dstack.ai/docs
Mozilla Public License 2.0

[Feature]: Provide an option to delete old run logs automatically #1707

Open jvstme opened 1 month ago

jvstme commented 1 month ago

Problem

Run logs are never deleted, so disk space usage on the server can grow quickly and indefinitely.

Solution

Add a setting that will specify the TTL for run logs. Once the run finishes and its TTL expires, dstack server should delete this run's logs automatically.

The setting can be specified in ~/.dstack/server/config.yml at the project level. Values use the ISO 8601 duration format.

projects:
- name: main
  run_logs_ttl: P2M  # 2 months
  backends:
  - type: aws
    creds:
      type: default
  # ...

The setting should work for both file storage and CloudWatch storage options.

The default is to store logs indefinitely.
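
To make the expiry rule concrete, here is a minimal sketch of how a TTL value could be parsed and checked. The names `parse_ttl` and `logs_expired` are hypothetical, not existing dstack functions. Python's standard library has no ISO 8601 duration parser, so this sketch uses a small regex and approximates a month as 30 days and a year as 365 days, which is an assumption the proposal does not specify.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Matches durations such as P2M, P1W, P1DT12H (integer components only).
_DURATION_RE = re.compile(
    r"^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?"
    r"(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_ttl(value: str) -> timedelta:
    """Convert an ISO 8601 duration string to a timedelta (hypothetical helper)."""
    match = _DURATION_RE.match(value)
    if match is None or value.endswith("T"):
        raise ValueError(f"Invalid ISO 8601 duration: {value!r}")
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    if not parts:
        raise ValueError(f"Empty ISO 8601 duration: {value!r}")
    # Assumption: calendar units are approximated (1 month = 30 days, 1 year = 365 days).
    days = (
        parts.get("years", 0) * 365
        + parts.get("months", 0) * 30
        + parts.get("weeks", 0) * 7
        + parts.get("days", 0)
    )
    return timedelta(
        days=days,
        hours=parts.get("hours", 0),
        minutes=parts.get("minutes", 0),
        seconds=parts.get("seconds", 0),
    )


def logs_expired(
    finished_at: Optional[datetime],
    ttl: Optional[timedelta],
    now: Optional[datetime] = None,
) -> bool:
    """Return True if a run has finished and its TTL has elapsed."""
    if ttl is None or finished_at is None:
        # No TTL means "store indefinitely"; unfinished runs never expire.
        return False
    now = now or datetime.now(timezone.utc)
    return now - finished_at >= ttl
```

With this approximation, `parse_ttl("P2M")` yields 60 days, so a run's logs would become eligible for deletion 60 days after the run finishes.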

Workaround

Delete the logs manually in ~/.dstack/server/projects/<project_name>/logs, or store logs in AWS CloudWatch and use CloudWatch-specific mechanisms for log expiry.
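
For the file storage case, the manual cleanup could be scripted roughly as follows. This is a sketch only: the layout inside the logs directory is not described in this issue, so the hypothetical helper `delete_logs_older_than` treats every regular file under the directory as a log file and uses the file modification time as a stand-in for "when the run finished". It defaults to a dry run so nothing is deleted by accident.

```python
import time
from datetime import timedelta
from pathlib import Path
from typing import List, Optional


def delete_logs_older_than(
    logs_dir: Path,
    max_age: timedelta,
    now: Optional[float] = None,
    dry_run: bool = True,
) -> List[Path]:
    """Return files older than max_age; delete them unless dry_run is set."""
    now = time.time() if now is None else now
    cutoff = now - max_age.total_seconds()
    matched = []
    for path in sorted(logs_dir.rglob("*")):
        # Assumption: any regular file below logs_dir is a run log.
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            matched.append(path)
    return matched
```

Calling it with the project's logs directory and `dry_run=True` first lists what would be removed; passing `dry_run=False` performs the deletion.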

Alternative/future solutions

dstack server could automatically delete the oldest logs when storage approaches disk capacity or a specified limit.
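
A size-based policy like this could be sketched as "evict oldest first until the total fits under the limit". The `LogFile` record and the `select_logs_to_evict` function are hypothetical names used for illustration; how dstack would actually measure log size and age is not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogFile:
    path: str
    size_bytes: int
    modified_at: float  # Unix timestamp; older logs have smaller values


def select_logs_to_evict(files: List[LogFile], limit_bytes: int) -> List[LogFile]:
    """Pick the oldest logs to delete so the remaining total is within limit_bytes."""
    total = sum(f.size_bytes for f in files)
    evicted = []
    for log_file in sorted(files, key=lambda f: f.modified_at):
        if total <= limit_bytes:
            break
        evicted.append(log_file)
        total -= log_file.size_bytes
    return evicted
```

The function only selects candidates and leaves the deletion to the caller, which keeps the policy easy to test separately from the storage backend.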

Would you like to help us implement this feature by sending a PR?

Yes

r4victor commented 1 month ago

run_logs_ttl seems to make more sense as a server-level setting than as a project-level one. It could be:

projects:
encryption:
run_logs:
  ttl:

Also consider configuring other logger settings via config.yml then (CloudWatch group, region).

jvstme commented 1 month ago

Different projects can have different TTL requirements, e.g. a project for production deployments may set a long TTL while development projects may set a shorter TTL. Projects that are known to produce a lot of logs can also benefit from shorter TTLs.

A compromise could be to allow setting the default TTL at server level and overriding it at project level.

projects:
- name: main
  run_logs:
    ttl: P2M  # 2 months

run_logs:
  ttl: P1W  # 1 week
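
The override rule for this compromise could look roughly like the sketch below. It operates on the config already loaded into a plain dict, and the `run_logs` / `ttl` keys are the ones proposed in this thread, not an existing dstack option. `resolve_run_logs_ttl` is a hypothetical name.

```python
from typing import Any, Dict, Optional


def resolve_run_logs_ttl(config: Dict[str, Any], project_name: str) -> Optional[str]:
    """Return the project's TTL if set, else the server default, else None."""
    for project in config.get("projects") or []:
        if project.get("name") == project_name:
            ttl = (project.get("run_logs") or {}).get("ttl")
            if ttl is not None:
                return ttl  # The project-level value overrides the server default.
            break
    # Fall back to the server-level default; None means "store indefinitely".
    return (config.get("run_logs") or {}).get("ttl")
```

For the config above, project `main` would resolve to `P2M`, and any project without its own `run_logs.ttl` would fall back to `P1W`.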

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 30 days with no activity.