dandi / dandi-hub

Infrastructure and code for the dandihub
https://hub.dandiarchive.org

Request: Enhance Monitoring for User-specific CPU and Disk Usage #187

Open bendichter opened 3 months ago

bendichter commented 3 months ago

Description:

We need a functional system for monitoring usage and cost by user, ideally with a no-code dashboard. This feature would empower us to manage resource allocation and open up registrations to new users more confidently.

Requirements:

  1. CPU and Disk Usage Monitoring by User:

    • Monitor disk usage and CPU usage (or relevant cost factors) for individual users.
    • While disk usage can be monitored with du checks, we need a way to generate reports over time, not just at an instant.
  2. Reporting and Analytics:

    • Provide reports on server options used and duration by user.
    • Create a system to monitor incremental and shared costs. This involves attributing the incremental cost of a node to the user who created it and splitting shared costs equally among the node's users.
  3. Dashboard:

    • Develop a no-code dashboard to visualize usage and cost data.
    • Include functionality to pre-set usage limits for users from this dashboard.
  4. Integration and Metrics:

    • Integrate with Grafana and Prometheus for improved metrics collection from AWS and other cloud vendors.
    • Ensure the system can handle cost anomaly detection.
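To make the incremental/shared cost requirement concrete, here is a minimal sketch of one possible reading of it: the shared part of a node's cost is split equally among everyone using the node, and the incremental part is charged to whoever created it. The function name `allocate_node_cost` and its parameters are hypothetical, and how a node's bill would actually be divided into "shared" and "incremental" is an assumption still to be agreed on.

```python
from collections import defaultdict


def allocate_node_cost(creator, users, shared_cost, incremental_cost):
    """Return a {user: cost} mapping for a single node.

    Assumed policy (not confirmed by the team):
      - shared_cost is divided equally among all users of the node
      - incremental_cost is charged entirely to the node creator
    """
    if not users:
        raise ValueError("a node needs at least one user")

    shares = defaultdict(float)

    # Equal split of the shared portion.
    per_user = shared_cost / len(users)
    for user in users:
        shares[user] += per_user

    # The creator additionally carries the incremental portion.
    shares[creator] += incremental_cost
    return dict(shares)


if __name__ == "__main__":
    # Hypothetical numbers: a node with 10.0 in shared and 4.0 in incremental cost.
    print(allocate_node_cost("alice", ["alice", "bob"], 10.0, 4.0))
```

Summing these per-node mappings over a billing period would give a per-user report, which is the kind of table a dashboard could display.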

Challenges:

Proposed MVP:

  1. Metrics Collection:

    • Enhance AWS metrics collection to include hourly data instead of just daily totals.
  2. Disk Usage Monitoring:

    • Implement a disk usage monitoring and cleanup procedure.
  3. Cost Anomaly Detection:

    • Use existing tools (e.g., @satra's anomaly detection system) for total cost anomaly detection.
  4. Grafana and Prometheus Integration:

    • Integrate with Grafana and Prometheus for comprehensive monitoring and alerting.
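As a starting point for the disk usage item, and for getting reports over time rather than a single instant, here is a rough sketch that records a timestamped per-user snapshot each time it runs (for example from a scheduled job). It assumes each user has one directory directly under a common root. The names `dir_size_bytes` and `record_snapshot` and the CSV log layout are hypothetical. The sizes are sums of file sizes only, so they will differ somewhat from what `du` reports.

```python
import csv
import os
import time
from pathlib import Path


def dir_size_bytes(root):
    """Sum the sizes of all files under root (symlinks are not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # File vanished or is unreadable; skip it.
                pass
    return total


def record_snapshot(home_root, log_path, now=None):
    """Append one (timestamp, user, bytes) row per user directory to a CSV log.

    Assumes every immediate subdirectory of home_root belongs to one user.
    """
    timestamp = int(time.time() if now is None else now)
    rows = []
    for entry in sorted(Path(home_root).iterdir()):
        if entry.is_dir():
            rows.append((timestamp, entry.name, dir_size_bytes(entry)))

    # Append so that repeated runs build up a history.
    with open(log_path, "a", newline="") as fh:
        csv.writer(fh).writerows(rows)
    return rows
```

Running this hourly would yield a simple time series per user that could later be loaded into whatever dashboard is chosen, or replaced by a proper metrics exporter.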

References:

This is a rough outline based on a convo with @asmacdo. Input and collaboration from the team will be crucial to refining the requirements to meet our needs.

yarikoptic commented 3 months ago

Might be worth investigating whether and how nebari does that (#186).