awslabs / benchmark-ai

Anubis (formerly known as Benchmark AI) measures the goodness of machine learning workloads
Apache License 2.0

Plotting graphs for cron jobs #1015

Open surajkota opened 4 years ago

surajkota commented 4 years ago

Currently we publish client-id, action-id, and a set of custom-defined labels as dimensions for a metric. According to the CloudWatch dimension documentation:

CloudWatch treats each unique combination of dimensions as a separate metric, even if the metrics have the same metric name. You can only retrieve statistics using combinations of dimensions that you specifically published. When you retrieve statistics, specify the same values for the namespace, metric name, and dimension parameters that were used when the metrics were created.

According to the example in the doc above, one cannot retrieve statistics by specifying only a subset of dimensions or use a wildcard for any dimension. This means a customer cannot plot or create alarms against a metric if they set up cron jobs, since the action-id is different every time.
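For illustration, here is a boto3 sketch of the behaviour (the namespace, metric name, and values are made up for the example): a datapoint published with client-id and action-id as dimensions cannot be retrieved by querying on client-id alone.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a datapoint the way the exporter does today: client-id and
# action-id are dimensions (names and values here are illustrative).
cloudwatch.put_metric_data(
    Namespace="Anubis/Benchmarks",
    MetricData=[{
        "MetricName": "epoch_time",
        "Dimensions": [
            {"Name": "client-id", "Value": "team-a"},
            {"Name": "action-id", "Value": "run-0001"},  # new value on every cron run
        ],
        "Value": 42.0,
    }],
)

# Retrieval must name the exact dimension combination that was published.
# Querying with only client-id (a subset) yields no datapoints at all.
stats = cloudwatch.get_metric_statistics(
    Namespace="Anubis/Benchmarks",
    MetricName="epoch_time",
    Dimensions=[{"Name": "client-id", "Value": "team-a"}],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
    EndTime=datetime.datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)
print(stats["Datapoints"])  # [] -- action-id was omitted from the query
```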

Do we have a design doc for the CloudWatch exporter? Is there a workaround for this issue?

Another issue to think about as part of this: the current assumption is that the only thing keeping metrics from being mixed up between multiple parallel runs with the same toml file is the action-id. If we remove client-id and action-id from the dimensions, there is no way to differentiate which run a metric was generated from.
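Conversely, if the dimensions are reduced to only the user's labels, datapoints from two parallel runs land in the same time series and become indistinguishable. A minimal sketch, again with made-up names:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Two parallel runs of the same toml publish with identical dimensions once
# client-id / action-id are dropped, so they share one time series.
for value_from_some_run in (41.0, 57.0):  # interleaved datapoints from runs A and B
    cloudwatch.put_metric_data(
        Namespace="Anubis/Benchmarks",
        MetricData=[{
            "MetricName": "epoch_time",
            "Dimensions": [{"Name": "model", "Value": "resnet50"}],
            "Value": value_from_some_run,
        }],
    )
# Nothing stored with the datapoints says which run produced which value.
```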

jlcontreras commented 4 years ago

one cannot retrieve statistics by specifying only a subset of dimensions or use a wildcard for any dimension

I was not aware of this limitation of CloudWatch :( It is indeed a problem; as you say, a workaround is going to be needed in order to be able to differentiate between different runs of a cron job.

We haven't designed anything for this task, since we hadn't realised the problem.

surajkota commented 4 years ago

Jose pointed to the documentation available in the cloudwatch exporter and metrics pusher READMEs.

surajkota commented 4 years ago

Ok, to mitigate the issue I suggest the following:

  1. Skip adding client-id, action-id, and possibly parent-action-id (do we push this in the case of cron jobs?) to the CloudWatch metric dimensions
  2. Make label a required field in the toml and leave it up to the user to use either unique metric names or unique labels for different benchmarks (see the descriptor sketch below)

Let me know if it has any impact I did not list.
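To illustrate point 2, a hypothetical descriptor sketch; the section and key names are assumed for the example and may differ from the real Anubis schema:

```toml
# Hypothetical descriptor sketch -- section/key names are assumed,
# not taken from the actual Anubis schema.
[info]
task_name = "resnet50-cifar10"

[labels]
# Required under this proposal: user-chosen labels become the CloudWatch
# dimensions instead of client-id / action-id, so every run of a cron job
# feeds the same time series and can be plotted or alarmed on.
model = "resnet50"
dataset = "cifar10"
```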

Questions:

  1. What is the impact on Prometheus if I make the change to skip adding client-id and action-id to the list of labels in the metrics pusher itself, rather than in the cloudwatch exporter?

jlcontreras commented 4 years ago

The problem with dropping action-id is that we lose the only unique identifier we have for benchmark runs, making it difficult, or in some cases impossible, to know which run a metric corresponds to.

I think forcing users to specify unique label/metric-name combinations defeats the purpose of labels (which is to categorize and classify runs so they can be grouped afterwards). What do you think of making the task_name field a required one and enforcing some particular syntax on it? As a customer, it would seem more logical to me that task_name must be unique.
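A minimal sketch of what enforcing a syntax could look like; the regex and the descriptor layout are assumptions, not the existing reader:

```python
import re

# Hypothetical check; the pattern (DNS-label-like, so it is also safe to
# reuse as a Kubernetes label value) and the descriptor layout are assumed.
TASK_NAME_RE = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")

def validate_task_name(descriptor: dict) -> str:
    info = descriptor.get("info", {})
    if "task_name" not in info:
        raise ValueError("task_name is a required field")
    task_name = info["task_name"]
    if not TASK_NAME_RE.match(task_name):
        raise ValueError(f"task_name {task_name!r} must match {TASK_NAME_RE.pattern}")
    return task_name
```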

The question of how to tell different cron job runs apart without using action-id is still open, though.

As for your question, I'd double-check this, but I believe pod labels get exported as metric labels automatically by Prometheus. As action-id and client-id are set as pod labels, they would still be exported to Prometheus. As I said though, I'd verify that this assumption is true.
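For reference, the mechanism I mean is the labelmap relabelling that Prometheus Kubernetes scrape configs typically carry; a sketch (the job name is illustrative, and whether our config does exactly this is the assumption to verify):

```yaml
scrape_configs:
  - job_name: benchmark-pods   # illustrative name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Copy every Kubernetes pod label onto the scraped metrics, so pod
      # labels like action-id / client-id become metric labels (Prometheus
      # sanitizes the hyphens, e.g. action-id -> action_id).
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
```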

surajkota commented 4 years ago

If you check the AWS docs I posted earlier, the user does not need a unique label and metric name across benchmarks. The user needs to make sure that the combination of metric name and dimensions is unique across the benchmark jobs. So, as a user, I need to keep either the metric name or one of the dimensions unique.

As mentioned previously, I have made task_name a required field.

As of "how to tell different cronjob runs" - there isn't a need for this. I have checked other service metrics on cloudwatch and they do not have a unique identified for each metric value being pushed. It is a time series