surajkota opened this issue 4 years ago
> one cannot retrieve statistics by specifying only a subset of dimensions or use a wildcard for any dimension
I was not aware of this limitation of CloudWatch :( It is indeed a problem; as you say, a workaround will be needed in order to differentiate between different runs of a cron job.
We haven't designed anything for this task, since we hadn't realized the problem existed.
Jose suggested the documentation available in the CloudWatch exporter and metrics pusher README.
OK, to mitigate the issue I suggest the following:
Let me know if this has any impact I did not list.
Questions:
The problem with dropping action-id is that we lose the only unique identifier we have for benchmark runs, making it difficult, or in some cases impossible, to know which run a metric corresponds to.
I think forcing users to specify unique label/metric-name combinations defeats the purpose of labels (which is to categorize and classify runs so they can be grouped afterwards). What do you think of making the task_name field required and enforcing a particular syntax on it? As a customer, it would seem more logical to me that task_name must be unique.
The question of how to tell different cronjob runs apart without using action-id is still open though.
As for your question, I'd double-check this, but I believe pod labels get exported as labels for metrics automatically by Prometheus. Since action-id and client-id are set as pod labels, they would still be exported to Prometheus. As I said, though, I'd verify that this assumption is true.
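For reference, Prometheus itself does not attach pod labels automatically; it is typically a relabel rule in the scrape config that maps them onto metrics. A sketch of that rule, assuming the standard Kubernetes service discovery (the job name is hypothetical; note Prometheus sanitizes hyphens in label keys, so `action-id` surfaces as `action_id`):

```yaml
# Hypothetical scrape job: the labelmap action copies every Kubernetes pod
# label (e.g. action-id, client-id) onto the scraped metrics as Prometheus labels.
scrape_configs:
  - job_name: benchmark-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
```

If the cluster's Prometheus already ships with a rule like this, the assumption above holds; otherwise the labels would need to be added explicitly.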
If you check the AWS docs I posted earlier, the user does not need a unique label and metric name across benchmarks. The user needs to make sure that the combination of metric name and dimensions is unique across benchmark jobs. So, as a user, I need to keep either the metric name or one of the dimensions unique.
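To illustrate the uniqueness rule, here is a minimal sketch (hypothetical metric and job names): two jobs can share a metric name, and they still land in distinct time series as long as their dimension sets differ.

```python
# Sketch of CloudWatch series identity: a series is identified by the
# metric name plus its full dimension set, so a shared metric name is
# fine when the dimensions differ. Names below are hypothetical.

def series_id(metric_name, dimensions):
    return (metric_name, frozenset(dimensions.items()))

jobs = [
    ("benchmark_latency", {"task_name": "job-a"}),
    ("benchmark_latency", {"task_name": "job-b"}),  # same name, unique dimension
]

ids = {series_id(name, dims) for name, dims in jobs}
assert len(ids) == len(jobs)  # no collision: each job gets its own series
```

This is why making `task_name` required (and unique) is enough to keep benchmark jobs from mixing, without needing unique metric names.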
As mentioned previously, I have made task_name a required field.
As for "how to tell different cronjob runs apart": there isn't a need for this. I have checked other service metrics on CloudWatch and they do not have a unique identifier for each metric value being pushed; it is a time series.
Currently we publish client-id, action-id, and a set of custom-defined labels as dimensions for a metric. According to the CloudWatch dimension documentation:
According to the example in the doc above, one cannot retrieve statistics by specifying only a subset of dimensions or use a wildcard for any dimension. This means customers cannot plot or create alarms against a metric if they set up cron jobs, since the action-id is different every time.
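A minimal sketch of why this bites us: CloudWatch treats each unique (metric name, full dimension set) pair as a separate time series, and a statistics query only matches an exact dimension set. The metric name and dimension values below are hypothetical.

```python
# Simulates CloudWatch's exact-match dimension lookup (hypothetical data).
# Each unique (metric name, full dimension set) pair is a separate series.

def series_key(metric_name, dimensions):
    """CloudWatch-style identity: name plus the *entire* dimension set."""
    return (metric_name, frozenset(dimensions.items()))

# Two cron runs of the same benchmark differ only in action-id:
store = {
    series_key("benchmark_duration", {"client-id": "c1", "action-id": "run-1"}): [42.0],
    series_key("benchmark_duration", {"client-id": "c1", "action-id": "run-2"}): [40.5],
}

def get_statistics(metric_name, dimensions):
    """Returns data only on an exact dimension-set match, like GetMetricStatistics."""
    return store.get(series_key(metric_name, dimensions), [])

# Exact match works:
print(get_statistics("benchmark_duration",
                     {"client-id": "c1", "action-id": "run-1"}))  # [42.0]

# Querying by a subset of dimensions (no action-id) returns nothing, so a
# dashboard or alarm keyed on client-id alone cannot see either run:
print(get_statistics("benchmark_duration", {"client-id": "c1"}))  # []
```

Because every cron run mints a fresh action-id, every run becomes its own series that no fixed dashboard or alarm query can address.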
Do we have a design doc for the CloudWatch exporter? Is there a workaround for this issue?
Another issue to think about as part of this: the current assumption is that the only thing keeping metrics from being mixed up between multiple parallel runs with the same TOML file is the action-id. If we remove client-id and action-id from the dimensions, there is no way to differentiate which run a metric was generated from.