MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0

Add endpoint method and path to metrics name. #2850

Closed JDarDagran closed 4 months ago

JDarDagran commented 4 months ago

Problem

Currently, the metrics endpoint gathers information that contains the SQL object name + method name. It would be more informative if additional information about the HTTP endpoints were added as well.

Solution

Introduce labels (DAO name, DAO method, endpoint method, endpoint path).
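
A minimal sketch of one way these labels could be recorded on a single histogram, assuming the SQL timing hook is a jdbi SqlLogger (the Prometheus and jdbi calls are real APIs; the LabeledSqlLogger class, the metric name, and the request-scoped HttpCallContext holder are illustrative, not the actual Marquez implementation):

```java
import java.time.temporal.ChronoUnit;

import org.jdbi.v3.core.extension.ExtensionMethod;
import org.jdbi.v3.core.statement.SqlLogger;
import org.jdbi.v3.core.statement.StatementContext;

import io.prometheus.client.Histogram;

/** Sketch: times DAO calls and labels them with both DAO and HTTP endpoint info. */
public class LabeledSqlLogger implements SqlLogger {

  private static final Histogram SQL_DURATION =
      Histogram.build()
          .name("marquez_sql_duration_seconds")
          .help("Time spent executing SQL, labeled by DAO and HTTP endpoint.")
          .labelNames("object_name", "method_name", "endpoint_method", "endpoint_path")
          .register();

  @Override
  public void logAfterExecution(StatementContext context) {
    ExtensionMethod method = context.getExtensionMethod();
    if (method == null) {
      return; // not a SQL-object (DAO) call
    }
    SQL_DURATION
        .labels(
            method.getType().getName(),   // DAO name, e.g. marquez.db.TagDao
            method.getMethod().getName(), // DAO method, e.g. findAll
            HttpCallContext.method(),     // endpoint method, e.g. GET
            HttpCallContext.path())       // endpoint path, e.g. /api/v1/tags
        .observe(context.getElapsedTime(ChronoUnit.NANOS) / 1e9);
  }

  /** Hypothetical request-scoped holder, e.g. populated by a servlet/Jersey request filter. */
  public static final class HttpCallContext {
    private static final ThreadLocal<String[]> CURRENT =
        ThreadLocal.withInitial(() -> new String[] {"UNKNOWN", "UNKNOWN"});

    public static void set(String method, String path) {
      CURRENT.set(new String[] {method, path});
    }

    public static String method() {
      return CURRENT.get()[0];
    }

    public static String path() {
      return CURRENT.get()[1];
    }
  }
}
```

The logger would be registered once on the Jdbi instance (jdbi.setSqlLogger(new LabeledSqlLogger())), so every DAO call is timed with the same label set.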

Checklist

JDarDagran commented 4 months ago

This breaks backwards compatibility; however, do we even need backwards compatibility in metrics? If so, I can suggest two options:

wslulciuc commented 4 months ago

@JDarDagran I agree that more investment is needed in our metrics, and correlating an API call -> query is nice, but adding the HTTP method feels unrelated to the metric. This metric is for a DB call, not an HTTP call, though the method invocation chain is implicit. We need consistent naming strategies not only for DAO metrics, but for all metrics emitted (e.g., should we add the DAO method name to the API metrics? If we go down this route, metrics would contain seemingly unrelated parts). Personally, I'm not a fan of the auto-generated metric names from Dropwizard.

I would favor first defining a proposal for metrics and their importance, naming strategy (i.e., no longer relying on Dropwizard's auto-generated names), labels, etc. If we were to introduce breaking changes (and I think we have to), we need a precise metric naming strategy and a list of the metrics that are critical for debugging DB perf issues before moving forward.

What about adding the HTTP call as a label? That would at least unblock the PR without introducing a breaking change or a new naming strategy.

JDarDagran commented 4 months ago

Reworked it. Now the v2 endpoint /metrics/v2 contains the following metrics:

marquez_sql_duration_seconds_bucket{object_name="marquez.db.TagDao",method_name="findAll",endpoint_method="GET",endpoint_path="/api/v1/tags",le="0.005",} 0.0
marquez_sql_duration_seconds_bucket{object_name="marquez.db.TagDao",method_name="findAll",endpoint_method="GET",endpoint_path="/api/v1/tags",le="0.01",} 0.0
marquez_sql_duration_seconds_bucket{object_name="marquez.db.TagDao",method_name="findAll",endpoint_method="GET",endpoint_path="/api/v1/tags",le="0.025",} 1.0
...
marquez_sql_duration_seconds_bucket{object_name="marquez.db.TagDao",method_name="findAll",endpoint_method="GET",endpoint_path="/api/v1/tags",le="+Inf",} 1.0
marquez_sql_duration_seconds_count{object_name="marquez.db.TagDao",method_name="findAll",endpoint_method="GET",endpoint_path="/api/v1/tags",} 1.0
marquez_sql_duration_seconds_sum{object_name="marquez.db.TagDao",method_name="findAll",endpoint_method="GET",endpoint_path="/api/v1/tags",} 0.014148125
marquez_sql_duration_seconds_bucket{object_name="marquez.db.NamespaceDao",method_name="findAll",endpoint_method="GET",endpoint_path="/api/v1/namespaces",le="0.005",} 0.0
marquez_sql_duration_seconds_bucket{object_name="marquez.db.NamespaceDao",method_name="findAll",endpoint_method="GET",endpoint_path="/api/v1/namespaces",le="0.01",} 1.0
...
marquez_sql_duration_seconds_bucket{object_name="marquez.db.NamespaceDao",method_name="findAll",endpoint_method="GET",endpoint_path="/api/v1/namespaces",le="+Inf",} 1.0
marquez_sql_duration_seconds_count{object_name="marquez.db.NamespaceDao",method_name="findAll",endpoint_method="GET",endpoint_path="/api/v1/namespaces",} 1.0
marquez_sql_duration_seconds_sum{object_name="marquez.db.NamespaceDao",method_name="findAll",endpoint_method="GET",endpoint_path="/api/v1/namespaces",} 0.005786375
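
For reference, this is the standard Prometheus text exposition format: the _bucket, _count, and _sum series all come from a single labeled histogram. A self-contained sketch with the Prometheus Java client produces output of this shape (the registration and sample values below only mirror the output above; this is not the actual Marquez wiring):

```java
import java.io.StringWriter;

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.common.TextFormat;

public class MetricsDump {
  public static void main(String[] args) throws Exception {
    CollectorRegistry registry = new CollectorRegistry();

    // One histogram; each unique label combination becomes its own time series.
    Histogram sqlDuration =
        Histogram.build()
            .name("marquez_sql_duration_seconds")
            .help("Time spent executing SQL, labeled by DAO and HTTP endpoint.")
            .labelNames("object_name", "method_name", "endpoint_method", "endpoint_path")
            .register(registry);

    // A single observation, matching the TagDao sample above.
    sqlDuration
        .labels("marquez.db.TagDao", "findAll", "GET", "/api/v1/tags")
        .observe(0.014148125);

    // Render the registry in the text exposition format served by a /metrics endpoint,
    // which yields _bucket/_count/_sum lines like the ones above.
    StringWriter out = new StringWriter();
    TextFormat.write004(out, registry.metricFamilySamples());
    System.out.println(out);
  }
}
```
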
wslulciuc commented 4 months ago

I think the metric (below) for our DB calls is a great start! But what the metric is really trying to do is trace the HTTP call to the DB query (i.e., a span). I do find it confusing (as I mentioned before) that labels like endpoint_method are unrelated to the metric itself. Though I do think this is better represented as a span, I have some suggestions for how to best represent the metric:

Given Example Metric:

marquez_sql_duration_seconds_sum{object_name="marquez.db.NamespaceDao",method_name="findAll",endpoint_method="GET",endpoint_path="/api/v1/namespaces",} 0.005786375

Define Metric:

# The time to make the DB call for a given HTTP endpoint.
marquez_db_duration_seconds_by_http_call

Labels:

  • sql_class: the name of the DAO class (replaces object_name)
  • sql_method: the name of the method called by the DAO class (replaces method_name)
  • http_method: the name of the HTTP method for the given call
  • http_path: the name of the HTTP path for the given call
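
As a rough sketch, the renamed metric and labels might be declared with the Prometheus Java client like this (the class and field names are illustrative, not actual Marquez code):

```java
import io.prometheus.client.Histogram;

public final class DbCallMetrics {

  /** The time to make the DB call for a given HTTP endpoint. */
  public static final Histogram DB_DURATION_BY_HTTP_CALL =
      Histogram.build()
          .name("marquez_db_duration_seconds_by_http_call")
          .help("The time to make the DB call for a given HTTP endpoint.")
          .labelNames("sql_class", "sql_method", "http_method", "http_path")
          .register();

  private DbCallMetrics() {}
}
```

An observation matching the example metric above would then look like DB_DURATION_BY_HTTP_CALL.labels("marquez.db.NamespaceDao", "findAll", "GET", "/api/v1/namespaces").observe(0.005786375).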

JDarDagran commented 4 months ago


@wslulciuc I followed your advice and renamed the metric and labels as per your suggestion. If you or any committer could run the CI tests (probably with git-push-fork-to-upstream-branch; I don't know why it doesn't trigger CircleCI by default) and review, I'd really appreciate it.