apache / incubator-heron

Apache Heron (Incubating) is a realtime, distributed, fault-tolerant stream processing engine from Twitter
https://heron.apache.org/
Apache License 2.0
3.65k stars 597 forks source link

Multiple metrics providers may return different ComponentMetrics for the same query #2389

Open huijunw opened 6 years ago

huijunw commented 6 years ago

There are metrics providers in dhalion: tracker metrics provider, tmaster/metricscache metrics provdier. They return results with different structures.

For example to query component : __stmgr__ metric mane : __time_spent_back_pressure_by_compid/container_1_backpressurebolt_1

The tracker api returns something like this:

{
  status: "success",
  executiontime: 0.0036330223083496094,
  message: "",
  version: "0.14.11",
  result: {
    metrics: {
      __time_spent_back_pressure_by_compid/container_1_backpressurebolt_1: {
        stmgr-1: "1718337.000000"
      }
    },
    interval: 1507158089,
    component: "__stmgr__"
  }
}

The tmaster/metricscache returns something like this:

status {
  status: OK
}
metric {
  instance_id: "stmgr-1"
  metric {
    name: "__time_spent_back_pressure_by_compid/container_1_backpressurebolt_1"
    interval_values {
      value: "60001.000000"
      interval {
        start: 1507158311
        end: 1507158371
      }
    }
}
metric {
  instance_id: "stmgr-2"
}
metric {
  instance_id: "stmgr-3"
}
metric {
  instance_id: "stmgr-4"
}
metric {
  instance_id: "stmgr-5"
}
interval: 1507158339

tracker result contains 1 instance in the component, while tmaster/metricscache result contains 5 instances in the component.

https://github.com/twitter/heron/blob/master/heron/healthmgr/src/java/com/twitter/heron/healthmgr/sensors/BackPressureSensor.java#L90, the sensor assumes tracker result format, which lost metrics working with tmaster/metricscache.

Multiple metrics providers may return different ComponentMetrics for the same query.

  1. Either metrics providers agree on one same return ComponentMetrics structure
  2. Or sensors are able to handle different ComponentMetrics structure

For the backpressure metrics case, we may have a fix to walk around, but I think HealthMgr may have a general solution to handle various (metrics + metricsProvider + sensor) configurations.

avflor commented 6 years ago

Why do they have different APIs? I thought the metricscache would have the same api as metricstracker.

maosongfu commented 6 years ago

I am also wondering the APIs are different?

huijunwu commented 6 years ago

My previous description included the two results. tracker result has 1 instance. tmaster result has 5 instances.

Besides the previous 2 directions from healthmgr perspective, another direction is to unify tmaster/tracker metric query interface, so that there should be only one metric query interface. Then only one metric query client is necessary and this query client can be re-used

ashvina commented 6 years ago

@huijunw This one is tricky because the metrics cache is returning empty structures.

The metric provider contract should be the same for all metrics providers. I.e. for easy development and maintenance of sensors, all metrics providers must return the same ComponentMetrics structure. In this case (BackPressureSensor.java#L90), the sensor is querying stmgr component and backpressure metric. Accordingly it expects the metrics provider to return one metric in the provided component. However, the metrics cache is returning empty structures. This seems like a common issue and could be handled by MetricsCacheMetricsProvider. Instead of adding logic in all sensors to filter the empty structures, I would suggest fixing the MetricsCacheMetricsProvider. You are most familiar with the MetricsCacheMetricsProvider code. What do you think?

huijunw commented 6 years ago

A metric unique location is three item tuple <component, instance, metric-name>, but the Dhalion.MetricsProvider interface uses <component, metric-name> to locate metrics. Tmaster/metricscache returns a list of metrics including all instances, which is reasonable.

Metricscacheprovider not only provides backpressure metrics but also may provide metrics for other sensors. It is a general metrics provider and is supposed to return the exact query result rather than optimize for backpressure sensor.

@ashvina

ashvina commented 6 years ago

@huijunw Current Dhalion.MetricsProvider interface contract is to return value of a given metricName for all instances of a component. Hence the interface definition. It is not back pressure specific.

In this case the metric name includes the instance name, container_1_backpressurebolt_1. Hence the metric provider should return just one value specific to the component, instance, metric-name. However the implementation of metrics cache provider is returning multiple empty values too.

As earlier, i think the fix should go in the metrics provider.

huijunwu commented 6 years ago

Since Dhalion.MetricsProvider is not backpressure specific, the MetricscacheMetricsprovider should not be backpressure specific.

I think the backpressure Sensor is backpressure specific, where the bakcpressure specific fix should go @ashvina

Besides, I do not think 'to guess instance-id from metric-name' is proper

ashvina commented 6 years ago

@huijunw The inclusion of instance id in metric name is not enforced by dhalion. I think that discussion is outside the scope for this PR.

Can you please explain why is the metrics-cache returning empty structures when the metric is not available? for e.g.

metric {
  instance_id: "stmgr-2"
}
ashvina commented 6 years ago

There is one check in MetricsCacheMetricsProvider to avoid adding empty entries: https://github.com/twitter/heron/blob/master/heron/healthmgr/src/java/com/twitter/heron/healthmgr/sensors/MetricsCacheMetricsProvider.java#L122

I think a similar check is required here too: https://github.com/twitter/heron/blob/master/heron/healthmgr/src/java/com/twitter/heron/healthmgr/sensors/MetricsCacheMetricsProvider.java#L127. That should fix the problem and avoid adding filters in all the sensors.

huijunwu commented 6 years ago

when you query by <component, metric-name> to tmaster, it means you want all instances. The result should include all instances. If the metric-value is empty, the result explicitly says the value is empty rather than hide the value implying the empty value. This Tmaster behavior exists for long time even before I started to work on this project ..

This fix is not for all the sensors. It is for backpressure sensor because backpressure sensor assumes query by <component, metrics-name> returns one instance instead of all instances

ashvina commented 6 years ago

Any topology component specific metric provided by stream manager will need this change. For e.g. BackPressureSensor and BufferSizeSensor.

huijunwu commented 6 years ago

If the sensor just wants to query only one instance, it should pass instance-id in the query request. Thus I would suggest adding <component, instance, metrics-name> query interface for Dhalion.MetricsPRovider besides <component, metric-name> interface, which enables its query capability.

ashvina commented 6 years ago

There are two inconsistencies here.

  1. The topology component metrics related to back pressure and buffer size are owned by stream manager.

  2. The tracker response is different from the tmaster response.

As a result special handling is required in some sensors. There are atleast 3 different ways to solve this :-). And it is not worth debating on which one is the best. For now, i will try to implement what @huijunw suggests. It will impact all the sensors that consume component metrics from the stream manager.