Open huijunw opened 6 years ago
Why do they have different APIs? I thought the metricscache would have the same api as metricstracker.
I am also wondering the APIs are different?
My previous description included the two results. tracker result has 1 instance. tmaster result has 5 instances.
Besides the previous 2 directions from healthmgr perspective, another direction is to unify tmaster/tracker metric query interface, so that there should be only one metric query interface. Then only one metric query client is necessary and this query client can be re-used
@huijunw This one is tricky because the metrics cache
is returning empty structures.
The metric provider contract should be the same for all metrics providers. I.e. for easy development and maintenance of sensors, all metrics providers must return the same ComponentMetrics
structure. In this case (BackPressureSensor.java#L90
), the sensor is querying stmgr
component and backpressure
metric. Accordingly it expects the metrics provider to return one metric in the provided component. However, the metrics cache
is returning empty structures. This seems like a common issue and could be handled by MetricsCacheMetricsProvider
. Instead of adding logic in all sensors to filter the empty structures, I would suggest fixing the MetricsCacheMetricsProvider
. You are most familiar with the MetricsCacheMetricsProvider
code. What do you think?
A metric unique location is three item tuple <component, instance, metric-name>, but the Dhalion.MetricsProvider interface uses <component, metric-name> to locate metrics. Tmaster/metricscache returns a list of metrics including all instances, which is reasonable.
Metricscacheprovider not only provides backpressure metrics but also may provide metrics for other sensors. It is a general metrics provider and is supposed to return the exact query result rather than optimize for backpressure sensor.
@ashvina
@huijunw Current Dhalion.MetricsProvider
interface contract is to return value of a given metricName
for all instances of a component
. Hence the interface definition. It is not back pressure specific.
In this case the metric name includes the instance name, container_1_backpressurebolt_1
. Hence the metric provider should return just one value specific to the component, instance, metric-name
. However the implementation of metrics cache provider is returning multiple empty values too.
As earlier, i think the fix should go in the metrics provider.
Since Dhalion.MetricsProvider is not backpressure specific, the MetricscacheMetricsprovider should not be backpressure specific.
I think the backpressure Sensor is backpressure specific, where the bakcpressure specific fix should go @ashvina
Besides, I do not think 'to guess instance-id from metric-name' is proper
@huijunw The inclusion of instance id in metric name is not enforced by dhalion. I think that discussion is outside the scope for this PR.
Can you please explain why is the metrics-cache returning empty structures when the metric is not available? for e.g.
metric {
instance_id: "stmgr-2"
}
There is one check in MetricsCacheMetricsProvider
to avoid adding empty entries: https://github.com/twitter/heron/blob/master/heron/healthmgr/src/java/com/twitter/heron/healthmgr/sensors/MetricsCacheMetricsProvider.java#L122
I think a similar check is required here too: https://github.com/twitter/heron/blob/master/heron/healthmgr/src/java/com/twitter/heron/healthmgr/sensors/MetricsCacheMetricsProvider.java#L127. That should fix the problem and avoid adding filters in all the sensors.
when you query by <component, metric-name> to tmaster, it means you want all instances. The result should include all instances. If the metric-value is empty, the result explicitly says the value is empty rather than hide the value implying the empty value. This Tmaster behavior exists for long time even before I started to work on this project ..
This fix is not for all the sensors. It is for backpressure sensor because backpressure sensor assumes query by <component, metrics-name> returns one instance instead of all instances
Any topology component specific metric provided by stream manager will need this change. For e.g. BackPressureSensor and BufferSizeSensor.
If the sensor just wants to query only one instance, it should pass instance-id in the query request. Thus I would suggest adding <component, instance, metrics-name> query interface for Dhalion.MetricsPRovider besides <component, metric-name> interface, which enables its query capability.
There are two inconsistencies here.
The topology component metrics related to back pressure and buffer size are owned by stream manager.
The tracker response is different from the tmaster response.
As a result special handling is required in some sensors. There are atleast 3 different ways to solve this :-). And it is not worth debating on which one is the best. For now, i will try to implement what @huijunw suggests. It will impact all the sensors that consume component metrics from the stream manager.
There are metrics providers in dhalion: tracker metrics provider, tmaster/metricscache metrics provdier. They return results with different structures.
For example to query component :
__stmgr__
metric mane :__time_spent_back_pressure_by_compid/container_1_backpressurebolt_1
The tracker api returns something like this:
The tmaster/metricscache returns something like this:
tracker result contains 1 instance in the component, while tmaster/metricscache result contains 5 instances in the component.
https://github.com/twitter/heron/blob/master/heron/healthmgr/src/java/com/twitter/heron/healthmgr/sensors/BackPressureSensor.java#L90, the sensor assumes tracker result format, which lost metrics working with tmaster/metricscache.
Multiple metrics providers may return different ComponentMetrics for the same query.
For the backpressure metrics case, we may have a fix to walk around, but I think HealthMgr may have a general solution to handle various (metrics + metricsProvider + sensor) configurations.