frequenz-floss / frequenz-sdk-python

Frequenz Python Software Development Kit (SDK)
https://frequenz-floss.github.io/frequenz-sdk-python/
MIT License

Provide information about the quality of a resampled metric #1021

Open llucax opened 1 month ago

llucax commented 1 month ago

What's needed?

We need a way to inform users about the quality of a resampled metric.

For example, if a sample was calculated using only one very old value, the data quality should be low, while if it was calculated from many up-to-date samples, the quality should be high.

This way actors could make more informed decisions on how to use that data.
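For illustration, a hypothetical sketch of what such quality metadata could look like (neither the class nor its fields exist in the SDK, and the threshold is made up):

    from dataclasses import dataclass
    from datetime import timedelta


    @dataclass(frozen=True)
    class ResamplingQuality:
        """Hypothetical metadata describing how trustworthy a resampled value is."""

        samples_used: int
        """Number of input samples that contributed to the resampled value."""

        newest_sample_age: timedelta
        """Age of the most recent contributing input sample at resampling time."""

        @property
        def is_degraded(self) -> bool:
            # One possible heuristic: a single, old input sample means low quality.
            return self.samples_used <= 1 or self.newest_sample_age > timedelta(seconds=3)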

Proposed solution

Use cases

No response

Alternatives and workarounds

No response

Additional context

No response

cwasicki commented 1 month ago

In my opinion this is interesting for formulas, e.g. to know how many `None` values were ignored in the calculation.
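As a minimal sketch of that idea (not the SDK's formula engine), a formula evaluation could report how many None inputs it skipped:

    def sum_ignoring_none(values: list[float | None]) -> tuple[float, int]:
        """Sum the non-None values and report how many None inputs were ignored."""
        ignored = sum(1 for v in values if v is None)
        total = sum(v for v in values if v is not None)
        return total, ignored


    # Two of three components reported a value; one None was ignored.
    assert sum_ignoring_none([1.0, None, 2.0]) == (3.0, 1)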

llucax commented 1 month ago

@frequenz-floss/python-sdk-team unless someone steps in and shows a use case for this, I think I will close this.

shsms commented 1 month ago

We have often seen lower data rates from components without warning, due to site-specific issues; I have seen this happen many times, including last week.

Apps need to be able to identify degraded data quality so that they know to be more conservative in their goals. Without it, they will assume that the latest values have a higher accuracy and will overshoot.

llucax commented 1 month ago

But if we assume a small resampling period, which is what we want to aim for (1s), then you know that the data rate is low or the quality of the data is bad because the resampler will start producing None, right? I agree we need to know when data is degraded; what I'm not sure about is whether the resampler is the best place to do so. I think the resampler should only cover very short outages, stuff that should be transparent to app developers. Once data is bad enough that you care, it shouldn't be the resampler's job to fix it in the first place, right?
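A minimal sketch of what that looks like from an app's side, assuming a stream of samples whose value is None when the resampler had no recent input data (the sample protocol here is a stand-in, not the SDK's actual types):

    from collections.abc import AsyncIterator
    from typing import Protocol


    class SampleLike(Protocol):
        """Stand-in for a resampled sample; `value` is None on missing data."""

        value: float | None


    async def run(samples: AsyncIterator[SampleLike]) -> None:
        async for sample in samples:
            if sample.value is None:
                # The resampler signals degraded/missing data with None, so
                # the actor can switch to a conservative strategy here.
                print("no fresh data, being conservative")
            else:
                print(f"fresh value: {sample.value}")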

llucax commented 1 month ago

So one suggestion was to use the LatestValueCache, extending it to expire the last value and to store that value's timestamp.
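A minimal sketch of that idea; the class name and expiry logic here are assumptions, not the SDK's actual LatestValueCache API:

    import time
    from typing import Generic, TypeVar

    T = TypeVar("T")


    class ExpiringLatestValueCache(Generic[T]):
        """Cache the latest value with its timestamp, expiring it after `max_age` seconds."""

        def __init__(self, max_age: float) -> None:
            self._max_age = max_age
            self._value: T | None = None
            self._timestamp: float | None = None

        def update(self, value: T) -> None:
            self._value = value
            self._timestamp = time.monotonic()

        @property
        def age(self) -> float | None:
            """Age of the cached value in seconds, or None if nothing was cached."""
            if self._timestamp is None:
                return None
            return time.monotonic() - self._timestamp

        def get(self) -> T | None:
            """Return the cached value, or None if it expired or was never set."""
            age = self.age
            if age is None or age > self._max_age:
                return None
            return self._value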

shsms commented 1 month ago

then you know that the data rate is low or the quality of the data is bad because the resampler will start producing None, right?

I think the resampler shouldn't produce None and expect manual intervention, like increasing the max data age to 5 sampling periods. Like Christoph said, that is too disruptive for big locations. The resampler should adjust the max data age if it determines that the data rate is low, such that the buffer will still contain the latest value. But that's a separate issue, I guess.

shsms commented 1 month ago

what I'm not sure about is whether the resampler is the best place to do so.

I think it is, because like you said, it tracks source info already and just has to send out one value at startup, and later, whenever the source info is recalculated.

llucax commented 1 month ago

I think the resampler shouldn't produce None and expect manual intervention, like increasing the max data age to 5 sampling periods.

Let's see if we are talking about the same thing.

When? If data is not coming, then yes, it should produce None; there is no data, right? This might happen temporarily or always. If a site is always producing slow data rates, then something is seriously wrong with that location, and IMHO in that case, yes, we should fix the location or change the period manually. At least from what I understood from @thomas-nicolai-frequenz, the resampling period can't be changed lightly, or the machine learning part can break.

If it happens sporadically, we should be able to recover when the data comes with the normal rate.

Like Christoph said, that is too disruptive for big locations. The resampler should adjust the max data age if it determines that the data rate is low, such that the buffer will still contain the latest value. But that's a separate issue, I guess.

What do you mean by "adjust the max data age"? Do you mean it should adjust the max_data_age_in_periods so that we get at least one sample for the low-rate input? If so, I don't think we should do that; it is effectively changing the resampling function dynamically depending on the input data rate.

what I'm not sure about is whether the resampler is the best place to do so. I think it is, because like you said, it tracks source info already and just has to send out one value at startup, and later, whenever the source info is recalculated.

Yeah, but it is done for different reasons. Again, the global resampler is just a way to homogenize the input data, assuming the data that comes... comes, and comes at a reasonable rate. If we have no data, the resampler should return None; if you still need to work with an old value, you should save the latest value and its age yourself.

So this issue is only about knowing whether the data for the last 3 seconds (according to the current defaults, a resampling period of 1s and max_data_age_in_periods of 3) is good or bad, and my question still is: do we even need this kind of granularity?

llucax commented 1 month ago

OK, looking at the code, I have some interesting findings that I forgot about:

    max_data_age_in_periods: float = 3.0
    """The maximum age a sample can have to be considered *relevant* for resampling.

    Expressed in number of periods, where period is the `resampling_period`
    if we are downsampling (resampling period bigger than the input period) or
    the *input sampling period* if we are upsampling (input period bigger than
    the resampling period).

    It must be bigger than 1.0.

    Example:
        If `resampling_period` is 3 seconds, the input sampling period is
        1 and `max_data_age_in_periods` is 2, then data older than 3*2
        = 6 seconds will be discarded when creating a new sample and never
        passed to the resampling function.

        If `resampling_period` is 3 seconds, the input sampling period is
        5 and `max_data_age_in_periods` is 2, then data older than 5*2
        = 10 seconds will be discarded when creating a new sample and never
        passed to the resampling function.
    """

So if some location is sending samples every 5 seconds (consistently and from the start), the resampler should be able to cope with it without issues; data from the last 15 seconds (5s x 3 periods) should be used to calculate the current sample. If this didn't happen, maybe we have a bug in the resampler.
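The cutoff described in the docstring boils down to one expression; a small sketch reproducing both examples above (the function name is mine, not the SDK's):

    def max_sample_age(
        resampling_period: float,
        input_sampling_period: float,
        max_data_age_in_periods: float = 3.0,
    ) -> float:
        """Age in seconds beyond which input samples are discarded.

        The relevant period is the larger of the two: `resampling_period`
        when downsampling, `input_sampling_period` when upsampling.
        """
        return max(resampling_period, input_sampling_period) * max_data_age_in_periods


    assert max_sample_age(3.0, 1.0, 2.0) == 6.0  # downsampling example above
    assert max_sample_age(3.0, 5.0, 2.0) == 10.0  # upsampling example above
    assert max_sample_age(1.0, 5.0) == 15.0  # the 5s source with the 3.0 default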

cwasicki commented 1 month ago

it is already dynamic (as it depends on the input sampling period)

Are you sure this is done if the input data doesn't have a fixed sampling period? IIUC it can also be None, which I assumed would be used if we use the raw data as input.

llucax commented 1 month ago

I didn't get what you mean by "the input data is not on a fixed sampling period".

cwasicki commented 1 month ago

If we resample irregular sample periods, e.g. if it's done on the raw data from the components, I am not sure we can rely on that.

llucax commented 1 month ago

So, if we are downsampling, the data considered for the current window is always a fixed time span (max_data_age_in_periods * resampling_period). If we are upsampling, then input samples up to max_data_age_in_periods * input_sampling_period old are considered for the current window, where input_sampling_period is dynamic (it is updated for each received sample as total_time_receiving / total_samples_received). So if the input source rate is stable, it should be more or less constant, but if we have gaps often, then the input_sampling_period will increase, as it is an average.

But also in the downsampling case, if a source is flaky at the beginning, we might consider that we are actually upsampling it, because the observed data rate is too low. Once it recovers, it should be switched to downsampling.

I'm not saying this is what we want, I'm just saying this is what the resampler is doing right now.
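For illustration, a small sketch of the dynamic input period estimate described above (this mirrors the description in this thread, not the actual resampler code):

    from datetime import datetime, timedelta


    class InputPeriodEstimator:
        """Estimate a source's sampling period as a running average."""

        def __init__(self) -> None:
            self._first: datetime | None = None
            self._last: datetime | None = None
            self._count = 0

        def on_sample(self, timestamp: datetime) -> None:
            if self._first is None:
                self._first = timestamp
            self._last = timestamp
            self._count += 1

        @property
        def input_sampling_period(self) -> timedelta | None:
            """Average interval between received samples.

            Gaps inflate the average, which is why a flaky source can drift
            into the upsampling classification until it recovers.
            """
            if self._first is None or self._last is None or self._count < 2:
                return None
            return (self._last - self._first) / (self._count - 1)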