concourse / time-resource

a resource for triggering on an interval
Apache License 2.0

Added predictable generation of versions on an interval #51

Closed owenfarrell closed 4 years ago

owenfarrell commented 4 years ago

Just to provide some context as to how I'm thinking about this, here's a draft of what I'm thinking. This certainly isn't ready for consumption, but it's at least at a point where the existing unit tests pass.

Fixes #45

xtremerui commented 4 years ago

Hi thx for the PR. What is the reason for time.Truncate() in some tests?

owenfarrell commented 4 years ago

@xtremerui Great question!

That actually traces back to one of the assumptions I made in the associated issue (A2).

Tests that define a start now need to take into account that "now" (in the test) isn't going to land on an exact interval relative to the start.
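
To illustrate (a minimal sketch, assuming a 1-minute interval; not the tests' actual code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	interval := time.Minute // assumed interval, for illustration only

	// Truncate rounds the wall clock down to a multiple of the interval,
	// so a test's expected version lands exactly on an interval boundary
	// rather than on whatever instant the test happens to execute at.
	now := time.Now()
	expected := now.Truncate(interval)

	fmt.Println(now, "->", expected)
}
```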

owenfarrell commented 4 years ago

Okay, I think this is ready for other sets of eyes to read in detail.

I moved the logic for calculating the most recent, valid version and the list of valid versions (since a specified time) into TimeLord so that out could leverage the same logic and ensure that versions generated by put are detectable by check. And if/when a new at attribute is added to the source configuration, it clarifies exactly where/how that logic needs to be implemented.

After the refactor, there were ~~10~~ 11 existing check_test contexts that failed, and they all boiled down to one of ~~3~~ 4 reasons:

Check
    when executed
        when a time range is specified
            when we are in the specified time range
                when a version is given
                    when the resource was triggered yesterday near the end of the time frame
                        [It] outputs a version containing the current time and supplied version 
  1. In this scenario, the existing test configuration sets prev to the same value as stop on the previous day. But the logic implemented in TimeLord.Check enforces that versions are before stop. Since the updated implementation now returns versions that are strictly valid, the invalid version provided via prev is no longer returned.
Check
    when executed
        when a time range is specified
            when we are in the specified time range
                when an interval is specified
                    when no version is given
                        [It] outputs a version containing the current time 
  1. Per my assumption in #45, the current time does not represent a valid version.

A2: When defining an interval and a start, new versions will be generated based on the start time

Check
    when executed
        when a time range is specified
            when we are in the specified time range
                when an interval is specified
                    when a version is given
                        when the interval has not elapsed
                            [It] outputs only the supplied version 
  1. This is a similar root cause to (2). In this scenario, the "supplied version" is calculated relative to the current time (not the start time). Since the "supplied version" is not a valid version, it is not returned by check.
Check
    when executed
        when a time range is specified
            when we are in the specified time range
                when an interval is specified
                    when a version is given
                        when the interval has elapsed
                            [It] outputs a version containing the current time and supplied version 
  1. Same root cause as (3).
Check
    when executed
        when a time range is specified
            when we are in the specified time range
                when an interval is specified
                    when a version is given with its time N intervals ago
                        [It] outputs a version containing the current time and supplied version 
  1. Same root cause as (3)
Check
    when executed
        when a time range is specified with a location configured
            when we are in the specified time range
                when a version is given
                    when the resource was triggered yesterday near the end of the time frame
                        [It] outputs a version containing the current time and supplied version 
  1. Same root cause as (1)
Check
    when executed
        when a time range is specified with a location configured
            when we are in the specified time range
                when an interval is specified
                    when no version is given
                        [It] outputs a version containing the current time 
  1. Same root cause as (2)
Check
    when executed
        when a time range is specified with a location configured
            when we are in the specified time range
                when an interval is specified
                    when a version is given with its time within the interval
                        [It] outputs the given version 
  1. Same root cause as (3)
Check
    when executed
        when a time range is specified with a location configured
            when we are in the specified time range
                when an interval is specified
                    when a version is given with its time one interval ago
                        [It] outputs a version containing the current time and supplied version 
  1. Same root cause as (3)
Check
    when executed
        when a time range is specified with a location configured
            when we are in the specified time range
                when an interval is specified
                    when a version is given with its time N intervals ago
                        [It] outputs a version containing the current time and supplied version 
  1. Same root cause as (3)
Check
    when executed
        when an interval is specified
            when a version is given
                with its time N intervals ago
                    [It] outputs a version containing the current time and supplied version
  1. This behavior changes per my assumption in #45.

A1: When provided with a previous version, check should return a list of all valid versions since the specified version

vito commented 4 years ago

Before I forget: as part of doing this, I think we should also ensure that if interval is used, the resource returns a unique interval for each pipeline. This can be done by hashing the metadata provided in the environment and using that to offset the interval. (Specifically, just the pipeline name.)

This step is important so that all pipelines configured with interval: 10m don't trigger their jobs at the exact same time. The long-term plan is to convert this time resource into a time var source prototype, which would be even more fine-grained (per-job instead of per-pipeline). This approach is documented in RFC #27 which I hope to merge soon.

owenfarrell commented 4 years ago

@vito Well that's a sneaky requirement to inject into this PR. 😉

Just to elaborate on your ask, I'm thinking that the team + pipeline name would generate the seconds + milliseconds associated with versions that are generated by check. Is that in line with your thinking?

Just to confirm, the change you're proposing would only apply when interval is defined and start/stop is not defined. Do I have that right? If somebody really wants multiple pipelines to use the same versions, they could always set { start: 00:00, stop: 11:59, interval: 2m } to generate an identical, predictable set of versions.

vito commented 4 years ago

@owenfarrell Haha, yeah sorry for the sudden scope-creep. I realized too late that if we make the times predictable without keeping them unique per-pipeline this would likely result in a regression as many jobs would suddenly start triggering all at the same time.

My thinking is more in line with the H/N syntax used by Jenkins. A configuration of interval: 10m would be akin to H/10, meaning "run every 10 minutes starting at any time between :00 and :09". This start time could be derived from a numeric hash of the pipeline name/team name/etc., modulo the interval. Or something like that. Numbers are hard.
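
A minimal sketch of that hash-then-modulo idea (offsetFor and the identifier format are assumptions, not anything in this PR):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// offsetFor derives a stable start offset inside the interval by hashing a
// pipeline identifier, so interval: 10m behaves like Jenkins' H/10: every
// pipeline runs every 10 minutes, but each starts at its own point in the
// window. (Hypothetical helper; the real derivation may differ.)
func offsetFor(id string, interval time.Duration) time.Duration {
	h := fnv.New64a()
	h.Write([]byte(id))
	return time.Duration(h.Sum64() % uint64(interval))
}

func main() {
	fmt.Println(offsetFor("main/my-pipeline", 10*time.Minute))
	fmt.Println(offsetFor("main/another-pipeline", 10*time.Minute))
}
```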

I haven't given any thought to how it should behave with start and stop. I believe start and stop are typically used to enforce boundaries on an otherwise unique interval (e.g. "I want to run stress tests every 5 minutes but only in off hours") - so it seems like it should apply there too. There may be some risk in not being able to fit as many versions into the time span, but I think that's an acceptable tradeoff.

owenfarrell commented 4 years ago

@vito - Your H/N analogy makes complete sense, but I would think that kind of logic would/could/should be implemented in how Concourse schedules check invocations and not in the check implementation itself.

One of the things that I tried to keep in mind when implementing this is divorcing the idea of triggering a pipeline from the versions generated by a resource. While time is (almost certainly) used as a trigger in an overwhelming number of situations, there is always a gap between the start time of a pipeline and the version associated with that trigger.

Predictability was the ask of the OP who raised the issue, and what I've proposed here makes the versions predictable. Honestly, I don't think it does anything around when those versions are generated - for better or worse.

vito commented 4 years ago

@owenfarrell The trouble is it's impossible to divorce the version history from scheduling, as version history is precisely the thing that drives scheduling - not the timing of the checks themselves.

There's a kind of deep history to this problem and it ties all the way back to concourse/concourse#2386 which introduced the idea of sharing version history across the cluster for equivalent resource definitions. After implementing it we realized that sharing the history for the time resource would result in the thundering herd problem I mentioned. We worked around that by allowing resource types to tell Concourse that they have a unique version history.

Later, we introduced LIDAR (written largely by the OP of #45), which implemented a new queue-based method of performing checks. This introduced a new problem: because the time resource just returns the current time, if multiple checks were queued with the same "from" version (which would happen if someone manually triggered a job downstream of the resource), it would result in two different versions being saved since they both just return the current timestamp. I believe this was the original motivation for #45 - so that if those two checks ran, they would both produce the same version and not result in duplicates with subtly different timestamps.

All in all, I think the time resource is a square peg fitting into a round hole, and in the long term I think concourse/rfcs#27 is much closer to what people are actually trying to accomplish with timed triggers. Once it's done we can do away with the unique_version_history hack and even make the triggering per-job instead of per-pipeline.

Does this help? 😅

owenfarrell commented 4 years ago

So I thought about this, slept on it, and thought about it a bit more. But I know I'm still playing catchup given all the context that you shared.

(from #45)

However, if the check gets delayed then the gap between versions is unpredictable.

emit versions at set times based on the interval specified... This way regardless of when the check actually runs, we end up with a deterministic set of versions.

When I read the ask, there's no mention of concurrent/queued checks. The concern is more about checks getting missed. The scenario I always had in my head (to emulate the ask) would be when check_every is set to be greater than the interval.

The trouble is it's impossible to divorce the version history from scheduling, as version history is precisely the thing that drives scheduling - not the timing of the checks themselves.

There's one part of this that I haven't connected in my mental model. Does Concourse queue the next execution of check for a time resource based on previous version(s) emitted? My assumption was "no", but your comment about not divorcing scheduling and emitted versions gives me pause.

Could you take another look at assumption A2 that I wrote in the issue - and specifically the example that I provided? If the answer is "yes, Concourse queues check based on versions"... then that assumption definitely goes out the window.

vito commented 4 years ago

Hm, yeah it doesn't look like the LIDAR motivations were specifically called out in #45, but it definitely came from new behavior we found during its development. The checking delay you quoted was actually the delay between a check being queued and actually running. The original issue doesn't mention duplicate versions, but we may have just not identified that behavior yet. But as https://github.com/concourse/time-resource/issues/45#issuecomment-635772667 indicates, end users are running into it.

I'm not 100% sure what you mean by queuing checks based on versions, but I think the answer is yes. Concourse always takes the latest version and then queues a check from that version, treating it like a cursor. This is how resource types like git return all commits instead of just whatever's at HEAD every minute.

Given that, the duplicate versions problem occurs when the periodic checker queues a check from the current version (say, 1:00:00), but then someone manually triggers a job downstream of the time resource before the check runs. Manually triggered builds force a check of all inputs to make sure they run with the latest versions available, so this will queue another check from the same version. Assuming an interval of 10m, this could result in versions like 1:10:01 and 1:10:03 being saved since both checks just return the current timestamp.

The job scheduler works by computing the next set of versions available for each input (based on version history and passed constraints) and comparing it to the latest build's input versions. If any input has a different version and is marked trigger: true, a new build is created. With a duplicate version saved, it's possible for a build to have run with 1:10:01 and then another build to run with 1:10:03 based on the timing of the scheduler.

In any case, the approach described by #45 resolves all these inconsistencies; if the version history is deterministic, it doesn't matter how long a check is queued before finally running, it doesn't matter if duplicate checks with the same "from" version are queued, and it doesn't matter if the checking interval is configured higher than your time resource interval - the resource will always return the same history.

Looking back at A2, I think having interval "reset" to the start time makes sense, but it shouldn't completely ignore the "from" version. If I run a check from a version whose timestamp was 2 days ago, I would expect it to return all the timestamps that were missed.
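
Concretely, my reading of that expectation (a sketch of the idea, not this PR's TimeLord code):

```go
package main

import (
	"fmt"
	"time"
)

// missedVersions lists every interval-aligned timestamp after `from`, up to
// and including `now` - the deterministic history a delayed or duplicated
// check should reproduce no matter when it finally runs.
func missedVersions(from, now time.Time, interval time.Duration) []time.Time {
	var versions []time.Time
	for t := from.Add(interval); !t.After(now); t = t.Add(interval) {
		versions = append(versions, t)
	}
	return versions
}

func main() {
	from := time.Date(2020, 6, 18, 1, 0, 0, 0, time.UTC)
	now := from.Add(35 * time.Minute)
	// With interval 10m and a check delayed 35 minutes, the missed
	// versions are 1:10, 1:20, and 1:30.
	fmt.Println(missedVersions(from, now, 10*time.Minute))
}
```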

Sorry this takes so much context to really understand, but thanks for hanging in there and asking all the right questions. 🙂 If it starts to feel like this is more than you bargained for, I completely understand, but I'm also happy to keep providing more context until things make sense.

owenfarrell commented 4 years ago

@vito - I'm happy to carry on with this effort as long as you think it's an effort worth carrying on.

Thank you so much for the writeup. I really appreciate all the context that you're able to provide to fill in the blanks in my head. Most of what you reinforced aligns with my understanding, but it still helps to hear it played back.

Given that, the duplicate versions problem occurs when the periodic checker queues a check from the current version (say, 1:00:00), but then someone manually triggers a job downstream of the time resource before the check runs.

So this implementation definitely avoids that problem altogether as both invocations would return the exact same value(s).

Looking back at A2, I think having interval "reset" to the start time makes sense, but it shouldn't completely ignore the "from" version. If I run a check from a version whose timestamp was 2 days ago, I would expect it to return all the timestamps that were missed.

One of the existing unit tests covers that specific scenario. The test is a bit of an exaggerated scenario as there is a year in between executions, but it proves the point.

So the key question still boils down to: when does this resource need to generate an offset to avoid a thundering herd situation?

Scenario 1: Interval Only

The current implementation in this PR generates versions at exact intervals from the previous version. So every version will be n multiples of the interval away from the first one.

Implementing an offset when calculating the first version would ensure some level of distribution across the interval. This is definitely the scenario where I see the most benefit.

Scenario 2: Range Only

More than anything, the version of a range-only configuration will be driven by check_every.

Implementing an offset when calculating each version would ensure some level of distribution across the range.

Scenario 3: Interval + Range

This is the most specific configuration (duh!). Given our assumption that versions rebase to the start time each day, I think we shouldn't calculate/apply an offset here, since any offset would be sub-minute... and that's a level of precision that would have little-to-no impact.

Sound about right?

vito commented 4 years ago

@owenfarrell Definitely! This is something I've been wanting to get around to for a while, and it's an important piece to the puzzle of enabling global resources by default.

So this implementation definitely avoids that problem altogether as both invocations would return the exact same value(s).

Yep exactly.

Scenario 2: Range Only

Hmm I'm having trouble understanding this scenario; I wouldn't expect it to be acutely affected by check_every. It currently just returns a single version once it's reached the time range, and keeps returning the given version as long as it's in the "current range" rather than returning a new version each time:

~/src/concourse master*
❯ echo '{"source":{"start":"6:00 PM","stop":"7:00 PM","location":"America/New_York"}}' | docker run -i concourse/time-resource /opt/resource/check
Unable to find image 'concourse/time-resource:latest' locally
latest: Pulling from concourse/time-resource
2004ee2154b4: Already exists
4496b7ccd935: Pull complete
02ef3f7ad2ad: Pull complete
Digest: sha256:1bb4bd616f2532c5e2017e89e64639de7bb3dee8f594392d5d66bdba3b658dc8
Status: Downloaded newer image for concourse/time-resource:latest
[{"time":"2020-06-18T18:21:18.1496387-04:00"}]

~/src/concourse master*
❯ echo '{"source":{"start":"6:00 PM","stop":"7:00 PM","location":"America/New_York"},"version":{"time":"2020-06-18T18:21:18.1496387-04:00"}}' | docker run -i concourse/time-resource
/opt/resource/check
[{"time":"2020-06-18T18:21:18.1496387-04:00"}]

In general, resource behavior should be pretty independent of check_every - that's more of an optimization knob to turn, typically to reduce the checking activity. This can also be set to a different default at the operator level, so having that impact pipeline semantics would be a no-no.

I think this is actually an easier case to make unique - given the start: and stop: range, we can choose any timestamp within that range and have that be the time that the resource returns every day.

It seems worth noting here that going with start:/stop: instead of at: (which I suggested in https://github.com/concourse/time-resource/issues/45#issuecomment-641552439) allows for this kind of flexibility in the time returned. Maybe we should just stick with it!
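
Sketching that range-only case (dailyVersion and the hashing scheme are assumptions, in line with the earlier hash-modulo idea):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// dailyVersion picks one stable timestamp inside [start, stop) per pipeline
// by using a hash of the identifier as an offset into the range, so the
// resource returns the same point in the window every day.
func dailyVersion(start, stop time.Time, id string) time.Time {
	h := fnv.New64a()
	h.Write([]byte(id))
	span := stop.Sub(start)
	return start.Add(time.Duration(h.Sum64() % uint64(span)))
}

func main() {
	loc, err := time.LoadLocation("America/New_York")
	if err != nil {
		panic(err)
	}
	start := time.Date(2020, 6, 18, 18, 0, 0, 0, loc) // 6:00 PM
	stop := time.Date(2020, 6, 18, 19, 0, 0, 0, loc)  // 7:00 PM
	fmt.Println(dailyVersion(start, stop, "main/my-pipeline"))
}
```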

Scenario 3: Interval + Range

I think it's worth thinking about making the series unique in this scenario too: one of the cases for this configuration is to run expensive test suites during off hours, and that seems like something that would benefit from reducing the thundering herd.

If I configure interval: 10m, start: 2:00 AM, stop: 6:00 AM, I think it could return something like this:
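
For example (illustrative; assuming the pipeline's hash works out to a 3-minute offset): 2:03 AM, 2:13 AM, 2:23 AM, ..., 5:53 AM.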

I may be missing some interesting scenarios here - for example it may be possible that this decision to offset the time range results in N-1 versions being returned as the offset pushes the last one past the stop time, but I think that's OK.

owenfarrell commented 4 years ago

It currently just returns a single version once it's reached the time range, and keeps returning the given version as long as it's in the "current range" rather than returning a new version each time

I think this is actually an easier case to make unique - given the start: and stop: range, we can choose any timestamp within that range and have that be the time that the resource returns every day.

You are absolutely right and I agree with your conclusion.

But check_every still matters a great deal here. And I think we need to confirm the implications are what we want (or broadly disclose the impacts).

Implementation as of v1.2.1

Take the following configuration as an example, and assume perfect scheduling of check invocations by Concourse and no manual triggers of jobs:

resources:
- name: after-midnight
  type: time
  source:
    start: 12:00 AM
    stop: 1:00 AM

jobs:
- name: something-after-midnight
  plan:
  - get: after-midnight
    trigger: true
  - task: something
    config: # ...

Changing check_every will effectively shrink the range of possible versions generated by check. And the version will be inserted into the version history when it is generated.

| check_every | Version Range | Insertion Time |
| ----------- | ------------- | -------------- |
| 1m | 12:00 | 12:00 |
| 5m | 12:00 - 12:04 | 12:00 - 12:04 |
| 15m | 12:00 - 12:14 | 12:00 - 12:14 |
| 30m | 12:00 - 12:29 | 12:00 - 12:29 |
| 60m | 12:00 - 12:59 | 12:00 - 12:59 |

If you take away the "perfect scheduling of check" assumption, then check_every defines your factor of safety (how much of an unexpected delay between check invocations you build in).

Deterministic

Pivot now towards an implementation where the minutes are calculated based on a deterministic function so that versions will be distributed across the entire range. Modifying check_every impacts when those versions are generated (i.e. when check is invoked):

| check_every | Version Range | Insertion Time |
| ----------- | ------------- | -------------- |
| 1m | 12:00 - 12:59 | 12:00 - 1:00 |
| 5m | 12:00 - 12:59 | 12:00 - 1:04 |
| 15m | 12:00 - 12:59 | 12:00 - 1:14 |
| 30m | 12:00 - 12:59 | 12:00 - 1:29 |
| 60m | 12:00 - 12:59 | 12:00 - 1:59 |

There's already a disclaimer about precision, but I think this is taking that to an extreme. It's not backwards breaking, per se. But it represents a pretty significant deviation from current state.

It seems worth noting here that going with start:/stop: instead of at: (which I suggested in #45 (comment)) allows for this kind of flexibility in the time returned. Maybe we should just stick with it!

That's a great point. Introducing at runs counter to this whole thread we're pulling and re-introduces some of the issues we're trying to address.

I may be missing some interesting scenarios here - for example it may be possible that this decision to offset the time range results in N-1 versions being returned as the offset pushes the last one past the stop time, but I think that's OK.

The scenario that I always have in my head goes something like this: As a DevOps engineer, I want to trigger a job that aggregates logs in 10-minute intervals.

Having timestamps that line up neatly on the :00 (with { trigger: true, version: every } on the get step) means that there's one less calculation that needs to be performed throughout the job. Having all the timestamps line up on the :07 or :09 (based on the job/pipeline name) means doing some math in there. 'Cause... you know...

Numbers are hard.

But maybe that's too much of a fringe scenario.

Speaking of which - that means a change in job name or pipeline name will change the offset. We should probably write that down somewhere.

vito commented 4 years ago

@owenfarrell Good points. The rule of thumb is "if you need precision, use cron instead" - so that the time resource is really just used for periodic tasks that don't need such precision.

I think it's worth going forward even with all that in mind. In my experience check_every is typically left alone, and the cluster-wide default is typically left at 1 minute, which I think is good enough. If it breaks due to the cluster-wide default, the user can specify check_every to work around it. Admittedly it's unfortunate that a cluster-wide default would require a change to user pipelines, but at least the end result is more explicit and portable. Another way to go about it could be to allow resource types to "suggest" a checking frequency, but that may be opening a Pandora's box.

Speaking of which - that means a change in job name or pipeline name will change the offset. We should probably write that down somewhere.

Yeah - the README should definitely document all this. 👍

owenfarrell commented 4 years ago

So I started coding this up, but I really struggled with how the changes made a hash of the unit tests.

</DadJokes>

The root cause of the unit test failures boiled down to:

  1. some fringe scenarios introduced by the complexity of this PR.
  2. the unit tests evaluate against now (the execution time), which varies, but adding a modulo to every version creates fixed values.

To solve (1), I think there is an opportunity to simplify [read: rip out] some of the logic that I've added in this PR. I've opened #52 to explicitly implement the default values for start and stop that this resource currently implies in their absence. Making that change in isolation - with minimal changes to the unit tests - proves that this resource is acting on some implied default values.

To solve (2), I think the solution is to pin now to a specific point in time, similar to how TimeLord defines an absolute time for each scenario so that a predictable input generates a predictable output.

owenfarrell commented 4 years ago

@vito - Ensuring an even distribution of versions (over an interval/range) is really throwing me for a loop. I'm not sure that it's possible to write a set of unit tests that are both robust enough to cover all the scenarios, yet resilient enough to run at any time.

So while I've got an implementation ready to go, and I think it's mostly there, demonstrating that it does what it's supposed to do is the harder part.

My approach is to hash the team name and pipeline name, and then offset each generated version by that number of milliseconds. I found a magic combination of team and pipeline names that gets me a small offset (7m), but that means that some tests will fail between 00:00 UTC and 00:07 UTC because the test data impacts the function differently during that window. Same goes for the window between 00:00 and 00:07 in America/Indiana/Indianapolis.

In order to avoid a complete and total overhaul in a single PR, I think we can break this down into 4-5 smaller milestones such that only the last really impacts the behavior of this resource.

  1. Implement the de facto defaults in code (#52)
  2. Add functions to TimeLord to generate the Latest version and the List of missing versions. (#53)
  3. Implement an independent modulo/offset calculator
  4. Refactor the existing commands to use a command pattern, similar to that used by other resources in the concourse org. Moving away from the bona fide execution of the compiled binaries would allow injection of an alternative time provider (e.g. var GetCurrentTime = time.Now by default, resource.GetCurrentTime = time.Date(...) in the unit test; see the sketch after this list) and ensure predictable results no matter when/where you run them from.
  5. Integrate the modulo/offset calculator into the executables
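
A minimal sketch of the injectable time provider from step 4 (the package and variable names are assumptions, not this repo's actual layout):

```go
package resource

import "time"

// GetCurrentTime is the package's clock. Production code calls this instead
// of time.Now directly, so a unit test can pin "now" to a fixed instant and
// get the same output on every run:
//
//	resource.GetCurrentTime = func() time.Time {
//		return time.Date(2020, 7, 1, 12, 0, 0, 0, time.UTC)
//	}
var GetCurrentTime = time.Now
```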

The net effect of these changes wouldn't be realized until the very last step, but it would avoid an oversized PR (like this one has become) and avoid the need for a long-running parallel branch (in case other PRs or fixes come through).

Thoughts?

vito commented 4 years ago

@owenfarrell Sorry for the wait! Will get back to this soon; took time off last week.

owenfarrell commented 4 years ago

@vito - All good. Happy (belated) Canada Day!

vito commented 4 years ago

@owenfarrell 👍 That plan sounds good. Happy to review however many PRs it takes to tidy things up and make this easier to implement. I'm still a bit confused by #52 but I left another comment on it. Thanks for your patience and diligence!

owenfarrell commented 4 years ago

Closing as superseded by #53 and #54