Metro-Records / la-metro-dashboard

An Airflow-based dashboard for LA Metro

moving scheduling up to the cron layer #59

Closed fgregg closed 3 years ago

fgregg commented 4 years ago

We can move some of the scheduling logic we are handling in the python code up to the cron layer.

Take fast_full_scraping as the example.

The current cron schedule is

0,5 * * * 5,6 or "At 0 and 5 minutes past the hour, only on Friday and Saturday"

The job has the following scheduling logic.


from datetime import datetime

def fast_full_scraping():
    if IN_SUPPORT_WINDOW():
        now = datetime.now()

        # The cron only fires at minutes 0 and 5, so this distinguishes
        # a top-of-the-hour run from a five-minutes-past run.
        if now.minute < 5:
            return 'fast_full_event_scrape'

        else:
            return 'fast_full_bill_scrape'

    else:
        return 'no_scrape'

The first branch checks to see whether we are in a support window.

We could replace this check with two explicit crons:

0,5 21-23 * * 5
0,5 0-5 * * 6

The next two checks look to see whether we are at the top of the hour or at five minutes past. We could handle these conditions with two more crons apiece, for Friday and Saturday:

0 21-23 * * 5 => fast_full_event_scrape
5 21-23 * * 5 => fast_full_bill_scrape
0 0-5 * * 6 => fast_full_event_scrape
5 0-5 * * 6 => fast_full_bill_scrape

This all assumes that you can associate the same task with more than one cron, and I don't know if that's an Airflow kind of thing to do.
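Whether the explicit crons really agree with the Python branching can be sanity-checked with a short script. A minimal sketch, assuming the support window is Friday 21:00-23:59 and Saturday 00:00-05:59 (read off the crons above; the real IN_SUPPORT_WINDOW may differ) and that the job only ever fires at minutes 0 and 5:

```python
# Compare the existing Python branching against the four proposed crons.
# Note: Python's datetime uses 0=Monday..6=Sunday, while the crons above
# use cron's 0=Sunday..6=Saturday convention (so cron 5,6 = Fri, Sat).
EVENT, BILL, NONE = 'fast_full_event_scrape', 'fast_full_bill_scrape', 'no_scrape'

def in_support_window(weekday, hour):
    # Assumed window: Fri 21:00-23:59 and Sat 00:00-05:59 (Python weekdays).
    return (weekday == 4 and 21 <= hour <= 23) or (weekday == 5 and 0 <= hour <= 5)

def python_branching(weekday, hour, minute):
    """The existing logic, evaluated at the minutes the cron fires."""
    if in_support_window(weekday, hour):
        return EVENT if minute < 5 else BILL
    return NONE

def cron_routing(weekday, hour, minute):
    """The four proposed crons: minute 0 -> event scrape, minute 5 -> bill scrape."""
    if in_support_window(weekday, hour):
        return {0: EVENT, 5: BILL}[minute]
    return NONE

# The original cron (0,5 * * * 5,6) fires Fri/Sat at minutes 0 and 5 only.
for weekday in (4, 5):
    for hour in range(24):
        for minute in (0, 5):
            assert python_branching(weekday, hour, minute) == cron_routing(weekday, hour, minute)
print('equivalent')  # prints: equivalent
```
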

fgregg commented 4 years ago

Related feature request: https://github.com/apache/airflow/issues/8649

fgregg commented 4 years ago

Mailing list discussion: https://lists.apache.org/list.html?dev@airflow.apache.org:lte=1y:mauricio

Aborted PR: https://github.com/apache/airflow/pull/9091

hancush commented 4 years ago

@fgregg I spiked this in #60. I also spiked an alternative approach – conceptually separating DAG scheduling from task branching – in #61.

Both PRs include brief discussion of the pros and cons. I'm interested in your reaction.

fgregg commented 4 years ago

Looking at both of these, I think I prefer the flatter style to the branched style.

I like being able to just look at a single place to know when a job should be run and what precise job it will be, and I like being able to adjust the scheduling without affecting other jobs.

The approach you take in #61 definitely makes for cleaner code, but the branching operator still spreads scheduling responsibility around.

My main concern with #60 is that it has very similar DAG definitions, which we are going to want to keep in sync. Code drift will make maintenance annoying.

If we can address that, perhaps with a DAG factory function, perhaps with another approach, then the flat approach wins, in my opinion.

If we can't, or only can by writing code that is too clever to maintain, then the approach in #61 is better.

hancush commented 4 years ago

Thanks for your analysis, @fgregg! I've done some more thinking and talking, and I think I'm inclined to agree with you that the DAG factory is the best solution. The big reason is that it allows us to specify the type of scrape that ran in the name of the DAG, which is really important for Shelly and Omar's user experience in the dashboard view.

I definitely see what you mean about cleverness, as well. One pattern I've seen is to maintain a config object defining all of the DAGs, then feed it into a very straightforward loop to generate them programmatically. That might look like:

DAG_CONFIG = {
    'regular_bill_scrape': {
        'schedule_interval': 'foo',
        'docker_environment': {
            'TARGET': 'bills',
            'WINDOW': 0.05,
        },
    },
    'broad_bill_scrape': {
        'schedule_interval': 'foo',
        'docker_environment': {
            'TARGET': 'bills',
            'WINDOW': 1,
            'RPM': 0,
        },
    },
    'full_bill_scrape': {
        'schedule_interval': 'foo',
        'docker_environment': {
            'TARGET': 'bills',
            'WINDOW': 0,
        },
    },
    # etc.
}

for dag, config in DAG_CONFIG.items():
    # create the dag, add the task, add it to globals()

This may make more sense if we actually spiked it, so I'll add that to the TODO list.
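In the meantime, here's a minimal sketch of what that loop might look like. types.SimpleNamespace stands in for Airflow's DAG and DockerOperator so the sketch runs without Airflow installed, and the schedule strings are placeholders just like the 'foo' values above:

```python
from types import SimpleNamespace

# Placeholder config, mirroring the shape sketched above.
DAG_CONFIG = {
    'regular_bill_scrape': {
        'schedule_interval': '5 * * * *',  # placeholder, like 'foo' above
        'docker_environment': {'TARGET': 'bills', 'WINDOW': 0.05},
    },
    'full_bill_scrape': {
        'schedule_interval': '0 0 * * 6',  # placeholder, like 'foo' above
        'docker_environment': {'TARGET': 'bills', 'WINDOW': 0},
    },
}

for dag_id, config in DAG_CONFIG.items():
    # In the real DAG file this would be DAG(dag_id, schedule_interval=...)
    # plus a DockerOperator with environment=config['docker_environment'].
    dag = SimpleNamespace(dag_id=dag_id, **config)

    # Airflow discovers DAGs by scanning module-level names, hence globals().
    globals()[dag_id] = dag

print(full_bill_scrape.docker_environment['TARGET'])  # prints: bills
```

Each scrape then gets its own named DAG, which also satisfies the goal of surfacing the scrape type in the dashboard view.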

hancush commented 3 years ago

@fgregg I spiked what I was talking about in https://github.com/datamade/la-metro-dashboard/pull/62. Let me know what you think!

hancush commented 3 years ago

Done in #62.