dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
11.7k stars 1.48k forks source link

Add `rrule`-based scheduler #9801

Open erinov1 opened 2 years ago

erinov1 commented 2 years ago

What's the use case?

I use dagster to process vendor data from different countries with different business holidays and business hours on an intraday cadence. It's still a bit tricky to use dagster's cron-based scheduler for this. I noticed that Prefect's scheduler accepts iCalendar-based rrules/rruleset (implemented in the dateutil rrule module). I was able to (de)serialize these rulesets via pickling, so presumably they could be passed to the dagster daemon independently of user code, and can generate datetime iterators (so can probably be subbed in for the existing croniter iterators without a ton of work). They also allow for sufficiently complex logic to handle all the scheduler use cases I can think of. I think this would be a great feature for dagster.

Ideas of implementation

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

clairelin135 commented 2 years ago

Hi @erinov1! Thanks for this writeup, I definitely think that supporting rrule would make it much easier to express certain cadences for schedules, like an every other week cadence. We do have an existing issue that I'd like to link to this: https://github.com/dagster-io/dagster/issues/8713

We don't have any immediate plans on the roadmap to implement this feature, but if you're interested in working on it I can assist by answering questions and reviewing PRs.

erinov1 commented 2 years ago

Thanks @clairelin135 ! I'm trying to figure out if this is something I could tackle. My understanding is that the host process reconstructs a schedule from a repo through ExternalScheduleData. I guess one could add an rrule_schedule attribute in addtion to cron_schedule, but that would muddy things downstream (the schedule instigator data also has a cron_schedule attributre), so maybe a different level of abstraction would be needed. Also, would a custom serializer be needed to pass an rruleset (which is essentially a nested dictionary of string-representable rules) as a byte string, or is a string representation needed?

clairelin135 commented 2 years ago

@erinov1, yep, it is complicated given that we store cron_schedule in ScheduleInstigatorData, which is then stored in our schedule database to monitor the current status of a schedule (running or stopped), though not sure what else we could do other than add the additional rrule_schedule field and add checks to ensure only one of rrule or cron is set.

Throughout most of our system we generally serialize objects in named tuples, and create a custom serializer if needed. One example is InstigatorState (code here).

erinov1 commented 2 years ago

@clairelin135 Maybe just having a single schedule argument that could be either a cron string or rrule would be simpler, but I'm assuming backcompatibility would be a concern. Otherwise, if you are ok with just adding another rrule_schedule field, then I can give this a shot. I added support for cron_schedule to accept lists (unions) of cron strings, and this seems essentially analogous to those changes. The only part I need to dig deeper into is the serialization bit (that, and I don't know anything about the dagit side — for example, the dagit UI does not render lists of cron strings correctly).

clairelin135 commented 2 years ago

@erinov1 Yeah, I think having the single schedule argument would cause backcompat issues, because the cron_schedule field is serialized in storage and an error might occur if updated versions of Dagster/Daemon attempt to read the new schedule field.

I think adding the rrule_schedule field is appropriate, though if you put out a PR I'll need to get a second look from someone else on the team.

On the Dagit side, Dagit assumes that the cron string is a single string, which is probably why the list is not rendered correctly. We'd probably add an additional field to GrapheneSchedule if we added rrule_schedule.