TobikoData / sqlmesh

Efficient data transformation and modeling framework that is backwards compatible with dbt.
https://sqlmesh.com
Apache License 2.0
1.82k stars 160 forks source link

Setting a cron of `@weekly` with `incremental_by_time_range` doesn't backfill properly if `batch_size` is set less than 7 #3200

Closed ravenac95 closed 1 month ago

ravenac95 commented 1 month ago

We have some models that process timeseries data and set the batch_size to 1 (though this may not be entirely necessary all the time for us). When we looked at the resulting table it seemed to be missing quite a bit of data inexplicably. Upon investigation, I realize the batch_size doesn't actually work as expected based on the documentation:

For example, if a model specifies a cron of @hourly and a batch_size of 12, when backfilling 3 days of data, the scheduler will spawn 6 jobs. (3 days * 24 hours/day = 72 hour intervals to fill. 72 intervals / 12 intervals per job = 6 jobs.)

I would expect that for say a time range from 2024-01-01 to 2024-01-07 that I'd get a single job that executes a single query for that week but instead I see 7 jobs.

Part of this has something to do with the IntervalUnit being limited to a DAY and the next being a MONTH. I toyed with a fix that allows for a WEEK interval and I believe that would work for this specific case but I'm not sure if there is a better solution for this as it seems like this won't exactly scale properly to things where you might have say a cron that is every 3 days like 0 0 */3 * *. Due to the way the backfill works with the INCREMENTAL_BY_TIME_RANGE materialization, if you were to set the batch_size = 1 this could cause some data to be lost because it would generate a backfill for every day within that 3 day period. Assuming you were somehow bucketing results into that 3 day time interval, then the final BETWEEN {start} and {end} of the insert query would result in data loss because the {start} and {end} values used would be singular day which if used with start_ds/start_date/end_ds/end_date/etc would be the wrong dates.

I may just open a PR with my IntervalUnit fix to start but I think the discussion here would be useful to find the best answer.

ravenac95 commented 1 month ago

Actually upon submitting this, I think the solution is to instead ensure that the cron_next of the IntervalUnit to always take into account the original cron as opposed to the IntervalUnit's label of month/day/hour/5-minute.

tobymao commented 1 month ago

this is by design, cron is simply how often a job should run, interval unit is the actual granularity of each interval, so even if a job runs weekly, it's interval unit will be daily.

i'm not exactly sure what the issue is, but happy to continue discusing. closing for now.