Closed ravenac95 closed 1 month ago
Actually, upon submitting this, I think the solution is instead to ensure that the `cron_next` of the `IntervalUnit` always takes into account the original cron, as opposed to the `IntervalUnit`'s label of month/day/hour/5-minute.
this is by design: cron is simply how often a job should run, while interval unit is the actual granularity of each interval, so even if a job runs weekly, its interval unit will be daily.
i'm not exactly sure what the issue is, but happy to continue discussing. closing for now.
We have some models that process timeseries data and set the `batch_size` to 1 (though this may not be entirely necessary all the time for us). When we looked at the resulting table, it seemed to be missing quite a bit of data inexplicably. Upon investigation, I realized the `batch_size` doesn't actually work as expected based on the documentation:

I would expect that for, say, a time range from `2024-01-01` to `2024-01-07`, I'd get a single job that executes a single query for that week, but instead I see 7 jobs.

Part of this has something to do with the `IntervalUnit` being limited to a `DAY` and the next being a `MONTH`. I toyed with a fix that allows for a `WEEK` interval and I believe that would work for this specific case, but I'm not sure if there is a better solution, as it seems like this won't scale properly to cases where you might have, say, a cron that runs every 3 days like `0 0 */3 * *`. Due to the way the backfill works with the `INCREMENTAL_BY_TIME_RANGE` materialization, if you were to set the `batch_size = 1`, this could cause some data to be lost, because it would generate a backfill for every day within that 3 day period. Assuming you were somehow bucketing results into that 3 day time interval, the final `BETWEEN {start} and {end}` of the insert query would result in data loss, because the `{start}` and `{end}` values used would cover a single day, which if used with `start_ds/start_date/end_ds/end_date/etc` would be the wrong dates.

I may just open a PR with my `IntervalUnit` fix to start, but I think the discussion here would be useful to find the best answer.
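
To make the scenario concrete, here is a minimal sketch in plain Python. It is not SQLMesh's scheduler code: `daily_intervals`, `batches`, `bucket_start`, and `covers_full_bucket` are hypothetical helpers, and the 3-day buckets are assumed to be anchored at `2024-01-01`. It only illustrates how a daily interval unit combined with a batch size of 1 yields one job per day, and why a query that buckets into 3-day windows would then only ever see part of each bucket.

```python
from datetime import date, timedelta


def daily_intervals(start: date, end: date) -> list[tuple[date, date]]:
    """Split an inclusive [start, end] range into one-day (start, end) intervals."""
    days = (end - start).days + 1
    return [(start + timedelta(days=i),) * 2 for i in range(days)]


def batches(intervals: list[tuple[date, date]], batch_size: int) -> list[tuple[date, date]]:
    """Group consecutive intervals; each group becomes one job's (start, end)."""
    jobs = []
    for i in range(0, len(intervals), batch_size):
        chunk = intervals[i : i + batch_size]
        jobs.append((chunk[0][0], chunk[-1][1]))
    return jobs


def bucket_start(d: date, anchor: date = date(2024, 1, 1), width: int = 3) -> date:
    """Start of the (assumed) 3-day bucket that contains d."""
    return d - timedelta(days=(d - anchor).days % width)


def covers_full_bucket(job: tuple[date, date], width: int = 3) -> bool:
    """True if the job's BETWEEN start AND end spans one whole bucket."""
    start, end = job
    return start == bucket_start(start) and end == start + timedelta(days=width - 1)


week = daily_intervals(date(2024, 1, 1), date(2024, 1, 7))

# Daily interval unit with a batch size of 1: one job per day.
jobs = batches(week, batch_size=1)
print(len(jobs))  # 7

# None of those single-day jobs sees a complete 3-day bucket.
print(sum(covers_full_bucket(j) for j in jobs))  # 0

# Grouping three daily intervals per job lines up with the buckets,
# except for the trailing partial window.
print([covers_full_bucket(j) for j in batches(week, batch_size=3)])  # [True, True, False]
```

In this sketch every single-day job would write a partial aggregate for its bucket, which matches the kind of missing data described above.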