ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0
3.81k stars 220 forks

Fix checkpoint cleanup failure (#688) #689

Closed mwylde closed 4 months ago

mwylde commented 4 months ago

Fixes a failure that could occur in checkpoint cleanup when a table exists in one epoch but not in a previous epoch. To clean up a checkpoint, we use the following procedure:

  1. Get the metadata for the "new min" epoch (the oldest one that won't be cleaned) and look at all of the files that it references
  2. For each epoch that we are cleaning, get its metadata and look at all of the files that it references
  3. Delete every file from (2) that's not in (1)
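
The procedure above amounts to a set difference. A minimal sketch, assuming each epoch's referenced files have already been collected into a set of paths; `files_to_delete` and its parameters are hypothetical names for illustration, not Arroyo's actual API:

```rust
use std::collections::HashSet;

/// Hypothetical helper: returns the files that are safe to delete, i.e. every
/// file referenced by an epoch being cleaned (step 2) that is not also
/// referenced by the "new min" epoch (step 1).
fn files_to_delete(
    new_min_files: &HashSet<String>,
    cleaned_epochs: &[HashSet<String>],
) -> HashSet<String> {
    cleaned_epochs
        .iter()
        .flatten() // walk every file of every epoch being cleaned
        .filter(|file| !new_min_files.contains(*file)) // keep files the new min still needs
        .cloned()
        .collect()
}
```

A file that is still referenced by the new min epoch is never returned, even if older epochs reference it too.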

To determine which files are referenced, we have to look at the table metadata to figure out the table type and config. That involves looping through all of the tables referenced in a particular operator checkpoint. However, there was a subtle bug: we were using the table metadata from (1) for each iteration of (2). That meant that if a table existed only in the new_min epoch but not in an older checkpoint, we would fail to find it in the older checkpoint's metadata and panic.

The fix is to ensure we are always iterating over the tables of the epoch that we're cleaning.
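
As a simplified model of the difference (not the actual Arroyo code), assume an operator checkpoint's metadata is just a map from table name to the files that table references. `OperatorMetadata`, `referenced_files_buggy`, and `referenced_files_fixed` are hypothetical names, and the buggy variant returns `None` where the real code would have panicked:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for an operator checkpoint's table metadata:
/// table name -> files referenced by that table.
type OperatorMetadata = BTreeMap<String, Vec<String>>;

/// Buggy shape: the table names come from the new_min epoch, but the lookup
/// happens in the epoch being cleaned. A table that exists only in new_min
/// is missing there, which is modeled here as `None`.
fn referenced_files_buggy(
    new_min: &OperatorMetadata,
    cleaned: &OperatorMetadata,
) -> Option<BTreeSet<String>> {
    let mut files = BTreeSet::new();
    for table in new_min.keys() {
        let table_files = cleaned.get(table)?; // missing table: the failure case
        files.extend(table_files.iter().cloned());
    }
    Some(files)
}

/// Fixed shape: iterate over the tables of the epoch being cleaned, so every
/// lookup is against metadata that is known to contain the table.
fn referenced_files_fixed(cleaned: &OperatorMetadata) -> BTreeSet<String> {
    cleaned.values().flatten().cloned().collect()
}
```

With a table present only in the new_min metadata, the buggy variant fails while the fixed variant simply returns the older epoch's own files.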

Closes #688