Problem?
By looking only at the folders and files dropped by Camus, we can't tell which hours are fully processed and which are not. Our workaround so far has been an arbitrary 3-hour window relative to now.
Solution
This adds a second job to the Camus flow that inspects Camus's execution metadata and tags completed hourly folders with an _IMPORTED file. That file should serve as a better watermark for our data processing.
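To illustrate how a downstream consumer might use the flag instead of the time window, here is a minimal sketch. It is not code from this PR: it assumes a hypothetical `<topic>/YYYY/MM/DD/HH` folder layout, uses the local filesystem as a stand-in for HDFS, and the function name `imported_hours` is made up for the example.

```python
from pathlib import Path
from typing import List

FLAG_NAME = "_IMPORTED"  # flag file written by the second job


def imported_hours(topic_dir: Path) -> List[Path]:
    """Return the hourly folders under topic_dir that contain the flag file.

    Assumes a hypothetical <topic>/YYYY/MM/DD/HH layout with zero-padded
    components, so sorting the paths also sorts them chronologically.
    """
    flags = topic_dir.glob("*/*/*/*/" + FLAG_NAME)
    return sorted(flag.parent for flag in flags)
```

A consumer would then process only the folders returned by `imported_hours` and skip any hour that has not been flagged yet, regardless of how old it is.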
How?
I extracted the base code from a Wikimedia open source project. I had to change a few things, and we decided it was better to move the code into our camus-shopify than to maintain a separate project.
Follow-up
I'm working on a PR to add this knowledge to *scream.
@drdee @airhorns