Shopify / camus

Kafka->HDFS pipeline from LinkedIn. It is a MapReduce job that does distributed data loads out of Kafka.

Add camus watermark #60

Closed dterror-zz closed 8 years ago

dterror-zz commented 8 years ago

Problem?

Just by looking at the folders and files Camus drops, we can't tell which hours are fully processed and which are not. Our workaround so far has been an arbitrary 3-hour window (relative to now).

Solution

This adds a second job to the Camus flow that inspects Camus's execution metadata and tags completed hourly folders with an _IMPORTED file. This should serve as a better watermark for our data processing.
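
To make the idea concrete, here is a minimal sketch of the tagging step, not the code in this PR. It assumes an hourly layout of `<base>/<topic>/YYYY/MM/DD/HH`, a hypothetical mapping from Kafka partition to the timestamp of the newest message imported for it (standing in for Camus's execution metadata), and a local filesystem in place of HDFS. The completeness rule (an hour is done once every partition has moved past its end) is also an assumption. Only the `_IMPORTED` file name comes from the description above.

```python
from datetime import datetime, timedelta
from pathlib import Path

FLAG_NAME = "_IMPORTED"  # marker file name taken from the PR description


def completed_hours(first_hour, latest_ts_by_partition):
    """Return the hour starts that every partition has fully passed.

    latest_ts_by_partition is a hypothetical mapping of partition id to the
    timestamp of the newest message imported for that partition.
    """
    if not latest_ts_by_partition:
        return []
    # The slowest partition bounds what can safely be called complete.
    low_watermark = min(latest_ts_by_partition.values())
    hours = []
    hour = first_hour.replace(minute=0, second=0, microsecond=0)
    while hour + timedelta(hours=1) <= low_watermark:
        hours.append(hour)
        hour += timedelta(hours=1)
    return hours


def hour_dir(base, topic, hour):
    # Assumed layout: <base>/<topic>/YYYY/MM/DD/HH
    return Path(base) / topic / hour.strftime("%Y/%m/%d/%H")


def flag_completed(base, topic, first_hour, latest_ts_by_partition):
    """Create an empty marker file in each completed hourly folder that exists."""
    flagged = []
    for hour in completed_hours(first_hour, latest_ts_by_partition):
        folder = hour_dir(base, topic, hour)
        if folder.is_dir():
            (folder / FLAG_NAME).touch()
            flagged.append(folder)
    return flagged


def is_ready(folder):
    """Consumer-side check: only read folders that carry the marker."""
    return (Path(folder) / FLAG_NAME).exists()
```

A downstream job would then call `is_ready` on each hourly folder instead of skipping everything newer than a fixed window.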

How?

I extracted the base code for this from a Wikimedia open source project. I had to change a few things, and we decided it was better to move the extracted code into our camus-shopify than to keep a separate project.

Follow-up

I'm working on a PR to add this knowledge to *scream.

@drdee @airhorns

drdee commented 8 years ago

:ship: