Shopify / camus

Kafka->HDFS pipeline from LinkedIn. It is a MapReduce job that does distributed data loads out of Kafka.

Removing Watermark related jobs #149

Closed dterror-zz closed 6 years ago

dterror-zz commented 6 years ago

Removes the job that generated watermark files and the job that checked for late-arriving data (based on a naïve interpretation of watermarks).

Removing them since they're not being used anymore.

cc @cfournie @kmtaylor-github: the PR phasing it out in *S got merged this morning. When do you think we're good to stop producing watermarks?

cfournie commented 6 years ago

Let's wait until Thursday to make sure that this change sticks

dterror-zz commented 6 years ago

Good to go @cfournie @kmtaylor-github ?

kmtaylor-github commented 6 years ago

I think so. I hacked up this query to monitor cases where we disagreed with Camus: https://logs.shopify.io/en-US/app/data/search?sid=1523021551.16032_0C7CF512-2D40-4AC5-8FDE-E56474880ED3

Most are slightly ahead of the watermark; an example:

| Topic | Folder | Max Time | Watermark Time |
| --- | --- | --- | --- |
| trekkie.storefront | 2018/04/05/21 | 04/06/2018 02:06:42.345 | 3:21 |
| trekkie.storefront | 2018/04/05/22 | 04/06/2018 03:06:57.861 | 4:21 |
| trekkie.storefront | 2018/04/05/23 | 04/06/2018 04:06:14.195 | 5:22 |

There are also cases where it looks like watermarking is disabled or very far behind (e.g. iq.delivery.backfill.request)

dterror-zz commented 6 years ago

Awesome. iq.delivery.backfill.request is a very problematic topic. It's no longer whitelisted, and it had issues right before we un-whitelisted it (they deleted and re-created the topic), which is probably why we didn't have a chance to watermark it. This is a case in which the *S algo is more accurate.

cfournie commented 6 years ago

Good to go 👍