Shopify / camus

Kafka->HDFS pipeline from LinkedIn. It is a MapReduce job that does distributed data loads out of Kafka.

Missing checkout data in HDFS #57

Closed by pmangg 8 years ago

pmangg commented 8 years ago

We're missing some checkout data in HDFS (https://github.com/Shopify/starscream/issues/8162) and it looks to be shop-specific, i.e., a set of shops has had no checkout Kafka data since the 26th. Digging in further, this data is in Kafka but it just isn't being written to HDFS. In a Camus log, I see:

02-12-2015 15:30:29 EST Camus INFO - [CamusJob] - Offset range from kafka metadata is outside the previously persisted offset, checkout    uri:tcp://kafka08.chi.shopify.com:9092    leader:8    partition:12    earliest_offset:240498314    offset:346216743    latest_offset:332230367    avg_msg_size:1343    estimated_size:-18783702968
02-12-2015 15:30:29 EST Camus INFO -  Topic checkout will be skipped.
02-12-2015 15:30:29 EST Camus INFO -  Please check whether kafka cluster configuration is correct. You can also specify config parameter: kafka.move.to.earliest.offset to start processing from earliest kafka metadata offset.

The first Camus run where this log message occurs is https://azkaban.data.shopify.com/executor?execid=149416&job=Camus and we stopped getting checkout Kafka data in HDFS for that set of shops from then onwards.
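For reference, the numbers in the log line above can be checked with a short sketch. It parses the integer fields and shows that the persisted offset sits past latest_offset, which is why the topic is skipped. The size formula, (latest_offset - offset) * avg_msg_size, is an assumption inferred from the logged numbers rather than taken from the Camus source, although it does reproduce the logged estimated_size. The helper names are made up for illustration.

```python
import re

# Fields copied from the Camus log line for checkout partition 12.
LOG_LINE = (
    "checkout uri:tcp://kafka08.chi.shopify.com:9092 leader:8 partition:12 "
    "earliest_offset:240498314 offset:346216743 latest_offset:332230367 "
    "avg_msg_size:1343 estimated_size:-18783702968"
)


def parse_fields(line):
    """Return the whitespace-delimited key:integer fields of a log line."""
    pairs = re.findall(r"(?:^|\s)(\w+):(-?\d+)(?=\s|$)", line)
    return {key: int(value) for key, value in pairs}


def offset_in_range(fields):
    """True when the persisted offset lies inside the broker's offset range."""
    return fields["earliest_offset"] <= fields["offset"] <= fields["latest_offset"]


def estimated_size(fields):
    """Assumed formula: remaining messages times average message size."""
    return (fields["latest_offset"] - fields["offset"]) * fields["avg_msg_size"]


fields = parse_fields(LOG_LINE)
print(offset_in_range(fields))  # False: 346216743 > 332230367
print(estimated_size(fields))   # -18783702968, same as the logged value
```

The negative estimate falls out directly from the persisted offset being about 14 million messages ahead of what the broker reports as its latest offset.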

cc @Shopify/data-acquisition @angelini

pmangg commented 8 years ago

Is there any way we can restart Camus from that day to backfill the missing data? It is used by reportify-checkout-events-view and the checkout fact tables (home cards, Tableau, etc).

drdee commented 8 years ago

So you are only missing data for partition 12? I will make a backup of that data right now, as we are close to the 7-day buffer that's kept on the brokers.


pmangg commented 8 years ago

The "Topic checkout will be skipped" message first appeared in https://azkaban.data.shopify.com/executor?execid=149416&job=Camus for a bunch of partitions, not just partition 12. However, in the latest run, only partition 12 has that message. Looking at the shops in all these partitions (the key for the checkout topic is shop_id), only shops from partition 12 are missing data from that latest run on the 25th.
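Since the topic is keyed by shop_id, each shop's messages always land in the same partition, so a skipped partition shows up as a fixed set of shops with no data. A rough sketch of that mapping, assuming a simple hash-modulo partitioner and a partition count of 16; neither is confirmed anywhere in this thread, and the real producer's hash function may differ:

```python
import zlib

NUM_PARTITIONS = 16  # hypothetical; the real partition count is not stated here


def partition_for(shop_id, num_partitions=NUM_PARTITIONS):
    """Map a shop_id key to a partition (illustrative hash, not Kafka's own)."""
    return zlib.crc32(str(shop_id).encode("utf-8")) % num_partitions


def affected_shops(shop_ids, skipped_partitions):
    """Return the shops whose key maps to one of the skipped partitions."""
    return [s for s in shop_ids if partition_for(s) in skipped_partitions]


# Made-up shop ids, only to show the shape of the result.
shops = list(range(1, 200))
print(affected_shops(shops, {12}))
```

The same shops come back on every run, which matches the shop-specific gap seen in HDFS.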

pmangg commented 8 years ago

Should be fixed by https://github.com/Shopify/cookbooks/pull/9245