farcasterxyz / hub-monorepo

Implementation of the Farcaster Hub specification and supporting libraries for building applications on Farcaster
https://www.thehubble.xyz
MIT License

Service using shuttle was stuck for a day, without any errors. #2157

Closed: chetankashetti closed 1 month ago

chetankashetti commented 4 months ago

What is the bug? The shuttle service was stuck for a day without any error logs or exceptions.

How can it be reproduced? We have 3 shards running for the live subscription. Two of them, shard-0 and shard-2, were stuck.

We noticed that we were no longer receiving data from shuttle, and when we checked the logs there were no errors. The metrics we looked at, hub CPU and memory, service CPU and memory, and RDS, all looked fine; in fact, they were underutilised. The attached screenshots show no interaction: the service stayed in a hanging state for a while, and we are not sure whether the connection was even still open.

After it had been stuck for a day, our first action was to restart the pod. After the restart it resumed syncing from the eventId where it had been stuck, and it took a few hours to catch up. Once it was live again, however, we observed that a cast I had made an hour earlier was not indexed. Shouldn't it have been indexed, given that the live stream holds data for 3 days? It missed my cast, so it may have missed others as well.

So, to summarise, we would like to know a couple of things:

  1. Why was the service stuck at an eventId without any error, even though the health of all components looked good?
  2. Does the live event subscription cover all events if it was stopped for an hour or two, or for any period shorter than 3 days?

We have not been able to reproduce the issue; we have observed it only once.
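Since the stall produced no error at all, one way to surface it would be a small idle watchdog inside the service that tracks when the last event arrived. The following is only a minimal sketch under stated assumptions: `StallWatchdog`, `recordEvent`, `isStalled`, and the 5-minute threshold are hypothetical names and values for illustration, not part of shuttle, and event IDs are assumed to fit in a `number`.

```typescript
// Hypothetical helper, not part of shuttle: remembers when the last event
// arrived so a silent stall can be detected even when nothing throws.
class StallWatchdog {
  private readonly maxIdleMs: number;
  private lastEventAtMs: number;
  private lastEventId: number | undefined = undefined;

  // startedAtMs is injectable so the logic can be checked without real time passing.
  constructor(maxIdleMs: number, startedAtMs: number = Date.now()) {
    this.maxIdleMs = maxIdleMs;
    this.lastEventAtMs = startedAtMs;
  }

  // Call this from the event handler for every event received.
  recordEvent(eventId: number, nowMs: number = Date.now()): void {
    this.lastEventId = eventId;
    this.lastEventAtMs = nowMs;
  }

  // True when no event has been recorded for longer than maxIdleMs.
  isStalled(nowMs: number = Date.now()): boolean {
    return nowMs - this.lastEventAtMs > this.maxIdleMs;
  }

  // The last event ID seen, useful to log alongside a stall warning.
  lastSeenEventId(): number | undefined {
    return this.lastEventId;
  }
}

// Example with explicit timestamps (milliseconds), assuming a 5-minute threshold.
const exampleWatchdog = new StallWatchdog(5 * 60_000, 0);
exampleWatchdog.recordEvent(101, 1_000);
const stalledAfterOneMinute = exampleWatchdog.isStalled(61_000);
const stalledAfterTenMinutes = exampleWatchdog.isStalled(601_000);
```

In a real service, the check could run on a timer and log the last seen eventId or exit the process so that the orchestrator restarts the pod, which is what we ended up doing by hand. A threshold that is too low would cause false alarms on a quiet shard, so it would need tuning per deployment.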

sds commented 1 month ago

Thank you for the report, sorry for the delay in response. Shuttle has seen multiple improvements related to issues such as this since this was opened. If you're still seeing this on the latest version of shuttle, feel free to open a new ticket with the latest evidence + details you have, as it is likely a different issue at this point.

Thank you!