application-research / autoretrieve

A server to make GraphSync data accessible on IPFS

feat: increase http reporting throughput, add timeout #112

Closed rvagg closed 2 years ago

rvagg commented 2 years ago

Problem outline: My local instance was starting to show a lot of ErrRetrievalAlreadyRunning errors for given CIDs that were coming in over bitswap, which is a bit weird. The log output I'm getting to my terminal is showing many thousands of "active retrievals", and querying the event log database we're now collecting was showing high numbers as well. Taking a goroutine pprof was showing most of the goroutines stuck in filclient's event reporter, waiting for a listener to return.

This suggests, pretty strongly, that the EventReporter is not draining the channel it maintains for events on their way to the HTTP endpoint, so they're backing up rapidly.

My local connection is not great, and I'm ~250ms round-trip delayed from us-west (and I think we're reporting to us-east, so even worse), and we do a lot of queries: multiple per CID that the indexer says we know about. So my theory is that it's just not able to send the events out fast enough to chew through the list, so they build up and block all the goroutines that we're spawning in queryCandidates() on the trip through filclient and back to the event reporter.

Querying the event database for the number of events from the 'bedrock-dev' instance over the last hour shows roughly 11 events per second, which is probably a lot for my connection. (I think?)

So with this PR I'm attempting a few things:

  1. Using an http.Client with a short timeout; I believe http.DefaultClient has no timeout
  2. Reusing the same client (I don't know if using http.DefaultClient results in reuse of the same resources? I'm hoping there are some possibilities for better connection management here)
  3. Pushing off the actual event posting into a pool of goroutines which can chew through a channel with plenty of buffer
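
The three points above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Event` type, `Reporter`, `NewReporter`, `Submit`, `runDemo`, and the worker and buffer sizes are all hypothetical names and values chosen for the example. Dropping events when the buffer is full is also an assumption made here so that callers never block; the PR itself only describes a channel with plenty of buffer.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
	"time"
)

// Event is a hypothetical stand-in for a retrieval event.
type Event struct {
	CID   string `json:"cid"`
	Phase string `json:"phase"`
}

// Reporter posts events from a fixed pool of worker goroutines.
type Reporter struct {
	url    string
	client *http.Client // one shared client, so connections can be reused
	events chan Event   // buffered queue between callers and workers
	wg     sync.WaitGroup
}

// NewReporter starts `workers` goroutines draining a channel of size `buffer`.
func NewReporter(url string, workers, buffer int, timeout time.Duration) *Reporter {
	r := &Reporter{
		url:    url,
		client: &http.Client{Timeout: timeout}, // bounds each request end to end
		events: make(chan Event, buffer),
	}
	for i := 0; i < workers; i++ {
		r.wg.Add(1)
		go func() {
			defer r.wg.Done()
			for ev := range r.events {
				_ = r.post(ev) // a real reporter would log this error
			}
		}()
	}
	return r
}

// Submit enqueues an event without blocking the caller.
// Assumption: when the buffer is full the event is dropped and false is returned.
func (r *Reporter) Submit(ev Event) bool {
	select {
	case r.events <- ev:
		return true
	default:
		return false
	}
}

func (r *Reporter) post(ev Event) error {
	body, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	resp, err := r.client.Post(r.url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Drain the body so the underlying connection can go back to the pool.
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

// Close stops accepting events and waits for the queued ones to be posted.
func (r *Reporter) Close() {
	close(r.events)
	r.wg.Wait()
}

// runDemo posts n events to a local test server and reports how many
// requests the server received and how many events were dropped.
func runDemo(n int) (received, dropped int) {
	var hits int64
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		atomic.AddInt64(&hits, 1)
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	r := NewReporter(srv.URL, 4, n, 2*time.Second)
	for i := 0; i < n; i++ {
		if !r.Submit(Event{CID: "example-cid", Phase: "example-phase"}) {
			dropped++
		}
	}
	r.Close()
	return int(atomic.LoadInt64(&hits)), dropped
}
```

Calling `runDemo(20)` from a `main` function or a test exercises the whole path against a local `httptest` server. The point of the non-blocking `Submit` is that a slow or stalled endpoint costs at most a full buffer of events, rather than stalling every goroutine that tries to report.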