application-research / autoretrieve

A server to make GraphSync data accessible on IPFS

Growing memory and goroutines #142

Closed hannahhoward closed 1 year ago

hannahhoward commented 1 year ago

It appears that the longer autoretrieve runs, the more goroutines accumulate and the more likely autoretrieve is to crash.

I've attached a goroutine dump and heap profile from the bedrock instance as it currently stands.

(attachment: dumps.zip)

hannahhoward commented 1 year ago

The largest goroutine counts appear to come from running state machines and channel monitoring. Most of the memory appears to be in in-memory copies of the data transfer states.

(attached pprof image: profile001)

rvagg commented 1 year ago

This is the memory issue we're chasing; my thoughts and current hypothesis follow.

(attached pprof image: profile002)

rvagg commented 1 year ago

To describe this memory path, it goes like this:

There are three Bytes terminals in a Graphsync message: a UUID (32 bytes), a CID prefix (3-4 bytes), and, most interestingly, the data field of a GraphSyncBlock, which can be quite large. For autoretrieve we're mostly dealing with UnixFS data (either dag-pb or raw), which will be up to ~1 MB in size; these are potentially quite large chunks, and it wouldn't take much for them to compound up to the sizes we're seeing. We were getting regular crashes on our bedrock instance at its 32 GB memory max; now we have double that, but we still get regular crashes, they just happen less frequently.

So, block data, eh? What's the mechanism for leaking it? We can confirm it's not whole message data that we're leaking; it's specifically the block portion.

My hypothesis: Graphsync is dealing with DAGs; after it gets these blocks, it stores them but also decodes them so it can traverse and transfer more of the DAG. The obvious candidate here is decoding a Cid, since that's all Graphsync really cares about inside block data; we have nothing else in the path above Graphsync that would care about anything else inside these blocks. If we decode dag-cbor blocks, we hit the same Readn() callsite because a Cid is just a tagged byte section in CBOR, so even if we were leaking via Cids we'd need a lot of them, since we're only talking about ~34 bytes each. But we're not dealing mainly with dag-cbor data; this is mostly dag-pb, because these are IPFS DHT requests, which are mostly UnixFS. dag-pb does it differently: it takes slices of the original slice to decode its Data section and the Hash elements of its Links array, which are the Cids Graphsync wants for traversal. This whole block performs no new allocations: https://github.com/ipld/go-codec-dagpb/blob/061699f10dbb9e9bc8e8ed3552cbd8597730f412/unmarshal.go#L191-L197 - the Cid generated holds on to the underlying slice for the entire block.
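
To illustrate the general Go behaviour I'm leaning on here (a generic sketch, not the dag-pb code): a sub-slice shares its parent's backing array, so for as long as the sub-slice is reachable, the whole original allocation stays reachable too.

```go
package main

import "fmt"

func main() {
	// Stand-in for a ~1 MB dag-pb block received off the wire.
	block := make([]byte, 1<<20)

	// Stand-in for a ~34-byte Cid decoded out of that block without copying:
	// it's a sub-slice, so it shares block's backing array.
	cidBytes := block[100:134]

	// Even after we drop our reference to block, cidBytes keeps the full
	// 1 MB backing array alive; the GC can't free any of it.
	block = nil
	fmt.Println(len(cidBytes), cap(cidBytes)) // 34, and a capacity reaching to the end of the 1 MB array
}
```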

Graphsync could be holding on to these somewhere; I don't see an obvious place, but perhaps something in the new reconciledloader (which could even be holding on to the entire block), or maybe the traverser, which cares about Cids. But I found a better candidate closer to home: autoretrieve's RandomPruner: https://github.com/application-research/autoretrieve/blob/master/blocks/randompruner.go#L50. When any pruning option is enabled (which it is for bedrock: prune-threshold: 45 GB), the RandomPruner acts as a Blockstore wrapper and holds on to every Cid passed in via Put() and PutMany(). The Cids are put in a map along with their timestamp and only cleared out once that time expires.
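
As a simplified sketch of that retention pattern (this is not the actual RandomPruner code; the names and shape here are made up for illustration): any wrapper that records incoming Cids in a map keyed by insert time keeps those Cids, and anything they reference, alive until the entry is pruned.

```go
// Simplified illustration of the retention pattern; not the real RandomPruner.
package prunersketch

import (
	"time"

	"github.com/ipfs/go-cid"
)

type pruningWrapper struct {
	inner map[cid.Cid][]byte    // stand-in for the wrapped blockstore
	seen  map[cid.Cid]time.Time // every Cid passed to Put, with its insert time
	ttl   time.Duration
}

func (p *pruningWrapper) Put(c cid.Cid, data []byte) {
	p.inner[c] = data
	// The Cid stays in this map, along with whatever memory it references,
	// until prune() removes it after ttl has elapsed.
	p.seen[c] = time.Now()
}

func (p *pruningWrapper) prune() {
	for c, t := range p.seen {
		if time.Since(t) > p.ttl {
			delete(p.seen, c)
		}
	}
}
```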

For this to work, Graphsync would need to be using the Cid from a previous block in the DAG when storing the bytes of that block as they arrive in a later message (if it were just storing blocks in isolation as they come, with newly generated Cids, those would be new slices). I believe this is what happens: https://github.com/ipfs/go-graphsync/blob/386a1d8990e59f4b3ac66e633960efa7b9d2934b/requestmanager/reconciledloader/load.go#L134-L146 - the ReconciledLoader maintains the list of Cids it needs; when it gets the bytes for a block, it writes those bytes and commits them with the original Cid it was looking for, which would be a sub-slice of a previous dag-pb block and therefore still associated with that original allocation. Filclient uses Graphsync's LinkSystemForBlockstore to wrap a Blockstore, and that uses the Cid in question directly for the Put() call: https://github.com/ipfs/go-graphsync/blob/386a1d8990e59f4b3ac66e633960efa7b9d2934b/storeutil/storeutil.go#L42
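
Roughly paraphrased (this is a sketch of the shape of that store path, not the verbatim graphsync/storeutil code, and the map stands in for the wrapped blockstore): the bytes that just arrived get keyed by a Cid graphsync already held from decoding a parent block.

```go
package storesketch

import (
	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// storeWithOriginalCid illustrates the committer behaviour described above:
// the freshly received bytes are paired with the Cid graphsync was already
// looking for, rather than with a Cid freshly constructed from the bytes.
func storeWithOriginalCid(store map[cid.Cid][]byte, wanted cid.Cid, data []byte) error {
	blk, err := blocks.NewBlockWithCid(data, wanted)
	if err != nil {
		return err
	}
	// Keyed by the old Cid which, per the hypothesis above, may still be
	// backed by memory from the parent dag-pb block it was decoded from.
	store[blk.Cid()] = blk.RawData()
	return nil
}
```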

So:

  1. Graphsync receives a message
  2. Graphsync decodes one or more blocks' data out of the message
  3. Graphsync decodes block data to find more Cids to traverse; Cids from dag-pb blocks are sub-slices of the whole dag-pb block
  4. Graphsync receives blocks for those Cids it's looking for
  5. Graphsync stores block data along with original Cid
  6. Filclient passes the storage operation through a blockstore Put(cid, bytes)
  7. Autoretrieve's blockstore, wrapped by the RandomPruner, holds on to the Cid, and therefore also the associated underlying block data, for 1 hour
  8. With enough volume we end up with a lot of dag-pb blocks in memory, kept alive solely by one or more Cids that were stored in association with the data matching them (a toy reproduction of this chain is sketched below)
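
Putting those steps together as a toy reproduction (a standalone sketch that simulates the pattern with plain slices and a map, not the real stack): each tiny retained "Cid" pins its ~1 MB parent buffer, so retained memory scales with block size, not Cid size.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Stand-in for the pruner's map: values are tiny sub-slices playing the
	// role of Cids decoded out of parent dag-pb blocks without copying.
	retained := make(map[int][]byte)

	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	for i := 0; i < 1000; i++ {
		block := make([]byte, 1<<20) // ~1 MB "dag-pb block"
		retained[i] = block[:34]     // ~34-byte "Cid" pins the whole 1 MB buffer
	}

	runtime.GC()
	runtime.ReadMemStats(&after)

	// We've kept only ~34 kB of "Cids", but roughly 1 GB of heap stays in
	// use, because each sub-slice keeps its parent allocation reachable.
	fmt.Printf("heap in use grew by ~%d MB\n", (after.HeapInuse-before.HeapInuse)/(1<<20))
}
```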

To get to 32 GB we'd need roughly 65k blocks at ~500 kB each. We're only doing ~5k successful "retrievals" per hour, but each of those is the root of a potentially large DAG, and that doesn't count failed-but-partial retrievals. Maybe those numbers could get us there? Let's find out.

I'm about to submit a PR that will test this hypothesis by cloning the Cid; then we should only accumulate Cid memory (which is still going to be a lot .. but shouldn't be unmanageable). If it's not this then I reckon it's something close to this. If it is this, then it raises interesting questions about our stack and whether we should "fix" this at one or more points, including possibly dag-pb and graphsync.