Largest goroutine counts appear to be running state machines and channel monitoring. Memory appears to be in in-memory copies of the data transfer states.
This is the memory issue we're chasing; my thoughts and current hypothesis follow.
To describe this memory path, it's like this: `Bytes` → `sectionBytes`; the source of these bytes can be seen here: https://github.com/polydawn/refmt/blob/30ac6d18308e584ca6a2e74ba81475559db94c5f/shared/reader.go#L107. There are 3 `Bytes` terminals in a Graphsync message: there is a UUID (32 bytes), a CID prefix (3-4 bytes), but the most interesting is the `data` field of a `GraphSyncBlock`, which can be quite large. For autoretrieve we're mostly dealing with UnixFS data (either dag-pb or raw), where blocks are going to be up to ~1MB in size; so these are potentially quite large chunks and it wouldn't take much for them to compound up to the sizes we're seeing. We were getting regular crashes in our bedrock instance at our 32G memory max; now we have double that but we still get regular crashes, they just happen less frequently.
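For orientation, the shape being decoded is roughly this (illustrative Go only, not the real go-graphsync message types; field names here are made up):

```go
package sketch

// Illustrative only. These are the three `Bytes` terminals the heap profile
// can land in: the request/response UUID, and per block a small CID prefix
// plus the (potentially ~1MB) payload.
type messageSketch struct {
	RequestID []byte // UUID
	Blocks    []blockSketch
}

type blockSketch struct {
	Prefix []byte // CID prefix, 3-4 bytes
	Data   []byte // the block itself; for UnixFS, up to ~1MB
}
```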
So block data, eh? What's the mechanism for leaking these, since we can confirm it's not whole message data that we're leaking, it's specifically the `block` portion?
My hypothesis: Graphsync is dealing with DAGs; after it gets these blocks, it stores them but also decodes them so it can traverse and transfer more of the DAG. The obvious candidate here is decoding a `Cid`, since that's all Graphsync really cares about inside block data. We have nothing else in the path above Graphsync that would care about anything else inside these blocks. If we decode dag-cbor blocks, we're going to be hitting the same `Readn()` callsite because a `Cid` is just a tagged byte section in CBOR, so even if we were leaking via `Cid`s we'd have to have a lot of them because we're talking about ~34 bytes each. But we're not dealing mainly with dag-cbor data; this is mostly dag-pb data because it's IPFS DHT requests which are mostly UnixFS. dag-pb does it differently: it does slice-of-slice to decode its `Data` section and its `Links` array `Hash` elements, which are the `Cid`s that Graphsync will be wanting for traversal. This whole block here performs no new allocations: https://github.com/ipld/go-codec-dagpb/blob/061699f10dbb9e9bc8e8ed3552cbd8597730f412/unmarshal.go#L191-L197 - the `Cid` generated holds on to the underlying slice for the entire block.
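To make that retention mechanism concrete, here's a self-contained toy in plain Go (byte slices only, not the real go-cid / go-codec-dagpb types) showing how a ~34-byte sub-slice pins its ~1MB parent allocation:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var keptLinks [][]byte

	for i := 0; i < 100; i++ {
		block := make([]byte, 1<<20) // stand-in for a ~1MB dag-pb block
		// "Decode" a ~34-byte link out of it without copying: the sub-slice
		// shares block's backing array, so keeping it keeps the whole block.
		keptLinks = append(keptLinks, block[128:128+34])
	}

	runtime.GC()
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	// Expect ~100MB live even though only ~3.4kB of "link" data is wanted.
	fmt.Printf("live heap after GC: %d MB (links kept: %d)\n", ms.HeapAlloc>>20, len(keptLinks))
}
```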
Graphsync could be holding on to these somewhere; I don't see an obvious place, perhaps something in the new reconciledloader (which could even be holding on to the entire block) or maybe even in the traverser, which cares about `Cid`s. But I found a better candidate closer to home, the autoretrieve `RandomPruner`: https://github.com/application-research/autoretrieve/blob/master/blocks/randompruner.go#L50; when you have any pruning option enabled (which we do for bedrock: `prune-threshold: 45 GB`), the `RandomPruner` will act as a Blockstore wrapper and will hold on to any `Cid` that's passed in via `Put()` and `PutMany()`. The `Cid`s are put in a map along with their timestamp and only cleared out when that time expires.
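Stripped down to just the part that matters here, that pattern looks roughly like this (a simplified sketch, not the actual RandomPruner code - see the link above for the real thing):

```go
package sketch

import (
	"context"
	"sync"
	"time"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
	bstore "github.com/ipfs/go-ipfs-blockstore"
)

// trackingPruner records every Put'd Cid with a timestamp and only drops the
// entry once it's old enough. If the Cid's bytes alias a decoded block's
// buffer, that whole buffer stays reachable for as long as the map entry does.
// (PutMany would behave the same way.)
type trackingPruner struct {
	bstore.Blockstore
	mu       sync.Mutex
	recent   map[cid.Cid]time.Time
	lifetime time.Duration
}

func newTrackingPruner(bs bstore.Blockstore, lifetime time.Duration) *trackingPruner {
	return &trackingPruner{Blockstore: bs, recent: make(map[cid.Cid]time.Time), lifetime: lifetime}
}

func (p *trackingPruner) Put(ctx context.Context, blk blocks.Block) error {
	p.mu.Lock()
	p.recent[blk.Cid()] = time.Now() // the Cid, and whatever it references, is now pinned
	p.mu.Unlock()
	return p.Blockstore.Put(ctx, blk)
}

// expire drops entries older than lifetime, finally releasing the Cids (and
// any block buffers they keep alive) to the garbage collector.
func (p *trackingPruner) expire() {
	p.mu.Lock()
	defer p.mu.Unlock()
	cutoff := time.Now().Add(-p.lifetime)
	for c, t := range p.recent {
		if t.Before(cutoff) {
			delete(p.recent, c)
		}
	}
}
```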
For this to work, Graphsync would need to be using the `Cid` from a previous block in the DAG when storing the bytes of that block that come from a later message (if it was just storing blocks in isolation as they come, with newly generated `Cid`s, then those would be new slices). I believe this is what happens: https://github.com/ipfs/go-graphsync/blob/386a1d8990e59f4b3ac66e633960efa7b9d2934b/requestmanager/reconciledloader/load.go#L134-L146 - the `ReconciledLoader` maintains the list of `Cid`s it needs, then when it gets the bytes for that block, it writes those bytes and commits them with the original `Cid` that it was looking for, which would be a sub-slice of a previous dag-pb block and hence be associated with that original allocated data. Filclient uses Graphsync's `LinkSystemForBlockstore` to wrap a `Blockstore`, and that uses the `Cid` in question directly for the `Put()` call: https://github.com/ipfs/go-graphsync/blob/386a1d8990e59f4b3ac66e633960efa7b9d2934b/storeutil/storeutil.go#L42
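For reference, the write path of that wrapper is shaped roughly like this (a simplified sketch of the linked storeutil code, details omitted); the key point is that the committed block is keyed by the `Cid` the traversal already held, not one recomputed from the incoming bytes:

```go
package sketch

import (
	"bytes"
	"context"
	"io"

	blocks "github.com/ipfs/go-block-format"
	bstore "github.com/ipfs/go-ipfs-blockstore"
	"github.com/ipld/go-ipld-prime/datamodel"
	"github.com/ipld/go-ipld-prime/linking"
	cidlink "github.com/ipld/go-ipld-prime/linking/cid"
)

// linkSystemForBlockstoreSketch approximates the write side of graphsync's
// storeutil.LinkSystemForBlockstore: buffer the incoming block bytes, then
// commit them to the blockstore under the Cid the traversal was looking for.
func linkSystemForBlockstoreSketch(bs bstore.Blockstore) linking.LinkSystem {
	lsys := cidlink.DefaultLinkSystem()
	lsys.StorageWriteOpener = func(lctx linking.LinkContext) (io.Writer, linking.BlockWriteCommitter, error) {
		buf := new(bytes.Buffer)
		commit := func(lnk datamodel.Link) error {
			ctx := lctx.Ctx
			if ctx == nil {
				ctx = context.Background()
			}
			// This Cid came from decoding an earlier dag-pb block, so it may
			// be backed by that earlier block's allocation.
			blk, err := blocks.NewBlockWithCid(buf.Bytes(), lnk.(cidlink.Link).Cid)
			if err != nil {
				return err
			}
			return bs.Put(ctx, blk)
		}
		return buf, commit, nil
	}
	return lsys
}
```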
So:

- Graphsync decodes blocks to get `Cid`s to traverse; `Cid`s from dag-pb blocks are sub-slices of the whole dag-pb block
- the `ReconciledLoader` keeps the list of `Cid`s it's looking for
- when the bytes for one of those `Cid`s arrive, they're stored via `Put(cid, bytes)` using that original `Cid`
- the `RandomPruner` holds on to the `cid`, and therefore also the associated underlying block data, for 1 hour
- so we accumulate `Cid`s which are used in association with the data that was found to match the `Cid`
To get to 32G, we'd need something like 65k blocks at ~500kB each. We're only doing ~5k successful "retrievals" per hour, but each of those is for the root of a potentially large DAG, and that doesn't count failed but partial retrievals. Maybe those numbers could get us there? Let's find out.
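(Back of the envelope: 65,000 × ~500kB ≈ 32.5GB, which is right around the original 32G ceiling, so the numbers aren't crazy.)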
I'm about to submit a PR that will test this hypothesis by cloning the `Cid`; then we should only accumulate `Cid` memory (which is still going to be a lot .. but shouldn't be unmanageable). If it's not this then I reckon it's something close to this. If it is this, then it raises interesting questions about our stack and whether we should "fix" this at one or more points, including possibly dag-pb and graphsync.
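For illustration, the clone could look something like this (sketch only; the actual PR may implement it differently):

```go
package sketch

import "github.com/ipfs/go-cid"

// cloneCid re-creates a Cid from a fresh copy of its byte form so the result
// cannot share backing memory with whatever block buffer the original was
// decoded from.
func cloneCid(c cid.Cid) cid.Cid {
	b := c.Bytes()
	fresh := make([]byte, len(b))
	copy(fresh, b)
	cloned, err := cid.Cast(fresh)
	if err != nil {
		// c was already a valid Cid, so this shouldn't happen
		return c
	}
	return cloned
}
```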
It appears the longer autoretrieve runs, the more goroutines run and the more likely autoretrieve is to crash.
I've attached a goroutine dump and heap profile from the bedrock instance as it's currently running.
dumps.zip