hannahhoward commented 1 year ago

Goal

We have chronic memory usage issues with Bedrock's autoretrieve instance. Memory allocated in go-graphsync upon receipt of new blocks makes it's way into autoretrieve's blocks.Manager.Put calls and then all the way into bitswap's peer task queue. With the default config of only one Bitswap message sending worker, this means under heavy bitswap requests, huge sets of block data are being put on the task queue as "Topic"s. They're held from the time a retrieval gets a block to the time a block gets to the top of the task queue to send over the wire. These topics, while a blank interface, were never meant to hold large objects like blocks, only intentifiers of data like cids, or other smaller items. This PR reverts them back to CIDs, like go-bitswap uses, and only reads them back off disk when a block is sending.

Implementation

Change the block managers GetAwait to be called "AwaitBlock" -- indicating you're just waiting for the block to be present, not actually getting all it's data
the call back now only returns a cid & block size, which is sufficient to put the task onto the queue even for a wantBlock
modify the message send code to read blocks blocks from disk at the time of moving them over the wire

hannahhoward commented 1 year ago

@elijaharita we know you originally did the refactor of the provider that made the blocks the topic for wantBlock calls. We're not sure of the full reasoning here, but we'd love to hear more.

In the meantime, we'd like to merge this (after review) and see if it resolves our memory issues -- if it turns out the problems are elsewhere, and this has significant throughput issues, we can revert back.

hannahhoward commented 1 year ago

We've been trying to solve our memory related issues for almost a week, because Go has terrible reference tracking (so heap dumps don't give us real answers across go routines) -- we've tried most other things and this seems like a much more likely hypothesis.

codecov-commenter commented 1 year ago

Codecov Report

Base: 20.84% // Head: 20.79% // Decreases project coverage by -0.05% :warning:

Coverage data is based on head (c61a585) compared to base (8216561). Patch coverage: 22.22% of modified lines in pull request are covered.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #155 +/- ## ========================================== - Coverage 20.84% 20.79% -0.06% ========================================== Files 16 16 Lines 2341 2347 +6 ========================================== Hits 488 488 - Misses 1823 1829 +6 Partials 30 30 ``` | [Impacted Files](https://codecov.io/gh/application-research/autoretrieve/pull/155?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=application-research) | Coverage Δ | | |---|---|---| | [bitswap/provider.go](https://codecov.io/gh/application-research/autoretrieve/pull/155/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=application-research#diff-Yml0c3dhcC9wcm92aWRlci5nbw==) | `0.00% <0.00%> (ø)` | | | [blocks/randompruner.go](https://codecov.io/gh/application-research/autoretrieve/pull/155/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=application-research#diff-YmxvY2tzL3JhbmRvbXBydW5lci5nbw==) | `0.00% <0.00%> (ø)` | | | [blocks/manager.go](https://codecov.io/gh/application-research/autoretrieve/pull/155/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=application-research#diff-YmxvY2tzL21hbmFnZXIuZ28=) | `80.00% <100.00%> (ø)` | | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=application-research). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=application-research)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

elijaharita commented 1 year ago

@elijaharita we know you originally did the refactor of the provider that made the blocks the topic for wantBlock calls. We're not sure of the full reasoning here, but we'd love to hear more.

In the meantime, we'd like to merge this (after review) and see if it resolves our memory issues -- if it turns out the problems are elsewhere, and this has significant throughput issues, we can revert back.

no worries, i didn't have a specific thought process for this and i am not surprised there might be issues with it. if u have a more correct solution then by all means, go for it!

application-research / autoretrieve

Don't put blocks on task queue #155

Goal

Implementation

Codecov Report