application-research / autoretrieve

A server to make GraphSync data accessible on IPFS

Cache all cids in memory in pruner #175

Open hannahhoward opened 1 year ago

hannahhoward commented 1 year ago

Goals

Performance testing indicates the pruner is still a major performance bottleneck, and the hot spot is reading from the blockstore's all-keys channel.

By our calculations, we can keep a list of all keys in memory without much of a memory penalty, and thus avoid this bottleneck.

Implementation

For discussion

This is complicated enough that I want to write tests before anyone merges it, but I want folks to review the approach now.

codecov-commenter commented 1 year ago

Codecov Report

Base: 5.43% // Head: 5.31% // Decreases project coverage by 0.12% :warning:

Coverage data is based on head (a1610e9) compared to base (27eabc8). Patch coverage: 0.00% of modified lines in pull request are covered.


Additional details and impacted files

```diff
@@            Coverage Diff            @@
##           master     #175     +/-   ##
=========================================
- Coverage    5.43%    5.31%   -0.12%
=========================================
  Files          14       14
  Lines        1639     1674      +35
=========================================
  Hits           89       89
- Misses       1545     1580      +35
  Partials        5        5
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| [blocks/randompruner.go](https://codecov.io/gh/application-research/autoretrieve/pull/175?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=application-research#diff-YmxvY2tzL3JhbmRvbXBydW5lci5nbw==) | `0.00% <0.00%> (ø)` | |


rvagg commented 1 year ago

I still hate this file-writing business. I reckon now it'd be more efficient, and safe, to just iterate over the CID map and roll the dice on each one, with some probability of 1% or something small. Delete when they lose the dice roll and aren't pinned, stop iterating if we reach our threshold, or iterate again if we haven't (maybe with some safety around it like don't loop more than 20 times). Go's unstable map iteration ordering even helps a bit here.
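The dice-roll approach described above could be sketched roughly as follows. This is a minimal illustration, not autoretrieve's actual code: `cidKey` stands in for `cid.Cid`, the names are hypothetical, and the 1% probability and 20-pass cap are the numbers from the comment.

```go
package main

import "math/rand"

// cidKey stands in for cid.Cid in this sketch.
type cidKey string

// prunePercent is the hypothetical chance a key loses its dice roll.
const prunePercent = 0.01

// maxPasses bounds how many times we re-iterate the map, per the
// suggested safety cap of 20 loops.
const maxPasses = 20

// pruneByDice iterates the in-memory key map and deletes unpinned
// entries with probability prunePercent, stopping once `target`
// deletions have occurred or maxPasses full iterations elapse.
// Go's randomized map iteration order spreads deletions around.
func pruneByDice(keys map[cidKey]struct{}, pinned map[cidKey]bool, target int) int {
	deleted := 0
	for pass := 0; pass < maxPasses && deleted < target; pass++ {
		for k := range keys {
			if deleted >= target {
				break
			}
			if pinned[k] {
				continue
			}
			if rand.Float64() < prunePercent {
				delete(keys, k)
				deleted++
			}
		}
	}
	return deleted
}
```

One nice property of this shape is that it never touches disk: eviction is a pure in-memory decision, and the caller can delete the losing blocks from the blockstore afterwards.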

elijaharita commented 1 year ago

i didn't take the effort to estimate how much memory cids would take up when i wrote this. in my head it sounded big. but you're right, with a blockstore target size of around 100GiB it seems fairly negligible, and a server that wants a bigger cache should have plenty of memory available too.
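A rough back-of-envelope for "fairly negligible" might look like this. Both the average block size and the per-entry cost below are assumptions for illustration, not measured values from autoretrieve:

```go
package main

import "fmt"

// estimateKeyMemMiB gives a back-of-envelope figure for holding every
// key in memory: number of blocks times an assumed per-entry cost.
// All three inputs are assumptions, not measurements.
func estimateKeyMemMiB(blockstoreBytes, avgBlockSize, bytesPerEntry int64) int64 {
	blocks := blockstoreBytes / avgBlockSize
	return blocks * bytesPerEntry >> 20
}

func main() {
	// 100 GiB target cache, an assumed 256 KiB average block, and an
	// assumed ~64 bytes per CID entry including map overhead.
	fmt.Printf("~%d MiB of in-memory keys\n",
		estimateKeyMemMiB(100<<30, 256<<10, 64))
	// → ~25 MiB of in-memory keys
}
```

Even if the average block were much smaller, the key set stays tiny next to the 100 GiB of block data it indexes.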

if we do keep all the cids in memory, it opens the door to a much less dumb pruner policy - fifo? doesn't need to be implemented now but probably should be at some point.
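The FIFO idea could be sketched like this once keys live in memory: track insertion order alongside the membership set and evict the oldest keys first. All names here are hypothetical, with `cidKey` again standing in for `cid.Cid`:

```go
package main

// cidKey stands in for cid.Cid in this sketch.
type cidKey string

// fifoPruner is a hypothetical sketch of a FIFO eviction policy:
// remember insertion order and evict the oldest keys first.
type fifoPruner struct {
	order []cidKey            // keys in insertion order, oldest first
	seen  map[cidKey]struct{} // current membership
}

func newFifoPruner() *fifoPruner {
	return &fifoPruner{seen: make(map[cidKey]struct{})}
}

// Put records a newly stored block's key, ignoring duplicates.
func (p *fifoPruner) Put(k cidKey) {
	if _, ok := p.seen[k]; ok {
		return
	}
	p.seen[k] = struct{}{}
	p.order = append(p.order, k)
}

// Evict removes up to n of the oldest keys and returns them so the
// caller can delete the corresponding blocks from the blockstore.
func (p *fifoPruner) Evict(n int) []cidKey {
	var out []cidKey
	for len(out) < n && len(p.order) > 0 {
		k := p.order[0]
		p.order = p.order[1:]
		if _, ok := p.seen[k]; !ok {
			continue // already removed out of band
		}
		delete(p.seen, k)
		out = append(out, k)
	}
	return out
}
```

A real version would also need to skip pinned keys and reclaim the slice's leading slack, but the point is that once keys are in memory, ordered policies like FIFO (or LRU) become cheap to express.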

elijaharita commented 1 year ago

and i think @rvagg is right, file io definitely should go away now if keys are in memory already