application-research / autoretrieve

A server to make GraphSync data accessible on IPFS

Provider rewrite #158

Closed elijaharita closed 1 year ago

elijaharita commented 1 year ago

The provider implementation I wrote originally did not work very well. Due to the structure of the old provider code (everything stored on goroutine stacks), it was very difficult to keep track of the connections between requests, retrievals, and responses, and many requests would never receive any response. I could not identify an easy solution working off of the existing code, so I started from a blank slate. It works great now.

The new architecture keeps request, retrieval, and response tasks in three clean peer task queues. This makes it a lot easier for the provider to track what's going on, and requests are no longer dropped.
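For illustration only, here is a minimal sketch of that three-queue shape using the go-peertaskqueue library; the `provider`, `newProvider`, and `requestWorker` names and the signalling wiring are made up for the example and are not the PR's actual code.

```go
package main

import (
	"github.com/ipfs/go-peertaskqueue"
	"github.com/ipfs/go-peertaskqueue/peertask"
)

// provider (illustrative) holds one peer task queue per processing stage.
type provider struct {
	requestQueue   *peertaskqueue.PeerTaskQueue // incoming bitswap wants
	retrievalQueue *peertaskqueue.PeerTaskQueue // blocks to fetch remotely
	responseQueue  *peertaskqueue.PeerTaskQueue // blocks / HAVEs to send back
}

func newProvider() *provider {
	return &provider{
		requestQueue:   peertaskqueue.New(),
		retrievalQueue: peertaskqueue.New(),
		responseQueue:  peertaskqueue.New(),
	}
}

// requestWorker pops request tasks and hands each one to the retrieval
// stage, so the link between a request and its retrieval lives in a queue
// rather than on a goroutine stack.
func (p *provider) requestWorker(signal <-chan struct{}) {
	for range signal { // signalled whenever new request tasks are pushed
		for {
			peerID, tasks, _ := p.requestQueue.PopTasks(1)
			if len(tasks) == 0 {
				break
			}
			for _, t := range tasks {
				p.retrievalQueue.PushTasks(peerID, peertask.Task{
					Topic:    t.Topic, // the requested CID
					Priority: t.Priority,
					Work:     t.Work,
				})
			}
			p.requestQueue.TasksDone(peerID, tasks...)
		}
	}
}
```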

TODO

A failed retrieval currently does not result in a DONT_HAVE being sent. This is not a regression, because Autoretrieve has never done that. It will be a lot easier to add now, but I don't think it needs to be part of this PR.
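For reference, a minimal sketch of what sending DONT_HAVE after a failed retrieval could look like, assuming go-bitswap's message and network packages (exact import paths vary between go-bitswap versions and its boxo successor); `sendDontHave` and its wiring are hypothetical and not part of this PR.

```go
package main

import (
	"context"

	bsmsg "github.com/ipfs/go-bitswap/message"
	bsnet "github.com/ipfs/go-bitswap/network"
	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p/core/peer"
)

// sendDontHave (hypothetical) tells a peer we cannot supply the block it
// asked for, e.g. after a retrieval has failed.
func sendDontHave(ctx context.Context, net bsnet.BitSwapNetwork, to peer.ID, c cid.Cid) error {
	msg := bsmsg.New(false) // false: not a full wantlist message
	msg.AddDontHave(c)      // add a DONT_HAVE block presence for the CID
	return net.SendMessage(ctx, to, msg)
}
```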

Required config.yaml changes

Removed

Added

codecov-commenter commented 1 year ago

Codecov Report

Base: 5.01% // Head: 4.63% // This PR decreases project coverage by 0.38% :warning:

Coverage data is based on head (5c35eef) compared to base (fd80e5e). Patch coverage: 0.00% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##           master     #158      +/-   ##
=========================================
- Coverage    5.01%    4.63%    -0.39%
=========================================
  Files          15       15
  Lines        1675     1814     +139
=========================================
  Hits           84       84
- Misses       1586     1725     +139
  Partials        5        5
```

| [Impacted Files](https://codecov.io/gh/application-research/autoretrieve/pull/158?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=application-research) | Coverage Δ | |
|---|---|---|
| [bitswap/provider.go](https://codecov.io/gh/application-research/autoretrieve/pull/158?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=application-research#diff-Yml0c3dhcC9wcm92aWRlci5nbw==) | `0.00% <0.00%> (ø)` | |
| [config.go](https://codecov.io/gh/application-research/autoretrieve/pull/158?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=application-research#diff-Y29uZmlnLmdv) | `0.00% <0.00%> (ø)` | |
| [main.go](https://codecov.io/gh/application-research/autoretrieve/pull/158?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=application-research#diff-bWFpbi5nbw==) | `1.27% <0.00%> (-0.16%)` | :arrow_down: |
| [autoretrieve.go](https://codecov.io/gh/application-research/autoretrieve/pull/158?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=application-research#diff-YXV0b3JldHJpZXZlLmdv) | `0.00% <0.00%> (ø)` | |


kylehuntsman commented 1 year ago

Thanks for keeping the BitswapResponse metrics 👍

rvagg commented 1 year ago

Looks pretty good to me, nothing obvious standing out as a big problem (except for exacerbating the no-tests issue we have with this codebase). I wouldn't mind seeing the blockstore error log I mentioned inline put back in, though.

The hard-wired 8 is tempting to make configurable, but I don't really know what alternative numbers would be worth using! The retrieval thread isn't blocking for long, so the 8 there is a bit of overkill. However, we're going to be having conversations in the coming weeks about the retriever API that may impact this. My personal preference would be to make a Retrieve() call blocking; in that case, there's an opportunity here to make the number of goroutines started for retrieval the maximum number of parallel retrievals we're willing to perform (the AwaitGet stuff would have to be spawned off in its own goroutine before calling Retrieve() and blocking, but that would be fine).
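To make that suggestion concrete, here is a hedged sketch of a fixed pool of workers where each worker makes a blocking retrieve call, so the pool size caps the number of parallel retrievals. The `retrieve` and `awaitGet` signatures are assumptions for illustration; they are not the actual retriever API under discussion.

```go
package main

import (
	"context"

	"github.com/ipfs/go-cid"
)

const maxParallelRetrievals = 8 // the hard-wired 8 discussed above

// runRetrievalWorkers starts a fixed number of workers; because each worker
// blocks inside retrieve(), at most maxParallelRetrievals retrievals run at
// once. awaitGet is spawned first so incoming blocks are watched for while
// the retrieval is in flight.
func runRetrievalWorkers(
	ctx context.Context,
	queue <-chan cid.Cid,
	awaitGet func(cid.Cid), // assumed signature, stands in for the AwaitGet machinery
	retrieve func(context.Context, cid.Cid) error, // assumed blocking Retrieve()
) {
	for i := 0; i < maxParallelRetrievals; i++ {
		go func() {
			for c := range queue {
				go awaitGet(c) // watch for blocks arriving from this retrieval
				if err := retrieve(ctx, c); err != nil {
					continue // a failure here is where a DONT_HAVE could be sent
				}
			}
		}()
	}
}
```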

elijaharita commented 1 year ago

@rvagg I fixed up the things you pointed out, aside from tests. Would it be cool to put tests in a subsequent PR so we can get this merged?

rvagg commented 1 year ago

CI hopefully fixed by https://github.com/application-research/autoretrieve/pull/166

elijaharita commented 1 year ago

> Hello, I'm really sorry I'm late on the review, but I might have caught the stuff that broke the provider rewrite.
>
> The short version of what I think is broken: peertaskqueue includes a poorly documented freezing feature that gets triggered whenever a task is removed from the queue due to a cancel. There are basically two ways you can deal with this:
>
>   1. Pass the IgnoreFreezing() option when you construct the peer task queues.
>   2. Appropriately thaw the queue to undo the freezing -- I've put notes where I think this should go, plus references to the original go-bitswap code.
>
> One question overall I'd like to raise: do we really need all these levels of parallelism? And if so, should they all be mediated by the peer task queue rather than by simple channels / FIFO queues?
>
> Finally, how can we get some tests in here? We've now broken autoretrieve running live twice by changing Bitswap provider code, which isn't totally surprising given that bitswap provider implementations are notoriously finicky and complex (a protocol weakness, but one we have to deal with). It seems like we need, at minimum, tests in the code, and possibly a live testing process.
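For concreteness, a minimal sketch of the two workarounds quoted above, using go-peertaskqueue's IgnoreFreezing option and its Remove/ThawRound methods; the function names and surrounding wiring are illustrative, not autoretrieve's actual code, and option 1 is, as far as I can tell, also what go-bitswap's own decision engine does.

```go
package main

import (
	"github.com/ipfs/go-peertaskqueue"
	"github.com/ipfs/go-peertaskqueue/peertask"
	"github.com/libp2p/go-libp2p/core/peer"
)

// Option 1: construct the queue so that cancels never freeze a peer.
func newUnfreezableQueue() *peertaskqueue.PeerTaskQueue {
	return peertaskqueue.New(peertaskqueue.IgnoreFreezing(true))
}

// Option 2: leave freezing enabled, but thaw after handling a cancel so the
// frozen peer becomes eligible for PopTasks again.
func handleCancel(ptq *peertaskqueue.PeerTaskQueue, from peer.ID, topic peertask.Topic) {
	ptq.Remove(topic, from) // removing a task because of a cancel freezes the peer
	ptq.ThawRound()         // thaw frozen peers by one round
}
```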

OK, before any more work on bitswap code, I will add tests to make sure everything is working correctly.

Switching to FIFO might be a better solution; I'll explore that once tests are ready.

Parallelism is not strictly necessary, and can be disabled in config by setting the worker counts to 1. However, I think splitting the processing steps into discrete queues is the right approach.
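As a rough illustration of the channel/FIFO alternative floated in this thread (the names and types below are made up for the example), a buffered channel already gives FIFO ordering and back-pressure; what it gives up compared to a peer task queue is per-peer fairness and task merging.

```go
package main

import (
	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p/core/peer"
)

// retrievalTask (illustrative) pairs the requesting peer with the wanted CID.
type retrievalTask struct {
	from peer.ID
	cid  cid.Cid
}

// startWorkers starts n workers that drain a plain FIFO queue backed by a
// buffered channel; with n = 1 this degenerates to strictly serial processing.
func startWorkers(n int, tasks <-chan retrievalTask, handle func(retrievalTask)) {
	for i := 0; i < n; i++ {
		go func() {
			for t := range tasks {
				handle(t)
			}
		}()
	}
}
```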