Closed rvagg closed 2 years ago
After discussing in a little more detail, we came up with a flat ProviderRetrieval
model that +/- some of the properties in the original proposal here. This is not a finalized model and can be iterated on!
ProviderRetrieval
{
providerPeerID: peerID
retrievalId: UUID
cid: CID
autoretrieveInstanceId: string
lastStage: string
errorMessage: string
startedAt: datetime
endedAt: datetime
sizeBytes: bytes
completed: boolean
}
Merging #97 (f94ff73) into master (59ea9dd) will increase coverage by
4.41%
. The diff coverage is29.60%
.
@@ Coverage Diff @@
## master #97 +/- ##
==========================================
+ Coverage 7.80% 12.22% +4.41%
==========================================
Files 12 16 +4
Lines 1767 2102 +335
==========================================
+ Hits 138 257 +119
- Misses 1626 1831 +205
- Partials 3 14 +11
Impacted Files | Coverage Δ | |
---|---|---|
autoretrieve.go | 0.00% <0.00%> (ø) |
|
bitswap/provider.go | 0.00% <0.00%> (ø) |
|
config.go | 0.00% <0.00%> (ø) |
|
...ecoin/eventrecorder/filelogserver/filelogserver.go | 0.00% <0.00%> (ø) |
|
filecoin/retriever.go | 0.00% <0.00%> (ø) |
|
filecoin/retrieverevents.go | 0.00% <0.00%> (ø) |
|
filecoin/activeretrievals.go | 69.84% <69.84%> (ø) |
|
filecoin/eventrecorder/eventrecorder.go | 72.09% <72.09%> (ø) |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 59ea9dd...f94ff73. Read the comment docs.
currently depends on https://github.com/application-research/filclient/pull/86 for the filclient dependency
I'd like to discuss what we're actually reporting since we have both the query-ask phase and the actual retrieval phase. Also, there's a lot of disparate info to be collected through these events and I'm not really sure a flat endpoint makes all that much sense. To illustrate, I've broken the (possible) reporting into an interface where you can see what we could report at each point: https://github.com/application-research/autoretrieve/pull/97/files#diff-de7628819e4344a18be7e7f90c87c7c502d5bee26a94a4eb1cce9031ca1646c2R15-R57
Note there's already some overlap there - the cid keeps getting repeated when maybe we should just be doing that with the start event(s).
Currently I'm bundling this all into a single message type, with lots of optional (empty) fields depending on the event. In IPLD schema terms that's here: https://github.com/application-research/autoretrieve/pull/97/files#diff-15ae65e0803959b92bfa25397445c590a3f793379e1129a7e23252766f0efcd9R37-R69
But then there's also the URL. I've started with the proposed, but have varied it to add the query-ask phase. So we have these 3 variations:
https://.../query-event/~uuid~
(start of query-ask, we haven't asked anyone yet but we want a timestamp)https://.../query-event/~uuid~/providers/~peerID~
(during query-ask, in parallel to all of the candidates)https://.../retrieval-event/~uuid~/providers/~peerID~
(during retrieval to each candidate until we get a success)(code still not really ready for review, dealing with nasty race problems trying to match filclient's event reporting with the data we have available here)
@hannahhoward so this is still a draft, I'm not proposing a final form of this here, just wanting discussion on the options available. I think probably a sync discussion is in order:
Still not quite done & ready for proper review, but working and reporting as we want it I believe.
runningRetrievals
map) .. although it's been quite useful for chasing down leaks so I'm now wondering about keeping it.tbh most of the time I've spent on this today, and a bit of yesterday, is chasing down leaks caused by incomplete event coverage between filclient and here. I've perfected some methods for tracking those using both logs and some new debug reporting in here and I'm fairly confident that the coverage is spot on now.
I have an outstanding concern about the query calls, which we don't apply a timeout to, at least I'm not seeing it if it's there. So far I'm not seeing it as a cause of leaks though, perhaps there's an implicit socket timeout libp2p is giving us. But we may want to consider addressing that—in the current code here there's a commented-out block that uses a deadline to impose the timeout (which is wired up through filclient already too). If we care about quick retrieval we should be fairly ruthless on query timeouts; but if we care about data collection then some lenience might be in order too! Hopefully our data collection shows us query times nicely.
Lastly, I have an example log output for this - using the filelogserver.go in this PR to generate a line-delimited JSON file containing the events POSTed to it: https://gist.github.com/rvagg/a9b941f1b423679659812fff1b9bf0a1
Cleaned up, simplified a little and added more tests. The only remaining thing I wouldn't mind doing is linking it to a filclient branch, this currently doesn't compile in CI and requires a replace
in go.mod to my fork @ https://github.com/application-research/filclient/pull/87. If/when I get write access I'll push a branch there and update here so it'll work properly.
Event log output should be the same as yesterday so my gist sample should still apply: https://gist.github.com/rvagg/a9b941f1b423679659812fff1b9bf0a1
This now has the branch @ https://github.com/application-research/filclient/pull/88 in go.mod, so it's compiling and testing in CI here and shouldn't have all of the error messages inline which should make it easier to review.
Closes: https://github.com/application-research/autoretrieve/issues/96
I thought I'd have a go at this, I think we discussed dividing it up this way since @kylehuntsman is so keen to do https://github.com/filecoin-project/autoretrieve-deploy/issues/2 (and I'm not). This adds an HTTP PUT to
https://foo.bar/retrieval-event/~uuid~/providers/~minerid
for both successful and unsuccessful retrievals, using data I could collect at that point and seems reasonable to want. We discussed, (and I think agreed on?) flattening it to just the one endpoint and letting the server handle aggregation of uuids. The URL string before/retrieval-event...
comes from the config, a new, optional field:event-recorder-endpoint-url
—if not set, it's all a noop.See the
event
struct for what's included in the message. One thing I couldn't add in is "stage" because I couldn't figure out what that might refer to!