Joystream / joystream

Joystream Monorepo
http://www.joystream.org
GNU General Public License v3.0

[Colossus] Avoid fetching list of all data objects (i.e. calling `api/v1/state/data-objects`) when performing sync #5008

zeeshanakram3 closed 10 months ago

zeeshanakram3 commented 10 months ago

Problem

While syncing data objects, the storage node needs to know which peer nodes hold each required object so that it can pick a URL to download it from. The problem is that, for each asset it needs to sync, the node calls api/v1/state/data-objects on all the peer nodes until it picks a URL to download the asset from.

https://github.com/Joystream/joystream/blob/46e75506e9639dae4bf67a8ff7e322166ad522ee/storage-node/src/services/sync/tasks.ts#L220-L225

/state/data-objects does not return a constant-size response, so response size and latency grow linearly with the number of data objects a node stores. For reference, some nodes currently return a data-objects response of over 5 MB.

Solution

  1. Maybe use HEAD /files/{Id} to check the availability of an asset on a given node, or
  2. Create a new lightweight endpoint that could return a boolean response indicating the presence of an asset.
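
To make option 1 concrete, here is a minimal sketch of a per-object availability check. It is not the storage node's actual code: `HeadProbe` and `findPeerWithObject` are hypothetical names, the probe is injected so that no particular HTTP client is assumed, and each peer base URL is assumed to already include the API prefix.

```typescript
// Hypothetical probe type: performs a HEAD request against `url`
// and resolves to the HTTP status code.
type HeadProbe = (url: string) => Promise<number>;

// Returns the first peer base URL that answers HEAD /files/{id} with a
// 2xx status, or undefined if no peer does.
// Assumption: each base URL already includes the API prefix.
async function findPeerWithObject(
  peerBaseUrls: string[],
  dataObjectId: string,
  probe: HeadProbe
): Promise<string | undefined> {
  for (const baseUrl of peerBaseUrls) {
    const url = `${baseUrl}/files/${encodeURIComponent(dataObjectId)}`;
    try {
      const status = await probe(url);
      if (status >= 200 && status < 300) {
        return baseUrl;
      }
    } catch {
      // Treat a network error as "not available on this peer" and move on.
    }
  }
  return undefined;
}
```

The response to each probe has no body, so its size stays constant regardless of how many objects the peer stores. The trade-off is one request per object and peer tried.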

yasiryagi commented 10 months ago

Can we call /state/data-objects once per sync timer and cache it? That would reduce the overhead and could be more efficient than a HEAD request per object.

zeeshanakram3 commented 10 months ago

> Can we call /state/data-objects once per sync timer and cache it? That would reduce the overhead and could be more efficient than a HEAD request per object.

Yeah, I guess that should work too, but instead of getting all the data objects per sync (/state/data-objects), we should only get the data-objects of bags that we need to sync (/state/bags/{bagId}/data-objects) and then cache it.
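
A rough sketch of that idea, assuming a cache that lives for one sync run: the listing for each (peer, bag) pair is fetched at most once and then reused for every object in that bag. `BagObjectsFetcher` and `BagObjectsCache` are hypothetical names, and the fetcher stands in for a call to /state/bags/{bagId}/data-objects on a peer.

```typescript
// Hypothetical fetcher: resolves to the data object IDs a peer reports
// for one bag, e.g. via /state/bags/{bagId}/data-objects on that peer.
type BagObjectsFetcher = (peerBaseUrl: string, bagId: string) => Promise<string[]>;

class BagObjectsCache {
  // One pending or resolved listing per (peer, bag) pair.
  private readonly entries = new Map<string, Promise<Set<string>>>();
  private readonly fetchBagObjects: BagObjectsFetcher;

  constructor(fetchBagObjects: BagObjectsFetcher) {
    this.fetchBagObjects = fetchBagObjects;
  }

  // True if the peer lists the object in the given bag. The listing is
  // requested at most once per (peer, bag) for the lifetime of the cache.
  async peerHasObject(
    peerBaseUrl: string,
    bagId: string,
    dataObjectId: string
  ): Promise<boolean> {
    const key = `${peerBaseUrl}|${bagId}`;
    let entry = this.entries.get(key);
    if (entry === undefined) {
      entry = this.fetchBagObjects(peerBaseUrl, bagId).then((ids) => new Set(ids));
      // Drop failed lookups so that a later call can retry.
      entry.catch(() => {
        this.entries.delete(key);
      });
      this.entries.set(key, entry);
    }
    return (await entry).has(dataObjectId);
  }
}
```

Creating a fresh `BagObjectsCache` at the start of each sync run would scope the cache to that run, so no expiry time is needed.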

mnaamani commented 10 months ago

> Can we call /state/data-objects once per sync timer and cache it? That would reduce the overhead and could be more efficient than a HEAD request per object.

Currently it does cache the result, but only for a short period (3 min). So if the sync interval is longer than this caching period (I think operators set it to at least 10 min), the data is always fetched again.
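
The interaction between the two timings can be illustrated with a small simulation. This is only a sketch: `TtlCache` and `countHits` are hypothetical and not the node's real cache, and the clock is passed in explicitly so the example runs without waiting.

```typescript
const CACHE_TTL_MS = 3 * 60 * 1000; // caching period mentioned above
const SYNC_INTERVAL_MS = 10 * 60 * 1000; // assumed minimum sync interval

class TtlCache<T> {
  private entry: { value: T; storedAt: number } | undefined;
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns the cached value if it is still fresh at time `now`;
  // otherwise calls load() and stores the result.
  get(now: number, load: () => T): { value: T; hit: boolean } {
    const current = this.entry;
    if (current !== undefined && now - current.storedAt < this.ttlMs) {
      return { value: current.value, hit: true };
    }
    const value = load();
    this.entry = { value, storedAt: now };
    return { value, hit: false };
  }
}

// Counts cache hits over `runs` sync runs spaced `intervalMs` apart.
function countHits(ttlMs: number, intervalMs: number, runs: number): number {
  const cache = new TtlCache<string[]>(ttlMs);
  let hits = 0;
  for (let run = 0; run < runs; run++) {
    if (cache.get(run * intervalMs, () => ["object-1"]).hit) {
      hits++;
    }
  }
  return hits;
}

// With a 3 min TTL and a 10 min interval, every run misses the cache.
const hitsAtTenMinutes = countHits(CACHE_TTL_MS, SYNC_INTERVAL_MS, 5);
```

Here `hitsAtTenMinutes` is 0: each run starts after the previous entry has expired, so the cache only helps within a single sync run.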

mnaamani commented 10 months ago

Suggested solution on Zoom call:

No need to pre-determine whether an operator has an object before attempting to fetch it. Just make a best-effort attempt to fetch it from the other operators that should be storing the bag the object belongs to.
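
As a sketch of what best effort could look like (not the eventual implementation): try each operator assigned to the bag in turn and stop at the first successful download, with no prior listing or availability request. `Downloader` and `fetchBestEffort` are hypothetical names, and the download function is injected.

```typescript
// Hypothetical downloader: resolves with the object's bytes, or rejects
// if the peer cannot serve it (missing object, timeout, and so on).
type Downloader = (peerBaseUrl: string, dataObjectId: string) => Promise<Uint8Array>;

// Tries each peer that should store the object's bag, in order, and
// returns the first successful download. Returns undefined if all fail.
async function fetchBestEffort(
  bagPeers: string[],
  dataObjectId: string,
  download: Downloader
): Promise<{ peer: string; data: Uint8Array } | undefined> {
  for (const peer of bagPeers) {
    try {
      const data = await download(peer, dataObjectId);
      return { peer, data };
    } catch {
      // This peer could not serve the object; try the next one.
    }
  }
  return undefined;
}
```

A failed attempt costs one request, the same as an availability check would, while a successful attempt needs no extra round trip.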

kdembler commented 10 months ago

@mnaamani @zeeshanakram3 can we close this after sync rework?