Joystream / joystream

Joystream Monorepo

http://www.joystream.org

GNU General Public License v3.0


[Colossus] Avoid fetching list of all data objects (i.e. calling `api/v1/state/data-objects`) when performing sync #5008

Closed zeeshanakram3 closed 10 months ago

zeeshanakram3 commented 10 months ago

Problem

While syncing data objects, the storage node needs to find out whether the required objects exist on peer nodes and then pick a URL to download each object from. The problem is that for each asset the node needs to sync, it calls api/v1/state/data-objects on the peer nodes until it picks a URL to download the asset from.

https://github.com/Joystream/joystream/blob/46e75506e9639dae4bf67a8ff7e322166ad522ee/storage-node/src/services/sync/tasks.ts#L220-L225

/state/data-objects does not return a constant-size response, so response size and latency grow linearly with the number of data objects a node stores. For reference, some nodes currently return a data-objects response of over 5 MB.

Solution

Maybe use HEAD /files/{Id} to check the availability of an asset on a given node, OR
create a new lightweight endpoint that returns a boolean response indicating the presence of the asset.
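
To make the first option concrete, here is a minimal sketch of a per-object availability check. It is not the Colossus implementation: the helper name `findPeerWithObject`, the `HeadFn` type, and the `/api/v1/files/{id}` URL layout are assumptions for illustration, and the `head` parameter is injectable only so the logic can be exercised without a network.

```typescript
// Hypothetical helper, not taken from the storage-node code base.
// Assumes each peer serves objects at `${peer}/api/v1/files/${id}`.
type HeadFn = (url: string) => Promise<{ ok: boolean }>

// Returns the first peer base URL that answers HEAD with a 2xx status,
// or undefined if no peer reports the object as available.
async function findPeerWithObject(
  peers: string[],
  dataObjectId: string,
  head: HeadFn = (url) => fetch(url, { method: 'HEAD' })
): Promise<string | undefined> {
  for (const peer of peers) {
    try {
      const res = await head(`${peer}/api/v1/files/${dataObjectId}`)
      if (res.ok) return peer
    } catch {
      // A network error is treated as "not available here"; try the next peer.
    }
  }
  return undefined
}
```

A HEAD response carries no body, so the cost per check stays constant regardless of how many objects the peer stores. The trade-off is one request per object per peer tried.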

yasiryagi commented 10 months ago

Can we call /state/data-objects once per sync timer and cache it? That will reduce the overhead and could be more efficient than a HEAD request per object.

zeeshanakram3 commented 10 months ago

> Can we call /state/data-objects once per sync timer and cache it? That will reduce the overhead and could be more efficient than a HEAD request per object.

Yeah, I guess that should work too, but instead of fetching all the data objects on every sync (/state/data-objects), we should fetch only the data objects of the bags that we need to sync (/state/bags/{bagId}/data-objects) and then cache that.
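
A rough sketch of that per-bag approach, under stated assumptions: the class name `PeerBagCache`, the `FetchIds` callback, and the response shape (a plain array of data object ID strings) are hypothetical, and the URL simply follows the /state/bags/{bagId}/data-objects path mentioned above.

```typescript
// Hypothetical sketch, not the storage-node implementation.
// Assumes the endpoint yields an array of data object IDs as strings.
type FetchIds = (url: string) => Promise<string[]>

class PeerBagCache {
  // Key: `${peer}|${bagId}`; value: in-flight or resolved set of object IDs.
  private cache = new Map<string, Promise<Set<string>>>()

  constructor(private fetchIds: FetchIds) {}

  // Fetches a peer's object list for one bag at most once until clear().
  private idsFor(peer: string, bagId: string): Promise<Set<string>> {
    const key = `${peer}|${bagId}`
    let entry = this.cache.get(key)
    if (!entry) {
      const url = `${peer}/api/v1/state/bags/${bagId}/data-objects`
      entry = this.fetchIds(url).then((ids) => new Set(ids))
      // Do not keep a failed request cached; a later call may retry.
      entry.catch(() => { this.cache.delete(key) })
      this.cache.set(key, entry)
    }
    return entry
  }

  // True if the peer lists the object in the given bag.
  async has(peer: string, bagId: string, dataObjectId: string): Promise<boolean> {
    return (await this.idsFor(peer, bagId)).has(dataObjectId)
  }

  // Call at the start of each sync run so stale lists are dropped.
  clear(): void {
    this.cache.clear()
  }
}
```

Storing the promise rather than the resolved set means concurrent lookups for the same peer and bag share a single request.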

mnaamani commented 10 months ago

> Can we call /state/data-objects once per sync timer and cache it? That will reduce the overhead and could be more efficient than a HEAD request per object.

Currently it does cache the result, but only for a short period (3 min). So if the sync interval is longer than this caching period (I think operators are setting it to 10 min at minimum), the data is always fetched again.
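
The interaction between cache lifetime and sync interval can be illustrated with a small simulation. This is a simplified model, not the actual cache used by the node: `TtlCache` and `countRefetches` are hypothetical names, the clock is injected so no real time passes, and each sync run is modelled as a single lookup.

```typescript
// Minimal TTL cache with an injectable clock (illustrative only).
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>()

  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key)
    if (!entry || entry.expiresAt <= this.now()) {
      this.entries.delete(key)
      return undefined
    }
    return entry.value
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs })
  }
}

// Counts how many of `runs` sync runs miss the cache and must refetch.
function countRefetches(ttlMs: number, syncIntervalMs: number, runs: number): number {
  let clock = 0
  const cache = new TtlCache<string[]>(ttlMs, () => clock)
  let fetches = 0
  for (let i = 0; i < runs; i++) {
    if (cache.get('peer') === undefined) {
      fetches++ // cache miss: the full list would be downloaded again
      cache.set('peer', [])
    }
    clock += syncIntervalMs // advance to the next sync run
  }
  return fetches
}

const MIN = 60 * 1000
console.log(countRefetches(3 * MIN, 10 * MIN, 6)) // 6: every run refetches
console.log(countRefetches(15 * MIN, 10 * MIN, 6)) // 3: every other run hits the cache
```

With a 3 min lifetime and a 10 min interval, the entry has always expired by the next run, which matches the behaviour described above.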

mnaamani commented 10 months ago