Yellow-Dog-Man / Resonite-Issues

Issue repository for Resonite.
https://resonite.com

Proposal to increase the default number of concurrent downloads or dynamically scale them based on the network. #2162

Open Stellanora64 opened 1 month ago

Stellanora64 commented 1 month ago

Is your feature request related to a problem? Please describe.

A common issue I have seen in reviews and reports from new players joining Resonite is slow loading times.

Currently the maximum number of concurrent downloads is relatively low for most people's networks, and once it is increased to a reasonable value for their network, loading times decrease substantially.

Describe the solution you'd like

Increase the default maximum concurrent downloads to a higher value than the current one - obviously not so high that it causes network issues, but high enough to still substantially reduce loading times.

A more involved, but more complex, solution would be Resonite testing the user's network and then dynamically setting the maximum concurrent downloads value based on its performance. This would let Resonite scale up for people with more network capacity while still not oversaturating the connection for people with less. The same could also be done for concurrent asset transfers for users with higher upload speeds.

While dynamic scaling is the better solution, it may take more engineering time than it's worth at the moment compared to simply raising the default.
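For illustration only, here is a minimal sketch of what such dynamic scaling could look like, assuming an AIMD-style (additive-increase / multiplicative-decrease) adjustment driven by measured download throughput; the class, its thresholds, and the starting value are hypothetical and not part of Resonite:

```csharp
using System;

// Hypothetical sketch: raise the concurrent-download cap while extra
// concurrency still improves measured throughput, back off when it hurts.
public sealed class AdaptiveDownloadLimiter
{
    public int CurrentLimit { get; private set; } = 4;   // assumed starting default

    private const int MinLimit = 2;
    private const int MaxLimit = 64;
    private double _lastThroughputBytesPerSec;

    // Call periodically with the throughput measured over the last interval.
    public void Update(double throughputBytesPerSec)
    {
        if (throughputBytesPerSec > _lastThroughputBytesPerSec * 1.05)
            CurrentLimit = Math.Min(CurrentLimit + 1, MaxLimit);   // still gaining: probe upward
        else if (throughputBytesPerSec < _lastThroughputBytesPerSec * 0.75)
            CurrentLimit = Math.Max(CurrentLimit / 2, MinLimit);   // throughput dropped: back off

        _lastThroughputBytesPerSec = throughputBytesPerSec;
    }
}
```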

Describe alternatives you've considered

Manually testing and changing the value in settings.

Additional Context

No response

Requesters

stellanora

Frooxius commented 1 month ago

What metrics are you using for this?

I actually settled on the default value as something that doesn't seem to cause issues for most people - and it's also the default in .NET for the number of connections to the same server.

I'm a bit concerned about increasing this by default - I don't even know whether it would actually solve the slow loading times people report, rather than make them worse.

The dynamic scaling is unlikely to happen - we don't have good signals to actually scale this up and down, and implementing something like that adds A TON of engineering complexity. We're unlikely to put in that much engineering effort when comparable effort has already been put into these protocols.

We'd more likely want to explore utilizing HTTP/2 and similar for efficiency - which has already had tons of work put into it - rather than rolling our own solution.

Stellanora64 commented 1 month ago

This is mainly based on my experience after increasing the value; I don't have any precise performance metrics right now, but I can get some later if you'd like. Anecdotally though, my loading times roughly halved after increasing the setting to 32 on a 100 Mbps connection (it also used around 70-90 Mbps of my bandwidth instead of the usual 10 Mbps).

But HTTP 2.0 seems like the better option - I wasn't aware it isn't in Resonite yet. Should an issue be made for its implementation for future reference, or should we just not worry about it?

Dessix commented 1 month ago

HTTP/3 is standardized and starting to be implemented in a lot of places - and .NET 7 picked up the first support for it; perhaps that would be a good option?

As for the .NET defaults - yeah, they're obscenely low for legacy reasons, but they expect you to bump them up.

I've found that about 64 concurrent downloads is the peak beyond which no noticeable improvement occurs on gigabit, because it gets bottlenecked on actually loading the assets at that pace on a ~5-year-old machine. 128 didn't seem to slow it down or speed it up, so there's a limit to its utility. Perhaps the performance optimizations or the Sauce render backend will improve those aspects in the future, but I can at least confirm that loading got faster when the number was raised.
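For context on the .NET defaults mentioned above, a minimal sketch of how the per-server connection limit is usually raised; whether Resonite's downloader goes through these exact knobs is an assumption, and 32 is just an illustrative value:

```csharp
using System.Net;
using System.Net.Http;

// Legacy .NET Framework / Mono HTTP stack: connections per server are capped
// globally by ServicePointManager (historically a very low default).
ServicePointManager.DefaultConnectionLimit = 32;

// Modern .NET (SocketsHttpHandler): the cap is set per handler instance.
var handler = new SocketsHttpHandler { MaxConnectionsPerServer = 32 };
var client = new HttpClient(handler);
```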

Frooxius commented 1 month ago

If HTTP/3 is supported, we can go for that as well, but I don't think everything in the chain we use supports it yet.

We can bump it up if there are good metrics for it, but I also remember some people having issues when it's set higher.

However, to really focus on this, we'd need some metrics showing that it actually gives a substantial improvement in load times - there are a lot of things that might be slowing loading down, so I don't want us to go chasing the wrong thing.

Dessix commented 1 month ago

The primary place where HTTP/3 is still a bit lacking is layer 4 load balancing, but that may not be a huge issue given that file downloads are something you can balance via DNS instead. HAProxy has exposed a load-balancing implementation for about a year now, so it's not infeasible by any means.

H2 or H3 would be of most benefit if the requested items are many small files, while individual requests are fine if the content exceeds a megabyte or two on average.

Frooxius commented 1 month ago

We're using 3rd-party services for the downloads (Cloudflare) and metadata (Azure), so it really depends on what they are using and supporting - we're unlikely to roll our own solution & protocols for this.

There often are lots of small files to download, but there is also a lot of metadata to fetch and process, which can influence the loading speed.

That's why I think we're getting a bit too deep into this without first establishing some metrics showing that increased concurrent downloads actually do help.

bdunderscore commented 1 month ago

FYI: Cloudflare does support HTTP/2 and HTTP/3.

bdunderscore commented 1 month ago

Also, increasing concurrency has a higher impact when the user is further away from the origin server. Higher concurrency would therefore be particularly important for APAC users (if the origin is in the US or EU). At 100 ms and 4x concurrency you're only loading 40 assets/second at most...
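For reference, the rough bound behind that 40 assets/second figure, assuming each small asset costs about one round trip and each connection carries one request at a time:

$$\text{max assets per second} \approx \frac{\text{concurrency}}{\text{RTT}} = \frac{4}{0.1\,\text{s}} = 40$$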

Frooxius commented 1 month ago

@bdunderscore Ah, thanks for the info. Though one thing I can't find is whether their R2 public buckets support it or if it's limited to the CDN. From a cursory search it seems R2 might not support it yet, but I haven't done much research into this.

With Cloudflare though, the assets should be downloaded from a source relatively close to the user.

The main thing I'd want to see is some testing & data on this to justify the increase. I don't have a way to verify whether the changes would be beneficial or not otherwise, because I don't know if these assumptions will hold.

bdunderscore commented 1 month ago

If the asset in question isn't cached in a nearby edge location, Cloudflare will still need to go back to the origin. What I'd recommend is collecting data on time-to-first-response-byte for asset fetches - high latency there would indicate that you'd get value out of parallelizing small asset fetches. You might also want to consider a very simple policy of allowing X downloads of small assets and Y downloads of large assets in parallel.
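A minimal sketch of how time-to-first-response-byte could be sampled with HttpClient; the helper name is hypothetical, and `HttpCompletionOption.ResponseHeadersRead` makes the call return once the headers arrive, which approximates the first response byte:

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

static class FetchMetrics
{
    // Returns roughly the time until the first response bytes (headers) arrive.
    public static async Task<TimeSpan> MeasureTimeToFirstByteAsync(HttpClient client, Uri assetUri)
    {
        var sw = Stopwatch.StartNew();
        using var response = await client.GetAsync(assetUri, HttpCompletionOption.ResponseHeadersRead);
        sw.Stop();
        return sw.Elapsed;   // body not yet downloaded at this point
    }
}
```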

bdunderscore commented 1 month ago

The other thing that splitting small vs large does for you is helping to avoid the situation where all download threads are stuck downloading big textures, when downloading a bunch of small meshes instead would go further towards showing the user a general idea of what's going on in the world.
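A minimal sketch of that small/large split using two semaphores, assuming the asset size is known up front from metadata; the pool sizes ("X"/"Y") and the 256 KB cutoff are placeholders:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class SplitPoolDownloader
{
    // Separate concurrency pools so large downloads can't starve small ones.
    private static readonly SemaphoreSlim SmallPool = new SemaphoreSlim(16); // "X" small assets
    private static readonly SemaphoreSlim LargePool = new SemaphoreSlim(4);  // "Y" large assets

    public static async Task<byte[]> DownloadAsync(HttpClient client, Uri uri, long sizeBytes)
    {
        // Placeholder cutoff between "small" and "large" assets.
        var pool = sizeBytes < 256 * 1024 ? SmallPool : LargePool;
        await pool.WaitAsync();
        try
        {
            return await client.GetByteArrayAsync(uri);
        }
        finally
        {
            pool.Release();
        }
    }
}
```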

Frooxius commented 1 month ago

If you want us to run the tests and collect the data, this increases the "size" of this issue on our end. We can do that, but that means it's less likely to be picked up.

Since this is a proposal coming from the community, and there are a lot of statements being made about what will help but no supporting data at all, we're much less likely to act on it.

Doing some preliminary tests and gathering data on the impact of the changes would help us put more work into this.

bdunderscore commented 1 month ago

As an initial datapoint, loading assets from Resonite can take about 200 ms in some cases:

Image

This would make sense if the origin storage is in the EU, since I'm coming from the US west coast.

I'll try to capture some more traces to compare before/after but it's a bit tricky as I don't have control of the CDN's caching policy, so I can't force a cache miss (each test influences the next).

bdunderscore commented 1 month ago

Some more data. These are packet-rate traces gathered while loading several arbitrarily selected worlds in Resonite. Due to cache-priming effects, I used a different world for each test, but hopefully this anecdata will be enough to motivate some changes. First, a couple of baseline traces at concurrent downloads = 8:

[Image: concurrency 8] [Image: concurrency 8]

Next, one at concurrency 32:

Image

Finally, one where I started at 8, then changed to 64 at the 35s mark, and back to 32 around the 80 second mark:

Image

All traces were taken from the US west coast on a 1gbit network connection. You'll likely see less effect if you're near the asset origin servers.

bdunderscore commented 1 month ago

One note: The red bars indicate TCP anomalies - things like duplicate ACKs, out-of-order packets, and retransmissions.

bdunderscore commented 1 month ago

Also - due to NIC-level TCP reassembly offloads, frame lengths as seen by Wireshark are larger than on the wire - some of them around 7 kB, for example. But it gives a general idea of the order-of-magnitude gains we could see.

Readun commented 1 month ago

Small heads-up: I noticed there might be a problem with going to 64 on a headless. Mine just suddenly closed without any exception in the logs while in the middle of running (5-20 min). 32 is fine.

But I haven't made a GitHub issue yet, as I haven't had time to fully confirm it without outflow - so this was with a networking mod.

Gawdl3y commented 1 week ago

> @bdunderscore Ah, thanks for the info. Though one thing I can't find is whether their R2 public buckets support it or if it's limited to the CDN. From a cursory search it seems R2 might not support it yet, but I haven't done much research into this.
>
> With Cloudflare though, the assets should be downloaded from a source relatively close to the user.
>
> The main thing I'd want to see is some testing & data on this to justify the increase. I don't have a way to verify whether the changes would be beneficial or not otherwise, because I don't know if these assumptions will hold.

R2 buckets do support HTTP/2 at least, but I'm not certain about HTTP/3 due to sparse information. HTTP/2 alone would probably be a pretty substantial benefit for downloading assets, but HTTP/3 would obviously be even better. My understanding, based on old conversations I had with Froox, was that the use of HTTP/2 was limited less by the cloud and more by the client, since there aren't many older .NET libraries available that actually support newer HTTP versions - potentially making HTTP/2 support hinge on the move to a newer runtime.
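For reference, on a modern .NET runtime the HTTP version is largely a client-side opt-in; a minimal sketch (whether Resonite's asset client can be configured this way is an assumption, and HTTP/3 additionally needs OS-level QUIC support):

```csharp
using System.Net;
using System.Net.Http;

var client = new HttpClient
{
    // Ask for HTTP/2 or newer, falling back if the server only speaks HTTP/1.1.
    DefaultRequestVersion = HttpVersion.Version20,
    DefaultVersionPolicy = HttpVersionPolicy.RequestVersionOrHigher
};
```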