collinsmith / riiablo

Diablo II remade using Java and LibGDX
http://riiablo.com
Apache License 2.0

MPQ read speedups -- possible native code required #8

Status: Open. collinsmith opened this issue 5 years ago

collinsmith commented 5 years ago

I've noticed there are some slowdowns when reading files from MPQs, specifically in pkexplode. I tried changing readBytes to read as many sectors as possible at once (up to the uncompressed sector size), thinking this was one of the major issues (a compressed sector is only about 1500ish bytes, so I can read 3-4 at a time and stay under 4K), but that didn't have as large an effect as I had hoped. I've also looked into other file reading strategies, but I don't think they will provide a significant increase.
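For reference, a minimal sketch of the batched-sector idea under my assumptions (the class and parameter names are hypothetical; the real reader tracks sector offsets differently):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class BatchedSectorRead {
  /**
   * Reads up to maxBatch compressed sectors in a single I/O call.
   * sectorOffsets holds the file offset of each compressed sector plus a
   * trailing offset marking the end of the last sector.
   */
  static ByteBuffer readSectors(FileChannel channel, long[] sectorOffsets,
                                int firstSector, int maxBatch) throws IOException {
    int lastSector = Math.min(firstSector + maxBatch, sectorOffsets.length - 1);
    int length = (int) (sectorOffsets[lastSector] - sectorOffsets[firstSector]);
    ByteBuffer batch = ByteBuffer.allocate(length);
    channel.read(batch, sectorOffsets[firstSector]); // one read covering several sectors
    batch.flip();                                    // (a real impl would loop on short reads)
    return batch; // caller slices the per-sector ranges and decompresses each
  }
}
```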

This is an issue on android where monstats.txt alone (90K compressed, 432K uncompressed) is taking 3-4 seconds to read, completely stalling the application start. As an intermediate fix, I will likely add multi-threaded support to read the initial txt files in the background (MPQ reads are single-threaded). This will require a new AssetManager, since the current one is single-threaded, but I was planning on that anyway for this exact reason -- I'd like a way to pre-load things in the background of AssetManager, since stuff like the TXTs is used extensively in-game and can be pre-loaded while the player is starting the app, selecting a character and joining a game, without locking up assets that are needed ASAP. I can also implement support for bin files for at least a few of the files, since bin files are smaller and significantly faster to read.
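As a rough sketch of what the background pre-load could look like (TxtPreloader and loadTxt are hypothetical names, not the actual AssetManager API):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class TxtPreloader {
  private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
    Thread t = new Thread(r, "txt-preloader");
    t.setDaemon(true); // don't block app shutdown
    return t;
  });

  /** Kicks off background reads of the excel/txt files while the UI starts up. */
  CompletableFuture<Void> preload(List<String> txtFiles) {
    return CompletableFuture.allOf(txtFiles.stream()
        .map(file -> CompletableFuture.runAsync(() -> loadTxt(file), executor))
        .toArray(CompletableFuture[]::new));
  }

  private void loadTxt(String file) {
    // hypothetical: open the MPQ entry, decompress, parse into the excel table cache
  }
}
```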

I am considering moving the MPQ lib into its own module and adding at least selective native code support for some of the more intensive operations (decompression and exploding), since those appear to be the biggest holdups and seem worth handling natively.

collinsmith commented 5 years ago

I quickly tried moving the Riiablo.files initializer to a separate thread, and this fixed the startup slowdown, but locks are needed while required assets are loading (it crashes if I reach a point where a needed asset hasn't loaded yet). Without the separate thread, startup takes maybe 4-5 seconds on android and maybe 0.5 seconds or less on desktop.
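A minimal sketch of the kind of guard I mean, using a CountDownLatch (class and method names hypothetical):

```java
import java.util.concurrent.CountDownLatch;

final class FilesInitializer {
  private final CountDownLatch loaded = new CountDownLatch(1);

  void startAsync() {
    Thread t = new Thread(() -> {
      loadTxtTables();    // the slow MPQ reads happen off the render thread
      loaded.countDown(); // signal that the Riiablo.files-style data is ready
    }, "files-init");
    t.setDaemon(true);
    t.start();
  }

  /** Called before any code that touches the tables; blocks instead of crashing. */
  void awaitLoaded() throws InterruptedException {
    loaded.await();
  }

  private void loadTxtTables() { /* hypothetical: read monstats.txt, etc. */ }
}
```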

Also, I have considered just extracting the excel files the first time the program runs on Android; I think this would be the easiest to implement and most effective speedup, and I'm already doing it on Android for the audio files, since Android has some issues reading audio input streams. I don't want to rely on that approach, though -- if the read speeds can be increased, I'd like to look into it first to at least gauge the time sink.
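A sketch of the extract-once idea using the LibGDX file API (ExcelCache and readFromMpq are hypothetical; the extract step itself is stubbed):

```java
import com.badlogic.gdx.Gdx;
import com.badlogic.gdx.files.FileHandle;

final class ExcelCache {
  /** Returns a plain-file copy of an MPQ entry, extracting it on first run only. */
  static FileHandle extractIfAbsent(String mpqPath) {
    FileHandle local = Gdx.files.local("cache/" + mpqPath.replace('\\', '/'));
    if (!local.exists()) {
      byte[] bytes = readFromMpq(mpqPath); // hypothetical: decompress via the MPQ reader
      local.writeBytes(bytes, false);      // write the uncompressed copy once
    }
    return local; // subsequent runs read the extracted file directly
  }

  private static byte[] readFromMpq(String mpqPath) {
    throw new UnsupportedOperationException("stub");
  }
}
```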

collinsmith commented 5 years ago

I started rewriting and redesigning the mpq module using MappedByteBuffer and I'm seeing a potential ~10% speedup, but there's an awful amount of variance during testing (sometimes it's slower). I'm also still seeing an inordinate amount of time spent in pkexplode reading monstats.txt.
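For context, the mapping itself is just standard NIO; something along these lines:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MpqMapping {
  /** Maps the whole archive read-only; sector reads become absolute gets on the buffer. */
  static MappedByteBuffer map(Path mpq) throws IOException {
    try (FileChannel channel = FileChannel.open(mpq, StandardOpenOption.READ)) {
      return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    }
  }
}
```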

I think I'm just about ready to write some native code to try to speed up the low-level explode/decompress/decrypt operations. I can't say whether there will be much, if any, improvement though; I'll look for some benchmarks.

collinsmith commented 5 years ago

I added another module, mpqlib, that I will try to use jnigen with. I'm out of my depth here, so I think I'll have to put this on hold. The C pkexplode algorithm and JNI are a bit much to deal with at once -- I'll need to convert the code to not use TCmpStruct, or alter it to work with Java Buffers.
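For the record, the binding I had in mind is roughly a plain native declaration like the following; the library name and signature are hypothetical, and jnigen would generate the C glue from something similar:

```java
final class NativePkExplode {
  static {
    System.loadLibrary("mpqlib"); // hypothetical library name built from the mpqlib module
  }

  /**
   * Explodes (PKWARE DCL decompresses) src into dst and returns the number of
   * bytes written, or a negative error code. Buffers are plain byte[] so the
   * TCmpStruct-style state can live entirely on the native side.
   */
  static native int explode(byte[] src, int srcLength, byte[] dst, int dstCapacity);
}
```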

collinsmith commented 3 years ago

I've been experimenting with a more optimized MPQ library using netty ByteBuf and MappedByteBuffer. I'm going to try to nail down a decent API to run with, and then eventually we can shift over to a Rust implementation if needed. It's still too early to gauge performance deltas, but so far it seems a bit better.

collinsmith commented 3 years ago

MappedByteBuffer was a bit slower for loading mpqs -- maybe because of buffering issues. I may give up on this iteration for now and come back to it later. Looking through the implementations, I realize that DCC can probably be improved significantly.

DCC files are read using an InputStream, specifically an MPQInputStream. DCC files are already compressed, though (and thus stored raw in the MPQ files), so it should be possible to perform a faster read on those files specifically -- potentially even passing a direct buffer containing the bytes to the GPU. I think this might be a better area to target than the MPQ library for this specific issue.


After looking into more dcc files, most/all of the dcc within d2char are compressed, but objects/missiles/monsters/overlays don't appear to be.
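For the stored-raw case, a minimal sketch of what a direct-buffer read could look like; the class name and the offset/size parameters are hypothetical stand-ins for whatever the block table provides:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class RawDccRead {
  /**
   * Copies an uncompressed-in-MPQ file straight into a direct buffer,
   * skipping the sector/decompression path entirely.
   */
  static ByteBuffer readRaw(FileChannel archive, long fileOffset, int fileSize)
      throws IOException {
    ByteBuffer direct = ByteBuffer.allocateDirect(fileSize);
    while (direct.hasRemaining()) {
      if (archive.read(direct, fileOffset + direct.position()) < 0) break; // EOF guard
    }
    direct.flip();
    return direct; // can be handed to the decoder (or eventually the GPU) as-is
  }
}
```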

collinsmith commented 3 years ago

Finished rewriting the ByteBuf-operable MPQ library's #readByteBuf function, and I'm seeing far worse performance compared to the old library (for a single file read, though). It doesn't make much sense, so I may just need a larger data set, which is hard to produce outside of the engine itself. I may commit what I've done so that I, or someone who knows more about I/O throughput, can come back to it later. The benefit of the new library is complete pooling of buffers and decompressors, which means no locking or synchronization code should be needed. I'm pretty happy with it actually, which is why I'm surprised it's far slower considering the performance/memory optimizations.

Long term, I may need to create an MPQ library port which passes byte[] arrays, operates on those, and uses BufferUtils#copy and other LibGDX natives to perform buffer copying, abandoning this ByteBuf implementation. I didn't expect ByteBuf to approach byte[] performance, but it's surprising that it's far worse than ByteBuffer.
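A small sketch of that byte[]-first shape, assuming the byte[] overload of BufferUtils#copy that copies at the buffer's current position (a plain ByteBuffer#put(byte[]) would be the fallback):

```java
import com.badlogic.gdx.utils.BufferUtils;
import java.nio.ByteBuffer;

final class NativeCopy {
  /**
   * Decompresses into a plain byte[] and bulk-copies it into a direct buffer
   * via LibGDX's native copy, avoiding per-byte buffer accessors.
   */
  static ByteBuffer toDirect(byte[] decompressed, int length) {
    ByteBuffer direct = ByteBuffer.allocateDirect(length);
    BufferUtils.copy(decompressed, 0, length, direct); // copies at direct's position
    direct.flip();
    return direct;
  }
}
```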

collinsmith commented 3 years ago

I noticed that there were some slowdowns caused by my logging framework, which did not pool messages or events, did not post messages asynchronously, and did not defer some operations until logging was guaranteed to occur. Once I implemented all of the above, I'm seeing about a 30% improvement in the 10k-iteration read, beating out the old implementation!
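Not the project's actual logging code, but the general shape of the changes (deferring message formatting and moving the I/O off-thread) looks something like this:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

final class AsyncLogger {
  private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
  private volatile boolean debugEnabled = true;

  AsyncLogger() {
    Thread writer = new Thread(() -> {
      try {
        for (;;) System.out.println(queue.take()); // a real impl writes to its sink
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "log-writer");
    writer.setDaemon(true);
    writer.start();
  }

  /** The message is only built if the level is enabled, and I/O happens off-thread. */
  void debug(Supplier<String> message) {
    if (!debugEnabled) return;  // deferred: formatting skipped when disabled
    queue.offer(message.get()); // non-blocking hand-off to the writer thread
  }
}
```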

collinsmith commented 3 years ago

Still moving forward and making steady progress. I've decided to make optimizations for ByteBuf and will try to use the more optimal MPQFileHandle#readByteBuf() to grab data directly from MPQ files within codecs. MPQInputStream has been significantly improved and can now provide an InputStream backed by the direct memory, which should provide performance benefits for decoding DCC files, a major area that I wanted to improve (this will also specifically help with video files later on because they can be decoded in much the same way).

Due to the nature of decoding, decoded MPQ files are inherently buffered, so there is no point in using much of the FileHandle API, and I'm going to mark most of it as deprecated so that the new stuff is used by default from now on. I want to look into using ByteBufInputStream as the input stream impl, since it should make reading the data much easier in the case where the entire ByteBuf need not be retained for the lifetime of the decoded asset (or where the codec performs a transformation of the input data).
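A sketch of the ByteBufInputStream idea, assuming netty's wrapper with releaseOnClose so the pooled buffer isn't retained past decoding (the decode method and codec hand-off are hypothetical):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufInputStream;
import java.io.IOException;
import java.io.InputStream;

final class DecodeFromByteBuf {
  /**
   * Wraps the decoded MPQ bytes so codecs can stream them; releaseOnClose=true
   * frees the pooled ByteBuf once decoding is done and the data isn't retained.
   */
  static InputStream asStream(ByteBuf decoded) {
    return new ByteBufInputStream(decoded, true);
  }

  static void decode(ByteBuf decoded) throws IOException {
    try (InputStream in = asStream(decoded)) {
      // hypothetical: hand the stream to the DCC/DC6 codec here
      in.read();
    }
  }
}
```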

Something I wanted to look into was "committing" the decoded MPQ file bytes into direct memory buffers for things like sound files and then releasing the heap buffers. I have no idea if this will provide any tangible benefits, but it may for codecs which will be passed to a native processor anyway (like audio, textures, etc.) and for which the MPQ reader was used only to decompress the asset.

collinsmith commented 3 years ago

I noticed that the asset loading algorithm was taking a long time on my tablet, so I ran some benchmarks to compare mpq_bytebuf (the newer, currently dormant implementation) and mpq. mpq_bytebuf was quite a bit slower, and some I/O tests showed that the bottlenecks appear to be CPU-bound decoding of the MPQ sectors; specifically for dcc files, pkexplode was very slow.

I ran a profiler on android and modified pkexplode to copy using the backing arrays (vs ByteBuf#get()) and got a fairly significant performance boost, bringing the algorithm in line with the mpq package's performance. My loop still takes longer than I would like, with roughly 60% of the time spent within pkexplode (that total includes ~30% spent loading the MPQ, so of the decoding alone it's perhaps more like 80-90%). I'm not really sure where to go from here without improving this algorithm -- mobile processors can't decode the data fast enough, resulting in sizable loading times for decoding dcc. If this were ported and I could run pkexplode via JNI, that might be a solution, but it may require a lot of work.
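The backing-array trick is basically: when a heap ByteBuf is available, index its array directly instead of going through per-byte get() calls. A rough illustration (not the actual pkexplode loop):

```java
import io.netty.buffer.ByteBuf;

final class BackingArrayCopy {
  /** Copies length bytes out of a ByteBuf, preferring direct array indexing. */
  static void copyTo(ByteBuf src, int srcIndex, byte[] dst, int dstOffset, int length) {
    if (src.hasArray()) {
      // heap buffer: one System.arraycopy over the backing array
      System.arraycopy(src.array(), src.arrayOffset() + srcIndex, dst, dstOffset, length);
    } else {
      // direct buffer: fall back to a bulk getBytes (still avoids per-byte get())
      src.getBytes(srcIndex, dst, dstOffset, length);
    }
  }
}
```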

collinsmith commented 3 years ago

Very rudimentary, but the changes I've made so far are getting pretty nice results. Hopefully this continues.

```
2021-08-13 19:44:38.634 3130-3447/com.riiablo I/System.out: mpq read in 39ms
2021-08-13 19:44:38.849 3130-3447/com.riiablo I/System.out: mpq_bytebuf read in 29ms
2021-08-13 19:44:38.876 3130-3447/com.riiablo I/System.out: mpq_bytebuf2 read in 10ms
```

For the most part, I've shifted from overusing the buffer interfaces to mutating the underlying byte[] directly for stuff like pkexplode, which is the biggest culprit for slowdowns that I've noticed. I'm heavily relying on an android benchmark app to instrument and test these code-base changes -- since that platform is much more sensitive to changes.

I think the idea that I'm going to run with -- at least for non-streamed data -- is to read and cache the entire sector data in the MPQ file handle and then decompress across multiple threads, 1 sector per thread (up to n threads). This should be pretty safe and may give me more control to read data more selectively, e.g., read a file handle into memory and defer some of its decompression until needed. If I'm being honest, this may not happen anytime soon, but it would be nice to only decompress the required frames from a dcc and defer the others until they are needed, for some additional performance. No disk activity is required, only decompressing/decoding the data.
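A rough sketch of the per-sector fan-out; the decompress call is a stubbed stand-in for pkexplode/zlib:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelSectors {
  /** Decompresses cached sectors in parallel, one task per sector. */
  static List<byte[]> decompressAll(List<byte[]> compressedSectors, int threads)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<byte[]>> tasks = new ArrayList<>();
      for (byte[] sector : compressedSectors) {
        tasks.add(() -> decompress(sector)); // stand-in for pkexplode/zlib per sector
      }
      List<byte[]> out = new ArrayList<>();
      for (Future<byte[]> future : pool.invokeAll(tasks)) out.add(future.get());
      return out; // results are in sector order
    } finally {
      pool.shutdown();
    }
  }

  private static byte[] decompress(byte[] sector) { return sector; } // stub
}
```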


pkexplode seems to have converged with mpq_bytebuf after optimizing it about as much as I can. I might play around with it for a bit, but I'm out of my depth and it's pretty frustrating. The implementation I'm currently using in mpq and mpq_bytebuf is a mess compared to the C code it should be based on, but that doesn't seem to have much of a performance impact. My only hope is that multithreading the decoding may help.


Using an ExecutorService to decode multiple sectors in parallel improved the decoding speed slightly, 8 threads 30ms -> 20ms, 4 threads 30ms -> 25ms. Not the performance improvement I wanted, but this may provide a way to selectively decode as necessary, which should give decent performance gains.

E.g., data\global\chars\pa\lg\palgmedtn1hs.dcc is 36K; to decode that file with minimal work, I'd need to decode and cache the sector offsets, save the compressed sectors, and then decode the blocks as needed. Something like a character might be tricky because of directions/animations, but spreading that load across multiple frames may help (ideally <10ms [just a guess] to load all assets for a given frame's worth of data). This would require turning MPQFileHandle into some kind of singleton asset (the same instance always returned for a given file name) so that blocks are preserved if a handle is requested multiple times -- or binding them to the dcc, which retains the compressed sectors.
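The singleton-handle idea is essentially interning handles by file name so cached sectors survive repeated lookups. A sketch, where MpqFileHandle is a simplified stand-in for the real class:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class HandleCache {
  private final ConcurrentMap<String, MpqFileHandle> handles = new ConcurrentHashMap<>();

  /** Always returns the same handle instance for a given file name. */
  MpqFileHandle resolve(String fileName) {
    return handles.computeIfAbsent(fileName, MpqFileHandle::new);
  }

  /** Stand-in for the real handle; would retain decoded sector offsets/blocks. */
  static final class MpqFileHandle {
    final String fileName;
    MpqFileHandle(String fileName) { this.fileName = fileName; }
  }
}
```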

collinsmith commented 3 years ago

Improvements to pkexplode and more efficient decoding have brought ~130ms down to ~75ms (~60ms when async) for the 39 DCC files representing the character select screen. This is a best-case scenario and is only implemented for DCC which use implode only. It's a good enough starting point that translating it to a lower level at some point may improve speed further, and should be easier now that the OO style has been removed from pkexplode.

collinsmith commented 2 years ago

31a60c37d82aeff060dc6fd3f2f1fe00105b7e04 replaces com.riiablo.mpq_bytebuf with an improved version that includes the speedups. The above test can decode 1 sector of the above files in 6-7ms after optimization and once the JIT takes over. Integrating the changes into the engine will take a bit of time before there are real-world results, but it looks promising.