Alcaro / Flips

Floating IPS is a patcher for IPS and BPS files.

Support files larger than 256MB for BPS creation? #16

Open NintendoManiac64 opened 5 years ago

NintendoManiac64 commented 5 years ago

After doing some testing via tones generated by Audacity, I've discovered that Flips seems to throw up a "files are too big" error on any file that's 256MB or larger when trying to create a BPS file...which is kind of problematic in the era where games can be several gigabytes in size.

Also interestingly, I was originally able to create a BPS patch from a 900MB game using Flips v1.12. When I was then unable to create a patch for a 2.5GB game, I updated to Flips v1.31, only to discover that it couldn't even create the BPS patch for the 900MB game, let alone the 2.5GB one. Yet after going back to Flips v1.12, it too was unable to create a BPS patch for the 900MB game...

(speaking of older versions, I can't help but notice that v1.31 removed the ability to create speed-optimized BPS patches)

Alcaro commented 5 years ago

Flips is perfectly happy with files above 256MB, as long as it can allocate enough memory.

Unfortunately, 'enough memory' is quite a lot; Flips needs 5 times the sum of source/target files, so your 900MB files (assuming source is 900MB too) would need nine gigabytes of RAM. And that's with a 2GB limit on total input size, which I can only raise by raising RAM use to 9 times input sizes; when I made Flips 1.30 back in 2015, using 36GB RAM was completely unthinkable, so I didn't even try.
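
As a rough illustration of the arithmetic above, the sketch below estimates the RAM needed for patch creation as five times the combined input size. The function name and the decimal MB/GB units are assumptions made for this example; nothing here is taken from the Flips source.

```python
MB = 10**6  # decimal units, assumed so the numbers match the comment above
GB = 10**9

def estimated_ram_bytes(source_size, target_size, bytes_per_input_byte=5):
    """Estimate RAM for BPS creation as a multiple of the total input size."""
    return bytes_per_input_byte * (source_size + target_size)

# Two 900MB files: 5 * 1800MB
print(estimated_ram_bytes(900 * MB, 900 * MB) / GB)  # 9.0
```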

And the official Windows EXE is 32bit, and limited to 2GB RAM. It's theoretically able to process 400MB source+target, but may run into trouble a bit below that.

Long-term, the best fix would be creating a new patch creator function that doesn't need as much RAM. But I've got a lot to do, and can't promise it'll be done anytime before 2023.

Short-term, the easiest fix would be to simply switch the official releases to 64bit. 32bit was chosen because Windows XP can't run 64bit programs, but there's no reason to care about XP these days. I have no plans to make a proper release anytime soon, but if it helps, here's a 64bit binary of current master (I'll delete it in a week or so). https://floating.muncher.se/temp/flips.exe

As for the speed-optimized mode, 1.30 speeds up the slow creator a lot, so I deemed the new slow one fast enough for all practical purposes (patches are for distribution, which isn't needed very often), and hid the fast one in the GUI, for simplicity. If you still need it, it's still available from the command line; it doesn't have the 2GB limit, so if your 2.5GB files don't move anything around within the file, it should yield good enough patches.

NintendoManiac64 commented 5 years ago

So trying with the 64bit version, I was able to successfully create a BPS patch for a 659MB file with ~12GB of physical RAM free, even with (gasp) my pagefile disabled. Yet I got the "files are too big" error with a 1.28GB file on the same system with ~14GB of physical RAM free plus a pagefile configured to 16GB-32GB (not a typo) and located on a Samsung Pro SSD.

Is this the correct and expected behavior?

Oh and for reference, it took 4min 45sec to create the BPS patch for that 659MB file on my 4.5GHz Pentium G3258 (2c/2t Haswell) - even if I could create patches for larger files, I can't imagine how long it'd take, and even the newest 5GHz Coffee Lake-based CPUs are only going to be maybe 20% faster at best, since Flips relies on single-threaded performance (10% gain from IPC, 10% gain from clock rate).

Alcaro commented 5 years ago

That is expected behavior for two 1.28GB files. Two 1.28GB files = 2.56GB = hits the 2GB limit I mentioned.

Correct ... probably not. It was good enough five years ago (and the best I could do five years ago), but files have grown since then, and Flips hasn't kept up.

Single-threaded ... Flips can use OpenMP, which speeds it up a bit, at the cost of adding a bunch of DLLs. I want Flips to remain a single exe, even more than I want it fast.

But if I do some ugly tricks, I can combine the DLLs with the EXE... https://floating.muncher.se/temp/flipsmp.exe

Speed sounds fairly reasonable to me. Kinda low, but maybe your Pentium is slower than my i7-3770. The time taken is roughly proportional to file size, so 1.28GB files would be about ten minutes for you (except it wouldn't work at all, due to the 2GB limit).

If you want to patch >1GB files, I recommend using Xdelta instead of Flips. It was made by people smarter than me-five-years-ago; its patches are roughly the same size as those from Flips, and it's ten times faster. If you need BPS, go command line and use --bps-linear, or wait a year or eight until I've rewritten the BPS creator properly.

NintendoManiac64 commented 5 years ago

「Two 1.28 files = 2.56 GB = hits the 2GB limit I mentioned.」

But this was with the 64bit 1.40-pre EXE that you posted, not with the v1.31 EXE...shouldn't this 2GB limit not exist then?

I also tested with a 988MB file and confirmed via the Windows task manager that the 64bit flips.exe was using 10+GB of memory, so the 64bit memory address limits do indeed seem to be working.


「maybe your Pentium is slower than my i7-3770」

Oh no, my Pentium should be considerably faster in terms of single-threaded performance as it not only has an IPC advantage (~5-10% typically, 20-30% in emulation workloads) but a clockspeed one as well (3.9GHz max turbo for your i7, 4.5GHz for my overclocked Pentium).

For reference, Core-based Pentiums are basically nothing more than i3's with fewer threads and no AVX.

I admittedly tried Xdelta via "DeltaUI", but it was creating much larger files for that previously-mentioned 900MB game (read: patches in the multi-hundred MB sizes rather than ~7MB for speed-focused BPS patches)...but for all I know I was probably doing something wrong.

Alcaro commented 5 years ago

No, 32bit has two limits. Both are caused by the fact that, for each byte in the two input files, Flips must store that byte, plus a file offset (latter being a 32bit integer).

The first limit is that Flips needs five bytes of RAM per input byte, and 32bit programs are limited to using 2GB RAM. (32bit programs theoretically support 4GB, but I forgot to enable that for Flips 1.30 - I'll do better for future programs.) 2GB / 5 = maximum ~400MB total file size (minus a bit due to memory fragmentation). Switching Flips to 64bit removes that limit.
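
The division above can be sketched as a small helper. This is only an illustration of the stated rule (five bytes of RAM per input byte); the function name and decimal units are assumptions, and the 4GB line simply applies the same rule to the theoretical 32bit maximum.

```python
MB = 10**6  # decimal units, assumed for readability
GB = 10**9

def max_total_input_bytes(ram_limit, bytes_per_input_byte=5):
    """Largest source+target size that fits in a given RAM budget,
    ignoring the losses from memory fragmentation."""
    return ram_limit // bytes_per_input_byte

print(max_total_input_bytes(2 * GB) // MB)  # 400
print(max_total_input_bytes(4 * GB) // MB)  # 800
```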

The other limit is that, beyond 2GB, the file offsets no longer fit in 32bit integers. (Theoretically, 32bit fits 4GB files, but some of Flips' internal calculations require an extra bit, making the limit 2GB.) I could make the offsets 64bit, but that wouldn't help very much unless you have 36GB RAM.
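
A minimal sketch of this second limit, under two assumptions that are inferred from the thread rather than taken from the Flips source: the usable offset range is 2^31 bytes for source and target combined, and the per-byte RAM cost is one byte of data plus one offset (which would give 5 bytes with 32bit offsets and 9 with 64bit ones). All names are hypothetical.

```python
MB = 10**6
GB = 10**9
OFFSET_LIMIT = 2**31  # assumed: 32bit offsets with one bit reserved

def fits_in_32bit_offsets(source_size, target_size):
    """True if the combined input stays below the assumed 2GB offset limit."""
    return source_size + target_size < OFFSET_LIMIT

def bytes_per_input_byte(offset_bits):
    """Assumed cost model: 1 data byte plus one offset of the given width."""
    return 1 + offset_bits // 8

print(fits_in_32bit_offsets(988 * MB, 988 * MB))    # True
print(fits_in_32bit_offsets(1280 * MB, 1280 * MB))  # False
print(bytes_per_input_byte(32), bytes_per_input_byte(64))  # 5 9
print(bytes_per_input_byte(64) * 4 * GB // GB)      # 36
```

Under this model, the 988MB test mentioned earlier fits while two 1.28GB files do not, and 4GB of total input with 64bit offsets lands on the 36GB figure.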

I have a few ideas for how to fundamentally replace the entire thing, allowing it to work without tracking zillions of file offsets, but I don't know how well those ideas would work in practice. Even if it works out, it will take quite a while to implement.

I'll admit I don't know how processors compare. All I know is i7 is fast and non-i is slow, for some definition of 'slow' that may not necessarily be the correct one here.

Xdelta gave near-identical sizes to Flips for the files I tried, but perhaps you're hitting some bad case in Xdelta. If you share your files, I'll use them for performance tests, once I get to that point.

NintendoManiac64 commented 5 years ago

Alright, thanks for the explanation regarding 32bit.

...but I would still like to understand why my 1.28GB files did not work on the 64bit 1.40-pre EXE that you provided. I mean, it's 64bit, not 32bit - why would 32bit limitations even be applicable for such a situation?

As for "future programs", is it not possible for a future version of Flips to be compiled with both a 64bit and a 32bit EXE bundled in a single archive as is done for several other applications? (CPU-Z in particular comes to mind).

And regarding i7, it's definitely not that simple at all. The general rule is that a higher number means more cores and threads, but cores and threads don't help single threaded applications. The higher number CPUs also tend to have higher clockspeeds as well, but not by much, and usually not enough to make them faster than newer-generation parts (which tend to be faster on a per-GHz basis as well as having higher clocks). This is why non-i CPUs are popular for budget emulator users as emulators only really can use two CPU threads (outside of 360/PS3/WiiU stuff), so the Pentium G3258 was VERY popular for emulation considering that it was unlocked for overclocking, allowing outstanding bang-per-buck emulation performance (as an example, refer to the Dolphin 5.0 benchmark: https://forums.dolphin-emu.org/Thread-unofficial-new-dolphin-5-0-cpu-benchmark-results-automatically-updated--45007 where my Pentium G3258 got the exact same time as an i7-4790k as both CPUs are of the same generation and were both overclocked to 4.6GHz).

But the thing is, even a first generation 4core/8thread i7-860 is actually very likely to be beaten by a modern 4core/4thread i3-8100 even in workloads that can use every single CPU thread that you throw at it. And that's before considering that Xeon and Ryzen exist, both of which can also be faster than your i7-3770 (not to mention that i9 is now a thing in order to compete with Ryzen Threadripper).

Alcaro commented 5 years ago

Whether a program is 32bit or 64bit refers to whether it requires hardware support for 64bit math and a 64bit address space. But 64bit programs can still do math on 32bit quantities, if they feel that's enough or better; Flips does, so it's still subject to 32bit overflow. As I said, I could switch it to full 64bit, but that would blow up the already-obscene RAM requirements even more.
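
To make the overflow concrete, here is a small sketch of what happens when a combined size of 2.56GB is squeezed into a signed 32bit integer with two's-complement wraparound. Python integers never overflow, so the wrap is simulated by hand; the helper name is made up, and this only shows why the value cannot be represented, not how Flips detects or reports it.

```python
def to_int32(n):
    """Wrap an integer into the signed 32bit range (two's complement)."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n

total = 2 * 1_280_000_000  # two 1.28GB files, decimal bytes
print(to_int32(total))          # -1734967296
print(to_int32(2 * 659_000_000))  # 1318000000, still representable
```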

Yes, I could provide both 32bit-address-space and 64bit-address-space Flips. But I like how Flips is a single exe with zero dependencies; changing it to two exes will force people to choose which one, and considering how rare 32bit-only hardware is these days, I doubt it's worth the extra confusion.

Regarding processors, understood. Not simple indeed. Software performance depends on ridiculously many factors: clock speed, instructions per clock (even within the same processor, division is slower than addition), operations per instruction (SSE can do four multiplications in one instruction, if the program uses SSE instructions - some programs don't, some can't), results per operation (Flips 1.21 uses a straightforward but slow algorithm, Flips 1.30 is much smarter and needs way fewer operations), threading, operating system (Spectre mitigations, etc), other load on the machine (media players, browsers, etc)... sometimes it's near-impossible to know what's faster and what's just noise.

Back in 2015, I could create a patch from two 590MB files in 90 seconds. OpenMP gives a ~2x speedup on my machine (I use Linux, where software distribution works way differently from Windows, and there's no reason to avoid extra DLLs), so that's 180 seconds without that; your 659MB files are bigger, which accounts for another 20 seconds. No clue where the last 85 seconds would come from.
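
The accounting above can be reproduced with a few lines, assuming (as stated earlier in the thread) that creation time scales roughly linearly with file size. The variable and function names are made up for the example.

```python
def scale_time(seconds, from_size_mb, to_size_mb):
    """Scale a measured time linearly with file size."""
    return seconds * to_size_mb / from_size_mb

openmp_seconds = 90                          # two 590MB files, with OpenMP
single_thread_seconds = openmp_seconds * 2   # undo the ~2x OpenMP speedup
expected = scale_time(single_thread_seconds, 590, 659)
measured = 4 * 60 + 45                       # reported 4min 45sec

print(round(expected))             # 201
print(round(measured - expected))  # 84
```

The unrounded result leaves about 84 seconds unexplained, which matches the rounded 85 above.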