Ability to quickly extract individual entry data without iterating through the entry list

xiaogdgenuine commented 4 years ago

Thanks for providing such a good project! Everything about RAR just works as it's meant to, and my first iOS app will get a lot of enhancements from your work. Thank u so much again!

One little thing though: I don't know if it's a RAR limitation or something we can overcome. Right now we can select and extract individual entries from the archive file using these functions:

extractDataFromFile
extractBufferedDataFromFile

Both accept the filename of the selected entry as a parameter. The problem is that every time I call these functions, they seem to iterate through all entries in the archive to locate the matching entry by filename (with an == comparison, I guess). This slows down extraction, especially if my RAR archive has more than 10K entries...

My app is a comic reader that can open comic books in ZIP or RAR format. Ideally, users can view the pictures as soon as they click on the archive file, so I don't need to extract the entire archive. Individual extraction happens frequently as the user switches comic pages.

In a ZIP archive, each entry has a useful attribute called relative offset of local file header, which basically tells u where on disk to start looking for the entry's content. Once u have that attribute for every entry (by reading them from the Central Directory section of the ZIP archive), u can extract an individual entry very fast.

I'm wondering if unrar has a similar attribute? If not, I have to extract the entire RAR archive ahead of time to get the best user experience, which is a bit of a waste of CPU & storage resources...

And from the screenshot we can see that UnrarKit seems to always use the pattern:
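To make the ZIP side concrete, here is a minimal sketch using Python's standard `zipfile` module, whose `ZipInfo.header_offset` field exposes that per-entry offset. The archive and the `page000.jpg` style names are made up for illustration, and this says nothing about how RAR or UnrarKit store entries.

```python
import io
import zipfile


def build_sample_zip():
    """Create a small in-memory ZIP with stored (uncompressed) entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for i in range(3):
            zf.writestr(f"page{i:03d}.jpg", f"image-bytes-{i}".encode())
    return buf.getvalue()


def index_header_offsets(data):
    """Read the central directory once and map each name to its local header offset."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {info.filename: info.header_offset for info in zf.infolist()}


data = build_sample_zip()
offsets = index_header_offsets(data)

# Each recorded offset points at a local file header, which starts with "PK\x03\x04".
for name, offset in offsets.items():
    print(name, offset, data[offset:offset + 4])
```

With the offsets in hand, a reader can seek straight to an entry instead of walking the entries that come before it.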

open > locate entry > extract > close

Is there any way to keep the archive open? I need to extract entries over and over again.
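The difference between the two patterns can be sketched with Python's `zipfile` as a stand-in (an assumption for illustration only; UnrarKit's API is not shown here, and the function names are hypothetical). The first function repeats open > locate > extract > close for every page, the second opens once and reuses the handle.

```python
import io
import zipfile


def make_archive(page_count):
    """Build an in-memory ZIP with a few fake pages."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for i in range(page_count):
            zf.writestr(f"page{i}.jpg", f"page-{i}".encode())
    return buf.getvalue()


def read_pages_reopening(data, names):
    """Open and close the archive once per requested page."""
    pages = []
    for name in names:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:  # directory re-read each time
            pages.append(zf.read(name))
    return pages


def read_pages_single_open(data, names):
    """Open the archive once and reuse the handle for every page."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:  # directory read only once
        return [zf.read(name) for name in names]


archive = make_archive(5)
wanted = ["page3.jpg", "page0.jpg", "page4.jpg"]
same_result = read_pages_reopening(archive, wanted) == read_pages_single_open(archive, wanted)
```

Both return the same bytes; the second simply avoids paying the open and directory-read cost on every page turn.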

Reference: https://en.wikipedia.org/wiki/Zip_(file_format)#Central_directory_file_header

abbeycode commented 4 years ago

Hey, thanks for the feedback! I use UnrarKit and UnzipKit in my own comic reader, for what it's worth. I'm curious if you're noticing specific performance problems, or if it's just a lot of logging. The logging shouldn't materially slow down production apps, but if you don't turn it off in Xcode, debugging can definitely become slow.

It also isn't scanning through the whole file every time to find the file headers. It does read from RAR's equivalent of the ZIP format's central directory. I could imagine reading the directory into memory and then using that to seek to files, and it might potentially save some amount of time, but at the expense of a larger memory footprint.

If you do have a specific case that's taking a long time, please send it my way. I'm sure there are ways that the library could be made more efficient, but I definitely follow the path of avoiding premature optimization. Can you send me an archive that's taking longer to extract a file using UnrarKit than another library or app?

xiaogdgenuine commented 4 years ago

@abbeycode Thanks for the info. U are right, I shouldn't jump to conclusions too early. I will try to profile in Release mode. My test RAR archive contains about 2k pictures and is quite large, so I think it's a bit hard to upload it and send it over to u.

I will share my profiling results later. If it's really due to the logging then we can close this issue. If it's not, I will try to generate a small RAR archive with many empty files in it and send it to u.

xiaogdgenuine commented 4 years ago

@abbeycode I ran some rough profiling, and the result is really interesting. I used a test RAR file with about 900 pictures in it. Here is the result:

Extract entries one by one, and save to a disk file:
Extract entries one by one, and don't save to disk file:
Extract all entries at once, save to disk

Here is my test archive (190MB). It doesn't apply any compression; all images are just stored in the RAR file: https://drive.google.com/file/d/1wuKlNffCmZDkLKF2H7fvaBinRSjOxuLm/view?usp=sharing

All code was run in Release mode on a 2015 MacBook Pro, 2.2 GHz.

It's really a huge difference and it did surprise me... The time also increases as the archive's entry count increases, since u need to iterate through more entries each time u try to extract one.

Also, somehow I failed to suppress the verbose logs with the sudo log config --mode "level:default" command, so I used this patch to disable the logs: https://github.com/aonez/UnrarKit/commit/7a7c6d2716f41d1cc040a7b3a1563a6d2e37d4b7
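A back-of-the-envelope count shows why this grows so quickly. It assumes (this is a guess about the lookup, not something confirmed from the unrar source) that extracting the k-th entry by filename visits k headers before it finds a match, and compares that with one listing pass plus one header per extraction. The function names are hypothetical.

```python
def headers_visited_with_scan(entry_count):
    """Extracting entry k (1-based) visits k headers; sum that over every entry."""
    return sum(range(1, entry_count + 1))


def headers_visited_with_offsets(entry_count):
    """One listing pass to record offsets, then one header read per extraction."""
    return entry_count + entry_count


scan_900 = headers_visited_with_scan(900)        # 900 * 901 / 2 = 405450
direct_900 = headers_visited_with_offsets(900)   # 900 + 900 = 1800
print(scan_900, direct_900)
```

Under that assumption the scan-based approach is quadratic in the entry count, while the offset-based one stays linear.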

xiaogdgenuine commented 4 years ago

Instead of using the compiled version of UnrarKit, I'm now embedding the UnrarKit project into my project directly, so I can read & modify the source code. I will try to understand the code and see if I can add an offset attribute to the entry struct and use it for extraction in place of the filename.

I will come back to u if I need some help, thank u!

xiaogdgenuine commented 4 years ago

I made it!!!

I record the header offset of each entry when we iterate through the entire entry list (which only happens once), and use that header offset to ask unrar to seek to the correct position in the archive, then do the normal extraction.

The total time saving for random individual extraction is remarkable now: it's less than 10s. I think we can do even better if we keep the RAR archive open until all individual extraction operations are finished. I will try that later.

I will submit a PR or something later to show u how I did it, but I only know a little bit about Objective-C & C++, so the code will probably be terribly ugly. :P
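The idea can be sketched on a toy container format (a 6-byte header holding the name length and data length, followed by the name and the payload). This is not the RAR format and not the code from the PR; every name below is hypothetical and only illustrates "record offsets in one pass, then seek".

```python
import io
import struct

HEADER = struct.Struct("<HI")  # name length (2 bytes), data length (4 bytes)


def write_toy_archive(entries):
    """Serialize (name, data) pairs as header + name + payload records."""
    buf = io.BytesIO()
    for name, data in entries:
        raw_name = name.encode("utf-8")
        buf.write(HEADER.pack(len(raw_name), len(data)))
        buf.write(raw_name)
        buf.write(data)
    buf.seek(0)
    return buf


def build_offset_index(archive):
    """Walk every header once and remember where each one starts."""
    index = {}
    archive.seek(0)
    while True:
        offset = archive.tell()
        head = archive.read(HEADER.size)
        if len(head) < HEADER.size:
            break  # end of archive
        name_len, data_len = HEADER.unpack(head)
        name = archive.read(name_len).decode("utf-8")
        index[name] = offset
        archive.seek(data_len, io.SEEK_CUR)  # skip the payload without reading it
    return index


def extract_at(archive, offset):
    """Seek straight to a recorded header offset and read that entry's payload."""
    archive.seek(offset)
    name_len, data_len = HEADER.unpack(archive.read(HEADER.size))
    archive.seek(name_len, io.SEEK_CUR)
    return archive.read(data_len)


toy = write_toy_archive([("a.jpg", b"AAA"), ("b.jpg", b"BBBBB"), ("c.jpg", b"C")])
index = build_offset_index(toy)          # the single listing pass
page_b = extract_at(toy, index["b.jpg"])  # no scan over earlier entries
```

The index costs one small integer per entry, which is the memory-for-time trade mentioned earlier in the thread.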

abbeycode commented 4 years ago

That’s great. I can work with you to refine your PR, but since the point of this change is a performance gain, I’d need to see a unit test up front that I can use to compare the new approach to the old approach. There are plenty of examples already in the codebase. I use the RAR command line tool to generate archives with large files and with large numbers of files.

This Apple article (specifically the “Write a Performance Test” section) can show you how to write a performance test.

I can work with you to refine the PR, or if you’d prefer, I could ultimately merge it into an intermediate branch and refine it myself.

I look forward to seeing what you come up with!

xiaogdgenuine commented 4 years ago

@abbeycode I just pushed my PR. It would be great if u could spend some time helping me refine & review it. It's my first time trying to modify code at such a low level, and if nothing is broken that would be a miracle to me haha :P.

Anyway, the PR is here: https://github.com/abbeycode/UnrarKit/pull/88

I tested with a couple of my test archives (solid, single volume, multi-volume) and all work fine, but u definitely understand the RAR format better than I do, so point it out if I missed something. Thanks!

abbeycode / UnrarKit

Ability to quickly extract individual entry data without iterating through the entry list #87