arfoll / unrarall

bash script to unrar everything and clean up in a given directory
GNU General Public License v3.0
261 stars 68 forks

Don't clobber files on extraction and skip already extracted files #40

Closed Afforess closed 6 years ago

Afforess commented 7 years ago

This PR is intended to implement the features outlined in #37, and when complete should obsolete my previous PR, #30. ~~I'm leaving #30 open for now as it is a working implementation, while this is presently a work-in-progress.~~ *Not a WIP anymore.*

TODO:

The files not being clobbered doesn't happen exactly the way outlined in #37; it uses mv's -b (backup) feature. When the file foo already exists and mv -b bar foo is called, mv first renames the existing foo to foo~ as a backup and then moves bar into place, so the original file is preserved rather than overwritten. The backup suffix can be overridden by environment variables; see man mv for details. This approach seems significantly more maintainable than trying to generate a unique suffix and move the file to that location ourselves, when mv can handle it for us. The downside is that it's not possible to issue a warning for the renamed file, because mv doesn't tell us about it. mv -v verbose output shows the backup rename, but parsing it seems like a significant burden for such a tiny gain. Thoughts @delcypher, others?
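For reference, a minimal illustration of the mv -b behaviour described above (GNU coreutils; the file names are just placeholders):

```sh
# Illustration only: GNU mv with --backup/-b when the destination already exists.
echo "old contents" > foo
echo "new contents" > bar
mv -b bar foo      # the existing foo is backed up to foo~, then bar replaces foo
cat foo foo~       # foo -> "new contents", foo~ -> "old contents"

# The backup suffix is configurable, e.g. via the SIMPLE_BACKUP_SUFFIX
# environment variable (see `man mv` for the full backup options):
#   SIMPLE_BACKUP_SUFFIX=.orig mv -b bar foo   # backup would be foo.orig
```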

I'll hopefully tackle the remaining major features in the next few days.

delcypher commented 7 years ago

@Afforess Thanks for tackling this :+1:

Afforess commented 7 years ago

I think the preliminary features are done. There may be some code cleanup/tightening needed, so feel free to call out anything spotted. I didn't try to make the --skip-if-exists flag use the file size / checksum / anything sane; it just tests whether every file in the archive already exists. This is not ideal, but the disparity in the output between unrar / 7z / rar may preclude any intensive checks without a rewrite. If checksum integrity is needed/desired, I think the files would have to be extracted each time and compared. I'm not inclined to favour this approach, as my own personal setup runs this from cron, and re-extracting all the files in a very large media directory just to compare checksums sounds like a terrible way to use my CPU cycles. Alternatively, it may be possible to scrape the file size out of the rar archive without extraction using unrar l, but I am not sure how valuable that is.
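As a rough sketch of that existence-only check (not the PR's actual code; it assumes GNU bash and unrar's bare-list command `lb`):

```sh
#!/usr/bin/env bash
# Skip extraction if every file listed in the archive already exists in $dest.
archive="$1"
dest="${2:-.}"

all_exist=1
while IFS= read -r name; do
    [ -e "$dest/$name" ] || { all_exist=0; break; }
done < <(unrar lb "$archive")   # lb = bare listing of archived names, no extraction

if [ "$all_exist" -eq 1 ]; then
    echo "skipping $archive: all files already present in $dest"
else
    echo "would extract $archive"
fi
```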

On the plus side, unrarall now uses temporary directories inside the output/cwd directory to store intermediate extraction results, which means interrupted/broken archives should not contaminate the file listing and falsely trick --skip-if-exists. On the down side, the safe-move feature means that --skip-if-exists will not work if unrarall is extracting from multiple rar files and has to choose alternate names. Nothing short of iterating over all the files and comparing checksums would work there.
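The temp-dir flow is roughly the following (a simplified sketch, not the script's exact code; the archive name and temp-dir prefix are illustrative):

```sh
# Extract into a temp dir next to the output so a broken/interrupted extraction
# never leaves partial files in the real directory, then safe-move the results.
tmpdir=$(mktemp -d ./unrarall-tmp.XXXXXX) || exit 1
if unrar x "archive.rar" "$tmpdir/"; then
    # mv -b backs up any existing destination entry instead of clobbering it
    find "$tmpdir" -mindepth 1 -maxdepth 1 -exec mv -b -t . {} +
fi
rm -rf "$tmpdir"
```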

Any other comments? I'll squash my commits if you'd prefer this as a single all-in-one commit.

Afforess commented 7 years ago

~~I've been using this a bit and I did notice that --skip-if-exists does not always play nicely with rar archives split into parts. I'll try and see if this can be fixed.~~

I added a test and it seems to be fine.

Afforess commented 7 years ago

I've been using this for a few days, and it seems to work well for the most part. However, I noticed some negative interaction between two of the new features, --skip-if-exists and --depth (recursive extraction of nested rar archives): --skip-if-exists doesn't correctly detect extracted inner rar archives because of the design. Here's an example workflow where it has issues:

  1. A directory foo/ contains bar.rar. bar.rar contains quz.rar inside, which further contains a file data. (e.g. foo/bar.rar/quz.rar/data)
  2. unrarall --skip-if-exists foo/ is executed (note: --depth 4 is implicitly set)
  3. unrarall extracts bar.rar to foo/<tmp dir> and then executes a child-process unrarall --skip-if-exists --depth 3 foo/<tmp dir> to extract any inner rar archives.
  4. unrarall extracts quz.rar inside foo/<tmp dir>/<tmp dir> and then detects no inner rar archives, so data is moved back to the original directory location, foo/<tmp dir>, and exits the child process
  5. unrarall finishes extracting inner rar archives, moves data back to the original directory location, foo/, and exits

In step 4, unrarall was running in a child process invoked from the original script, so it was unaware that foo/, not foo/<tmp dir>, was the location it needed to search for files that already existed and need not be extracted. As a result it needlessly re-extracted data. The failure here is graceful: unrarall didn't lose data and didn't delete anything it shouldn't have; it just wastes CPU cycles and doesn't properly skip extraction.
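For anyone wanting to reproduce the scenario, something like the following should do it (a hypothetical setup, assuming the rar binary is available; not a test from this PR):

```sh
# Build the nested layout foo/bar.rar -> quz.rar -> data from the example above.
mkdir -p foo && cd foo
echo "payload" > data
rar a quz.rar data && rm data        # quz.rar contains data
rar a bar.rar quz.rar && rm quz.rar  # bar.rar contains quz.rar
cd ..

# With foo/data already present (e.g. from an earlier extraction), the child
# unrarall invocation searches foo/<tmp dir> rather than foo/, so the inner
# archive's contents are extracted again instead of being skipped.
unrarall --skip-if-exists foo/
```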

My initial thought is that there are three options for this remaining issue:

Thoughts @delcypher, @arfoll, others?

If "Do nothing" is acceptable, this PR can be merged now, as-is. Otherwise changes need to be made to fix the last outstanding issue.

Afforess commented 6 years ago

Yawn, well a whole year has practically gone by, and I've been using this for a while. I've not seen any major issues - has anyone else tried out the changes in this PR?

arfoll commented 6 years ago

@Afforess sorry - understand the frustration, been busy with other stuff. I've been using this as well for a while now (although not really the new features, since they're not really needed for my use cases...) and not found any issues, so let me go through it again to jog my memory and I'll happily merge it.

arfoll commented 6 years ago

Merged - thanks a lot and apologies again for the delay.

Afforess commented 6 years ago

Glad this got merged. I hope my comments weren't overly annoying; they probably could have been toned down.

arfoll commented 6 years ago

No worries, bash scripts are frustrating to maintain & patch :)