arsenetar / dupeguru

Find duplicate files
https://dupeguru.voltaicideas.net
GNU General Public License v3.0

Find a file within another (larger) file #767

Open abolibibelot1980 opened 3 years ago

abolibibelot1980 commented 3 years ago

Describe the solution you'd like

A worthy feature would be the ability to find a file anywhere within another (obviously larger) file, whether or not the beginning of the smaller file matches the beginning of the larger one. It would allow, for instance, detecting individual files that are included in ISO images or (uncompressed) archives. It would also allow matching fragments of files extracted by data recovery software in “raw file carving” mode which are actually parts of valid files recoverable through filesystem analysis. This happens in particular with video file types that do not have the typical structure of a single header followed by video / audio streams, but are composed of multiple chunks, each beginning with its own header (and which can for that reason be treated as individual files by data recovery software), like MPG / VOB / MTS video files (I've asked about that particular issue at SuperUser but didn't get any insight; also at HDDGuru, also to no avail).

When recovering all the data from a 4TB HDD, using R-Studio (in filesystem analysis mode and in “raw file carving” mode) and Photorec (specialized in “raw file carving” recovery), I ended up with ~350GB of MTS files and ~550GB of MPG files as “raw” or “extra found” files. The vast majority of those files are actually fragments of valid and complete MPG / VOB / MTS / M2TS files (which were recovered by R-Studio in filesystem analysis mode), but there's no easy way of determining, for each of those fragments, whether it indeed belongs to another file, which one, and whether it entirely matches a part of that file or differs beyond some point (in which case, most likely because the original file was fragmented on the source drive, the next part may or may not belong to another file, and further examination is required to match the yet unidentified part). The goal here is to get rid of anything redundant and keep only fragments which have no counterpart in complete files.

Describe alternatives you've considered

What I could come up with so far is as follows:

1) Extract a short string of hex values at a fixed offset near the beginning of each unidentified file fragment into a text file with a PowerShell script (I got help for that here); for instance: 20 bytes at offset 40000.
2) Load this list into WinHex as a list of search terms and run a “simultaneous search” in “logical” mode (meaning: it analyses a given volume on a file-by-file basis, and reports the logical offset where each string was found in each individual file).
3) Based on the search report, for each group of identified files matching one of these extracted strings, compare checksums between the whole file fragment and the corresponding segment of the bigger file (to that end, I edit the search report into a PowerShell script using a small CLI tool called “dsfo” to compute MD5; it could probably be done with PowerShell alone, but it works well and makes for smaller scripts), and delete the fragment if there is indeed a complete match. If the MD5s don't match, it means that only a part of file A is included in file B, and I have to examine them manually, or extract the unidentified parts as new files and run a new search with the same method until everything is thoroughly accounted for (or until I give up this convoluted madness!).
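For what it's worth, step 3 doesn't strictly need dsfo; here is a minimal Python sketch of the segment comparison, assuming the logical offset inside the larger file is already known from the search report. The file names and offset are just the example values from further down in this post, not anything dupeGuru provides:

```python
import hashlib
import os

def md5_of_range(path, offset, length, chunk=1 << 20):
    """MD5 of `length` bytes starting at `offset` in `path`, read in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            data = f.read(min(chunk, remaining))
            if not data:
                break
            h.update(data)
            remaining -= len(data)
    return h.hexdigest()

fragment = "f5626998852.mpg"     # recovered fragment (example values, see below)
container = "VTS_05_1.VOB"       # larger candidate file
offset_in_container = 22661120   # logical offset reported by the search step

frag_size = os.path.getsize(fragment)
if md5_of_range(fragment, 0, frag_size) == md5_of_range(container, offset_in_container, frag_size):
    print("fragment is entirely included in the larger file")
else:
    print("mismatch: only part of the fragment (if anything) is included")
```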

But that method is quite complicated and tedious, and another difficulty is that, whatever offset value I choose, there are always hundreds of strings (out of a few thousand files) which are not specific enough to yield only relevant matches (for instance “00 00 00 00 …” or “FF FF FF FF …”, or even more complex strings which happen to be present in many unrelated files). WinHex itself has a “block-wise hashing and matching” feature (only available with the “Forensics” license) which would seem like it could do what I want, but it creates a hash database for every single 512-byte sector of each input file (it can't be set to a bigger block value), requiring a huge amount of space just to store that (an MD5 hash stored as a hex string takes 32 bytes, so building the hash database for 500GB of input files requires about 30GB, and actually twice that amount since it first creates a temporary file). And then the result of the analysis is useless for this purpose because, contrary to the “simultaneous search”, it doesn't report logical offsets (relative to the beginning of each file); it only reports a list of all sectors from input files indexed in the hash database which were found at physical offset X on the analysed volume. So, back to square one. é_è

So, doing this in an automated way would require a utility that can do the following (a rough sketch of this idea is given after the notes below):

1) Compute the checksum of a small block at the beginning of each file in group A (for instance the first 4KB cluster; that value would have to be adjustable), and store those checksum values in memory or in a temporary database.
2) Scan files in group B, computing checksums of each block, until a match is found with one of the values generated in step 1.
3) Compare the entirety of the corresponding file from group A with its presumed counterpart from group B, until the end of either file, or until a discrepancy is found, and report something like:

File "f5626998852.mpg" [size 36050944] [offset 0-36050943] matches file "VTS_05_1.VOB" [size 273133568 bytes] [offset 22661120-58712063]
Matching score : 100% (relative to size of "f5626998852.mpg")

(This is an actual example from two files I compared with WinHex, based on the method described above ; file "f5626998852.mpg" was recovered by Photorec, and is entirely included in "VTS_05_1.VOB" which is part of a DVD folder.)

(That method would work in that kind of scenario, and it would also work for finding files included in ISO files, but it wouldn't work for finding files included in uncompressed archives, as those wouldn't be aligned on regular sector / cluster boundaries. In that case, it would be necessary to index and search strings of data instead of block checksums, the difficulty being, again, that short strings of data must be highly specific to yield relevant matches only. Perhaps this could be assessed with an entropy analysis: parse each file until a short X-byte string is found that matches a given entropy threshold, then store that string as well as its relative offset in the temporary database, then search simultaneously for all strings stored in the database in the target folder(s).)
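To make the three numbered steps above concrete, here is a rough, unoptimized Python sketch of such a utility. It assumes 4KB blocks and that a fragment starts on a block boundary inside the larger file (the sector/cluster-aligned case, not the archive case just mentioned); all names and parameters are illustrative, nothing here exists in dupeGuru:

```python
import hashlib
import os

BLOCK = 4096  # adjustable block size (step 1)

def first_block_hash(path):
    """Hash of the first block of a fragment from group A."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(BLOCK)).digest()

def bytes_match(frag_path, container_path, offset):
    """Step 3: compare the whole fragment to the container starting at `offset`."""
    with open(frag_path, "rb") as a, open(container_path, "rb") as b:
        b.seek(offset)
        while True:
            da = a.read(1 << 20)
            if not da:
                return True        # reached the end of the fragment: full match
            if da != b.read(len(da)):
                return False       # discrepancy found (or container ended early)

def find_fragments(group_a_files, group_b_files):
    # Step 1: index fragments by the checksum of their first block
    index = {}
    for frag in group_a_files:
        index.setdefault(first_block_hash(frag), []).append(frag)

    # Step 2: scan group B block by block, looking for indexed checksums
    for container in group_b_files:
        size = os.path.getsize(container)
        with open(container, "rb") as f:
            for offset in range(0, size, BLOCK):
                block = f.read(BLOCK)
                for frag in index.get(hashlib.md5(block).digest(), []):
                    if bytes_match(frag, container, offset):
                        frag_size = os.path.getsize(frag)
                        print(f'File "{frag}" [size {frag_size} bytes] matches '
                              f'"{container}" [offset {offset}-{offset + frag_size - 1}]')
```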

FredWahl commented 3 years ago

That could work for known compound file types like zip, rar and pdf files. It could also be done if dupeguru had an API such that you could write your own comparison add-in. I have suggested this; maybe you want to comment on my suggestion?

abolibibelot1980 commented 3 years ago

@FredWahl Sorry for the long delay... Where did you write the suggestion you mentioned above? In fact, as silly as it may sound, I wrote all this before I actually tested DupeGuru, and then, after I did test it, I quickly realized that it was much less sophisticated than I thought it would be, based on its name and a strong recommendation I had read somewhere. So what I requested was far beyond the scope of this program as it currently exists. In fact I'm surprised that it was so strongly recommended, as it is far less sophisticated than AllDup for instance, at least for basic binary comparison (I haven't tested the picture similarity approach), or DoubleKiller, which hasn't been updated since 2007 (and has the major caveat of not recognizing Unicode characters) but allows setting the comparison parameters much more precisely.