BiglySoftware / BiglyBT

Feature-filled Bittorrent client based on the Azureus open source project
https://www.biglybt.com
GNU General Public License v2.0

(Windows 10) Search for Existing Data Files ... #1651

Closed ms52538 closed 10 months ago

ms52538 commented 4 years ago

Java 1.8.0_202 (64 bit) c:\program files\biglybt\jre
SWT v4930r7, win32, zoom=100, dpi=96
Windows 10 v10.0, amd64 (64 bit)
B2.4.0.1_B03/4 az3

Hey Team - I hope all is well with everyone. Looks like you've been keeping busy.

I have a potential problem occurring with Search for Existing Data Files ... (SfEDF)

To Validate: First, let me ask for validation of the premise that renaming a Windows file does not affect the MD5 hash of the file itself. For example, I download a torrent with 17 files. One of the files, named 'sdfjhlskfhi.mp4', has the MD5 hash '15F774A9B218D96AE34EE21390A0A09F'.
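The premise can be checked directly, since a content hash is computed from the file's bytes and never sees the file name. The following is a minimal sketch, assuming a small scratch file in a temporary directory; the class name `RenameHashCheck` and the new file name are hypothetical. Note that standard (v1) torrents store SHA-1 hashes of fixed-size pieces rather than a per-file MD5, so this only illustrates the rename question, not how BiglyBT matches files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class RenameHashCheck {
    // Hex MD5 of the file's bytes; the file name is never part of the input.
    static String md5Hex(Path file) {
        try {
            // readAllBytes is fine for a sketch; stream large video files instead.
            byte[] hash = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(file));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02X", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hashes a scratch file, renames it, hashes it again, and compares the two digests.
    static boolean hashSurvivesRename() {
        try {
            Path dir = Files.createTempDirectory("rename-check");
            Path original = Files.write(dir.resolve("sdfjhlskfhi.mp4"), new byte[] {1, 2, 3, 4, 5});
            String before = md5Hex(original);
            // Hypothetical new name, in the style of a bulk renamer.
            Path renamed = Files.move(original, dir.resolve("Collection (1).mp4"));
            String after = md5Hex(renamed);
            Files.delete(renamed);
            Files.delete(dir);
            return before.equals(after);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hashSurvivesRename()); // prints "true"
    }
}
```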

Scenario 1:

  1. At this stage, the torrent is complete, 100%, and operates fine. Using BiglyBT, I stop the torrent.
  2. Using Windows Explorer, I make a copy of the files from the torrent's location on my drive, and put them into a separate root folder as backup.
  3. Using Windows Explorer, while the torrent remains in a stopped state within BT, I then delete the existing torrent's video files in their original location where BT has placed them from the download. (we still have a backup copy of the files).
  4. I use a bulk-renaming utility to bring some sense and order to the copy of backup files I've stored in a separate root folder. (I apply exif tag info read from the file which normally includes things such as Video Length ~ MD5 Hash along with static information I provide such as the Collection Name, Sub-Directory, and the sort. The bulk-renaming app performs the reads and I apply the changes to the names. Now the backup files have had their names changed.)
  5. In BT, I click on the torrent (still in a stopped state) and click to perform a re-check of the torrent. (BT identifies that files are missing, changes the existing status to 0% for the files missing, creates new shell files in the directory (where the others had existed prior), and BT moves the torrent back into an incomplete torrent status).
  6. In BT, I click on the torrent and select the SfEDF function. The pop-up box appears and I direct BT to the root folder of the backup files I have renamed. Using the "LINK" mode, with Tolerance set at 0% disparity, I click on the search button.
  7. BT identifies 95% of the renamed files and links them successfully back to the torrent. A few files are not identified. (I know that at times, when BT finds multiple files within a torrent with the same MD5 hash, it will skip assigning them back to the torrent, even though my bulk-renaming application has identified that there are multiple files with the same hash and I have addressed them by appending a space and then (1), (2), (3) at the end of the file names so that the rename succeeds. So when BT skips linking a few files, it is normally my indicator that they are the identical-MD5-hash files contained within the torrent.)
  8. HOWEVER - BT misses identifying and linking other files within the torrent, not just the identical-MD5-hash files. I perform a few spot checks, such as validating that all the characters are Western and that no symbols are used (just text) in both the file name and the MD5 hash. With larger torrents that have significantly more files, BT can miss up to 30 or 40 files out of 500.
  9. Using Windows Explorer, in the root folder containing the backup copy of files renamed by the bulk-renaming utility, I manually change those files which BT identified as NOT BEING FOUND (not linked) back to their original file names from when the torrent was first downloaded.
  10. Using Windows Explorer, I then copy the unlinked files from Step #9 back to the original folder of the torrent. I receive a notification that the files already exist, and I go ahead and replace them.
  11. Using BT, I go to the torrent and perform a Force-Recheck. BT successfully identifies the missing torrent files in the directory, and the torrent passes 100% completion and changes status to complete.

In this scenario: (A) BT missed identifying SOME of the renamed files located in our root folder backup copy, but was successful in identifying and linking other files contained in the same folder. (B) When I take the files that BT missed in (A), rename them through Windows Explorer back to the names expected by the torrent, and copy them back into the torrent's folder, BT successfully identifies them.

Based on this, either renaming the files does affect the MD5 hash (which I would not expect), or BT's SfEDF scan is not accurately identifying 100% of the files targeted in a scan.

Scenario 2: I can duplicate Scenario 1 by following those steps up to and including making a backup of the files, then within BT, DELETING the existing torrent (and all files), then re-adding the torrent in a stopped state, then using SfEDF and directing BT to the root folder where the completed backup files are located. In this scenario, I have not renamed the files with a bulk-renaming utility; the files are left exactly as they were when copied from the original completed torrent download.

Based upon Scenario 2, and comparing it to Scenario 1, it appears that BT's SfEDF is not getting an accurate and complete read of the MD5 hash (or whichever hash variant is used; I know MD5 is the short one, but I am not sure which one BT uses) from files in the targeted location. In this case that is a specifically mapped root folder location, but it misses the same files if I use the 'default' search locations.

My environment: I am running PLEX, and it updates every 15 minutes or any time it detects a change to a file within any folder in its library. I am running Windows Search and Indexing in the same environment as well. So BT is operating in that same sphere, dynamically, alongside those two services. I believe PLEX has a few additional services for META matching which run in the background (i.e. Python), so I am uncertain whether there can be any kind of interference when BT attempts to perform a SfEDF scan.

Am I expecting too much of SfEDF or are there variables I need to tweak or consider?

Thanks!

parg commented 4 years ago

The issue at step (7) (multiple files with the same hash) should have been fixed in a recent 2401 beta via https://github.com/BiglySoftware/BiglyBT/commit/7820477f268863820991c77ba7655858116d414e

Are the files that are missed smaller than twice the piece size of the torrent? There are two phases in the matching process. The first applies to files that contain at least one piece (hash checking between files can only be done in piece-sized chunks). Once those matches are identified, BiglyBT tries to find a common 'root' folder for the match results and then performs exact name-based matches on the remaining files.
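Why "twice the piece size" is the safe threshold can be sketched with a little arithmetic. In a standard torrent the files are laid end to end and cut into fixed-size pieces, so whether a file fully contains a piece depends on where it starts relative to a piece boundary. The sketch below is an illustration of that layout only, not BiglyBT's actual check; `PieceCoverage` and `containsWholePiece` are hypothetical names, and the shorter final piece of a torrent is ignored.

```java
final class PieceCoverage {
    // True if the byte range [offset, offset + length) of the torrent's
    // concatenated data fully contains at least one piece.
    static boolean containsWholePiece(long offset, long length, long pieceSize) {
        // First piece boundary at or after the start of the file.
        long firstBoundary = ((offset + pieceSize - 1) / pieceSize) * pieceSize;
        return firstBoundary + pieceSize <= offset + length;
    }

    public static void main(String[] args) {
        long piece = 512 * 1024; // 512 kB, an example piece size

        // A file of exactly one piece, aligned on a boundary: covered.
        System.out.println(containsWholePiece(0, piece, piece));         // true

        // Worst-case alignment (starts 1 byte past a boundary):
        // 2 * piece - 2 bytes is not enough, 2 * piece - 1 bytes is.
        System.out.println(containsWholePiece(1, 2 * piece - 2, piece)); // false
        System.out.println(containsWholePiece(1, 2 * piece - 1, piece)); // true
    }
}
```

A file shorter than one piece can never be hash checked this way, a file between one and two pieces long depends on its alignment, and a file of at least twice the piece size always contains a whole piece.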

The log from the SfEDF window might provide clues.

ms52538 commented 4 years ago

I just performed Scenario 1. Here are the log entries pertinent to the 2 files being missed by a SfEDF search:

Found 33 files with 33 distinct sizes
(Neither of the 2 missing files appears in the search log. The 31 other files do, with file name, testing, and linking entries; the 2 missing files have no such entries.)
Linked 0 of 2

Of the 2 files missed by the search, the BT "Size" field shows their incomplete 'stub files' as follows (they are unmatched and have not been downloaded into the torrent's folder, so the torrent is at roughly 99.5% complete): File #1 is 746.1 kB, and File #2 is 484.5 kB.

The 'completed' correct Files, as previously backed up to another folder, as reported in Windows Explorer: File #1 is 14,494 kB, and File #2 is 41,037 kB

In BT, under the uncompleted torrent, BT is reporting nothing in the Pieces tab, and it is showing 3 empty boxes within the PieceMap (all others are solid blue).

I'll perform a separate group of tests to validate the duplicate hash issue, possibly today, I suspect it may still not work because I was having issues with a torrent this past Monday.

Thoughts?

parg commented 4 years ago

So the missing files appear likely to be too small to be hash checked. Towards the end of the SfEDF log there should be things generated by

logLine( viewer, dm_indent, "Matched=" + actions_established.size() + ", complete=" + already_complete + ", ignored as not selected for download=" + skipped + ", no candidates=" + no_candidates + ", remaining=" + unmatched_files.size() + " (total=" + files.length + ")");

logLine( viewer, dm_indent, "Looking for other potential name-based matches" );

It would be interesting to see the log from that point onwards.

ms52538 commented 4 years ago

5/21/20 9:23 AM: Enumerating files in P:\HOBBIES\Painting\PaintByNumbers!RENAMED VIDEOS (Backup)
Found 33 files with 33 distinct sizes
Processing 'PaintByNumbers-Beginner', piece size=512.0 kB
Matched=0, complete=41, ignored as not selected for download=0, no candidates=3, remaining=2 (total=46)
5/21/20 9:23 AM: Complete, downloads updated=0

Note: this torrent contains pics but those are located in a sub-directory of their own in the torrent. I'm just focusing on Video Files. (SfEDF does not seem able to identify pre-existing image files, btw. Why is that?)


ms52538 commented 4 years ago

I exported the torrent into XML, and found these entries for you, if relevant:

764033    PaintByNumbers _46.mp4
496209    PaintByNumbers _48.mp4


parg commented 4 years ago

3 files in the torrent had a file length that didn't occur in any files in the selected folder (hence 'no candidates=3').

2 files did have a file (or more) with the same length but failed to be matched (probably because they were too short)

Unfortunately if no matches occur at all during the SfEDF operation (as in this case) then the subsequent search based on name doesn't occur as a common root folder (from the matching process) can't be found (as there were no matches)
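The 'no candidates' count described above can be illustrated with a size index: group the files found on disk by exact length, then look up each torrent file's length. This is a minimal sketch of that idea only, not BiglyBT's implementation; `SizeCandidates`, its methods, and the example file names and lengths in `main` (other than the two lengths quoted in this thread) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SizeCandidates {
    // Groups the names of files found on disk by their exact length in bytes.
    static Map<Long, List<String>> indexByLength(Map<String, Long> filesOnDisk) {
        Map<Long, List<String>> index = new HashMap<>();
        for (Map.Entry<String, Long> e : filesOnDisk.entrySet()) {
            index.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return index;
    }

    // Counts torrent files whose length matches no file on disk ("no candidates").
    static int countNoCandidates(long[] torrentFileLengths, Map<Long, List<String>> index) {
        int count = 0;
        for (long length : torrentFileLengths) {
            if (!index.containsKey(length)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, Long> disk = new HashMap<>();
        disk.put("renamed-a.mp4", 764033L); // same length as a torrent file
        disk.put("renamed-b.mp4", 496209L); // same length as a torrent file
        Map<Long, List<String>> index = indexByLength(disk);

        // Third length is hypothetical and has no file of that size on disk.
        long[] torrentFiles = {764033L, 496209L, 1500000L};
        System.out.println(countNoCandidates(torrentFiles, index)); // prints 1
    }
}
```

A file with a size candidate still has to pass the content check, which is where files too short to contain a whole piece fall through to name matching.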

ms52538 commented 4 years ago

That leaves me confused. BT can work with the torrent, download it, complete the file structure and populate the files. But if I rename the entire directory for a moment in Windows Explorer, then within BT delete the torrent and files so everything is cleaned up, then re-download the torrent again but this time in a STOPPED state, perform a SfEDF and point it at the directory that was renamed, BT cannot find all the files it had already downloaded (only some larger percentage)? Essentially that is this scenario. Yes? I am confused as to why it can find some, but not all. Sorry, I'm just trying to make sense of what I'm reading.

parg commented 4 years ago

There are files that are too short to be matched by checking their content against other files (if a torrent's piece size is, say, 1MB, then a file has to be at least 1MB in size for an attempt to be made to see if it is the same file or not)

Files smaller than this can only be matched by looking at their name.

Considering the file name might be 'sample.png' and that there may be many 'sample.png' files scattered through the potentially huge file hierarchy that is being searched, the name matching process only kicks off if already matched files (in this matching run) have a common root location.

For example, files A and B have been fixed up from location

x/y/z/A
x/y/z/B

the common root here is deduced to be "x/y/z" and name matching will only be attempted relative to that root. If the CURRENT matching process has identified no matches then the root can't be deduced.
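The root deduction in this example can be sketched as follows. This is a simplified illustration under the assumption that the root is the deepest directory containing every matched file; `CommonRoot` and its methods are hypothetical names, and the real logic in BiglyBT may differ (for instance in how it accounts for sub-folders inside the torrent).

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

final class CommonRoot {
    // Deepest directory containing every matched file, or null when there
    // are no matches or the matches share no directory.
    static Path commonRoot(List<Path> matchedFiles) {
        Path root = null;
        for (Path file : matchedFiles) {
            Path dir = file.getParent();
            if (dir == null) {
                return null;
            }
            if (root == null) {
                root = dir;
                continue;
            }
            // Walk the current root upwards until it is a prefix of this file's directory.
            while (root != null && !dir.startsWith(root)) {
                root = root.getParent();
            }
            if (root == null) {
                return null;
            }
        }
        return root;
    }

    // Where a small, unmatched file would be looked for by name, or null without a root.
    static Path nameCandidate(Path root, String fileName) {
        return root == null ? null : root.resolve(fileName);
    }

    public static void main(String[] args) {
        List<Path> matched = Arrays.asList(Paths.get("x/y/z/A"), Paths.get("x/y/z/B"));
        Path root = commonRoot(matched);
        System.out.println(root.equals(Paths.get("x/y/z")));   // true
        System.out.println(nameCandidate(root, "sample.png")); // x/y/z/sample.png (separator depends on the OS)
    }
}
```

With an empty match list `commonRoot` returns null, which mirrors the case in this thread: with `Matched=0` there is nothing to deduce a root from, so no name-based lookup can be attempted.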

parg commented 4 years ago

As per your scenario - please reproduce it and send me the entire log from the SfEDF window (email to paul@biglybt.com if you want)

ms52538 commented 4 years ago

That adds clarity and logic, so I'm processing it all. While the SfEDF function is legit, it is problematic that it cannot find pre-existing files that do exist, because of their size. As I have read other posts, people use the function as a legit means to find pre-existing content on their drives that may have been downloaded. Question: is there a way to pre-identify which files within the torrent would 'fail' IF a SfEDF function were to be performed? Because it sounds like any file smaller than 1MB would be at risk of failing.
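One way to pre-identify at-risk files, sketched under assumptions: given the file lengths in torrent order and the piece size, walk the cumulative offsets and flag every file that does not fully contain a piece. This reflects only the standard piece layout, not BiglyBT's own logic; `SmallFileReport` is a hypothetical name, and the first length in the example is made up (the second is one of the lengths quoted in this thread).

```java
import java.util.ArrayList;
import java.util.List;

final class SmallFileReport {
    // Returns the indexes (in torrent order) of files that do not fully
    // contain any piece and so could only be matched by name.
    static List<Integer> filesWithoutWholePiece(long[] fileLengths, long pieceSize) {
        List<Integer> atRisk = new ArrayList<>();
        long offset = 0; // start of the current file within the torrent's data
        for (int i = 0; i < fileLengths.length; i++) {
            long end = offset + fileLengths[i];
            // First piece boundary at or after the start of this file.
            long firstBoundary = ((offset + pieceSize - 1) / pieceSize) * pieceSize;
            if (firstBoundary + pieceSize > end) {
                atRisk.add(i);
            }
            offset = end;
        }
        return atRisk;
    }

    public static void main(String[] args) {
        long pieceSize = 512 * 1024;
        // Hypothetical 100000-byte file followed by a 764033-byte file.
        long[] lengths = {100000L, 764033L};
        System.out.println(filesWithoutWholePiece(lengths, pieceSize)); // prints [0, 1]
    }
}
```

The second file is longer than one piece yet still flagged, because it starts mid-piece and ends before the next full piece completes. So the risk is not fixed at 1 MB: it depends on the torrent's piece size and on each file's position in the torrent.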


I know we've chatted in the past about my renaming efforts and having to use an external app to perform that function where 'serious horsepower' is needed - and you coding the ability to 'batch rename' with a pop-up window that I can then copy to a text file, rename the files, copy back over to the pop-up window and apply. BT is not trying to be a bulk-renaming application. But in all seriousness, this is a sort of 'window of opportunity' where a one-size-fits-all solution is needed. On that note, I am using "Advanced Renamer V.3.85", which allows me to build out renames of files using pre-existing tag information from the files (with the caveat that they exist in the Windows file system), i.e. incremental numbering, checksum, video tags, date/time tags, image tags ... all essential items for some large torrent collections where Order is required to tame the Chaos of file and directory names.

For folks like myself, who manage very large 'eco' systems of torrents through BT (I mean, it is the granddaddy Cadillac of Torrent Apps), renaming files is a regular thing. People who use mainly public trackers probably don't care about a torrent once their download of its files is complete. They can bring Order using any tools they would like to apply to the files. But for those of us who interact with private trackers, the need to maintain ratio is important, so the torrents must continue to exist and be made available for uploading. It is time-consuming work to change file names. :/ Hence SfEDF has been a big blessing, as has the batch-rename function. The other component is the EXIF functionality which Advanced Renamer 3.85 brings to the equation. IF, big IF, I can pre-identify files that might fail in the look-up BEFORE I attempt to apply bulk name changes to all the files in the torrent, it would allow me to perform individual renames within BT using Advanced Renamer's "New Name", which I could simply copy/paste into BT to perform the change with the EXIF info wanted.

Just thinking of options.

Thanks! As always. :) I appreciate your time.

ms52538 commented 4 years ago

Just emailed you with the scenario data, and screenshot :)

parg commented 4 years ago

B05's out, hopefully fixes things somewhat!