Open ShadowKnightMK4 opened 4 months ago
The current plan, which applies to folders, has four modes:
Off - the current behavior. If we run into multiple folders pointing to the same data, we scan that data multiple times.
Partial - behaves like Off until we encounter an item that may have more than one link to it. Looking only at the .NET file system data, the only indicator currently seems to be the presence of a reparse point. We add those to the list, and when one is found again, we skip it.
Full - like Partial, but every folder is logged regardless.
Conditional - some folders are added automatically and some are never added, depending on the system.
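As a rough illustration of how the four modes might decide whether a folder gets recorded, here is a minimal Python sketch. The names `PruneMode` and `should_track`, and the `always` / `never` sets, are hypothetical and not taken from the project. Falling back to the Partial rule inside Conditional mode is also an assumption, since the plan does not say what happens to folders in neither set.

```python
from enum import Enum, auto


class PruneMode(Enum):
    OFF = auto()
    PARTIAL = auto()
    FULL = auto()
    CONDITIONAL = auto()


def should_track(mode, path, is_reparse_point, always=frozenset(), never=frozenset()):
    """Return True if this folder should be recorded in the seen list.

    `always` and `never` hold lower-cased paths without a trailing separator.
    """
    key = path.rstrip("\\/").lower()
    if mode is PruneMode.OFF:
        return False              # nothing is recorded, so nothing is skipped
    if mode is PruneMode.PARTIAL:
        return is_reparse_point   # only folders that may be linked more than once
    if mode is PruneMode.FULL:
        return True               # every folder is logged
    # CONDITIONAL: per-system lists decide first.
    if key in never:
        return False
    if key in always:
        return True
    return is_reparse_point       # assumption: otherwise behave like Partial
```

For example, `should_track(PruneMode.PARTIAL, r"C:\Euphoria", is_reparse_point=False)` returns `False`, while the same call in `PruneMode.FULL` returns `True`.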
So there's a puzzling aspect. The current implementation, DupSearchPruning, seems to have a bug, or I'm not understanding something.
The current plan is to call GetHash() on the path string and store the hash plus the location in question in a ConcurrentDictionary. We test whether the key is already there and, if so, we skip the search. The screenshot appears to show that the collision is not detected.
Thread C:\Euphoria (the starting anchor for that worker thread)
Thread C:\Euphoria\bin (the anchor point of another thread; it is a subfolder of the first)
Expected action:
Find C:\Euphoria\Bin in our dup list, placed there by another thread, and skip it.
Current action:
Does not appear to skip.
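The cause of the missed match is not confirmed here. Two things worth ruling out are (a) keys that differ only in case or a trailing separator, such as `bin` vs `Bin`, which hash differently as raw strings, and (b) a check-then-add sequence that is not atomic across threads. The Python sketch below illustrates a seen-list that avoids both; `SeenFolders`, `try_claim`, and `normalize` are hypothetical names, not the project's API. In .NET the closest analogue would be a `ConcurrentDictionary` constructed with `StringComparer.OrdinalIgnoreCase` and a single `TryAdd` call instead of a separate lookup followed by an add.

```python
import threading


def normalize(path):
    # Assumption: Windows-style paths, compared case-insensitively and
    # without a trailing separator. Links and 8.3 short names are NOT resolved.
    return path.replace("/", "\\").rstrip("\\").lower()


class SeenFolders:
    """Thread-safe record of folders that some worker has already claimed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = {}  # normalized path -> name of the worker that claimed it

    def try_claim(self, path, owner):
        """Return True if the caller should scan `path`, False if it was already claimed."""
        key = normalize(path)
        with self._lock:  # the check and the insert happen as one atomic step
            if key in self._seen:
                return False
            self._seen[key] = owner
            return True


seen = SeenFolders()
first = seen.try_claim(r"C:\Euphoria\bin", "worker-2")   # claimed by the subfolder's own thread
second = seen.try_claim(r"C:\Euphoria\Bin", "worker-1")  # same folder reached from C:\Euphoria
```

Here `first` is `True` and `second` is `False`, so the worker that started at C:\Euphoria would skip the subfolder. Keying on the normalized path itself, not on its hash, also rules out two different paths being treated as the same folder because their hashes happen to collide.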
Technically works BUT slows the search down and sometimes crashes. Will likely need to adjust expectations.
The de-duplicator should make the search take less time, not more.
This post was partially generated by GPT but demonstrates the plan. Note that storing the folders seen in a simple list doesn't scale well. Consider a system with 20,000 individual folders, where C:\Windows is one folder and C:\Windows\System32 is another. That's a crazy number of compares.
Example: For C:\Windows\System32\Drivers, the class:
Adds C:\ as the root. Adds Windows under C:. Adds System32 under Windows. Adds Drivers under System32.
Each folder in a complete path is given its own node and possible sub-nodes. Rather than comparing thousands of individual folder paths each time we visit a folder, we check whether our node list already contains a pathway to the final location. Having one means we've seen it already.
Going back to the example above:
The first check is for the presence of C:. Found it? The next check is for Windows. Found Windows? The next check is for System32, and so on down the path.
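The node structure described above is essentially a prefix tree (trie) keyed by path component. Here is a minimal Python sketch of the idea; `PathNode` and `PathTrie` are hypothetical names, and lower-casing the components assumes case-insensitive Windows paths. The `terminal` flag is an addition that the description above does not mention. Without it, adding C:\Windows\System32\Drivers would also make C:\Windows look visited, because a pathway to it now exists.

```python
class PathNode:
    def __init__(self):
        self.children = {}     # component name -> PathNode
        self.terminal = False  # True only if this exact folder was recorded


class PathTrie:
    def __init__(self):
        self.root = PathNode()

    @staticmethod
    def _parts(path):
        # Split "C:\Windows\System32" into ["c:", "windows", "system32"].
        return [p.lower() for p in path.replace("/", "\\").split("\\") if p]

    def add(self, path):
        """Record `path`; return True if it was new, False if already recorded."""
        node = self.root
        for part in self._parts(path):
            node = node.children.setdefault(part, PathNode())
        was_new = not node.terminal
        node.terminal = True
        return was_new

    def contains(self, path):
        """Walk one component at a time; stop at the first missing node."""
        node = self.root
        for part in self._parts(path):
            node = node.children.get(part)
            if node is None:
                return False
        return node.terminal


trie = PathTrie()
trie.add(r"C:\Windows\System32\Drivers")
```

After this, `trie.contains(r"c:\windows\system32\drivers")` is `True` and `trie.contains(r"C:\Windows")` is `False`. Each lookup costs one dictionary probe per path component, regardless of how many folders have been recorded. The sketch is not thread-safe. With several worker threads, `add` would need a lock or per-node concurrent maps.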
Still need to stabilize it and write unit tests.
So I learned that just because the unit tests pass doesn't mean the code is right. This feature is still in progress.
Note that this feature may morph into some kind of profiling feature if I can't get it to be minimally acceptable (keeping it as delay-free as possible).
The objective is to avoid following links that can possibly point to each other, and to offer targeted searches with less duplicated work.