ForNeVeR / TruePath

File path abstraction library for .NET.
https://fornever.github.io/TruePath/
MIT License

Improve case-sensitive path comparison #20

Open ForNeVeR opened 6 months ago

ForNeVeR commented 6 months ago

I suggest the following changes.

  1. Introduce three different path comparator kinds.

    • [x] Textual only. This one should operate on strict string equality, and be named accordingly (something like StrictStringPathComparer?).
    • [x] Platform-default comparer: should implement case-sensitive comparison on Linux, and case-insensitive (probably with corresponding relaxations related to Unicode normalization) on Windows and macOS.
    • [ ] File-system-aware comparer: for each compared path component, should check the actual case sensitivity of the corresponding file system subroot. For non-existent paths, it should use the platform-dependent policy of calculating the case sensitivity for new subdirectories (is it normally taken from the parent directory?).

      This one is obviously IO-intensive, so I'm thinking of introducing some sort of "sensitivity cache" that'd store the lists of checked paths and subtrees in a trie data structure, and would be used for one or multiple operations (probably one per comparer instance, with the ability of manual reset).

  2. Allow the paths to use different comparers, with the platform-default one being used by default, as the one giving the best precision while not losing performance due to intensive IO.
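To make the difference between the first two comparer kinds concrete, here is a minimal language-neutral sketch in Python. It is not TruePath's API: the function names and the `platform` parameter are made up for illustration, and NFC normalization plus `casefold()` is only an assumed approximation of the "relaxations related to Unicode normalization" mentioned above (real file systems use their own folding tables).

```python
import sys
import unicodedata


def strict_equal(a: str, b: str) -> bool:
    """Textual comparison: plain string equality, no normalization at all."""
    return a == b


def _fold(path: str) -> str:
    # Assumption: NFC + casefold() stands in for the platform's real rules.
    return unicodedata.normalize("NFC", path).casefold()


def platform_default_equal(a: str, b: str, platform: str = sys.platform) -> bool:
    """Case-sensitive on Linux, case-insensitive on Windows and macOS."""
    if platform.startswith(("win", "darwin")):
        return _fold(a) == _fold(b)
    return a == b
```

The `platform` argument defaults to the running platform and exists only so that each policy can be exercised anywhere, for example `platform_default_equal("C:/Path", "c:/path", platform="win32")`.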

ForNeVeR commented 2 months ago

Since @Kataane asked a question about the "file-system-aware" comparer in #84, I decided to elaborate on it here.

You see, in the real world, there is no such thing as a "case-sensitive operating system". There is a "case-sensitive path", or a "subtree", if you will. So, in the harsh reality, each path on the disk has its own comparison rules!

On Windows, you can control this on a per-path basis using fsutil file setCaseSensitiveInfo, see details here.

On macOS, there are some other crazy ways to switch this, and on Linux, this is obviously at least a per-mount-point thing (as most common drivers try to support Windows case-insensitivity natively).

The third path comparer would request this information from the actual file systems that are inspected, during path comparison, and use it when needed.
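One portable way to get this information, shown here as a rough Python sketch, is to probe a directory: create a file with a lower-case name and check whether it is also reachable under the upper-case spelling. This is a substitute technique chosen for illustration, not necessarily what the comparer would do (on Windows it could query the directory flag directly), and the helper name `is_case_sensitive` is hypothetical. It assumes the directory exists and is writable.

```python
import os
import tempfile


def is_case_sensitive(directory: str) -> bool:
    """Probe whether names inside `directory` are matched case-sensitively."""
    # mkstemp creates a uniquely named file; the prefix guarantees the name
    # contains lower-case letters, so its upper-case spelling is different.
    fd, probe = tempfile.mkstemp(prefix="casetest_", suffix=".tmp", dir=directory)
    os.close(fd)
    try:
        head, name = os.path.split(probe)
        # If the upper-case spelling resolves, the directory ignores case.
        return not os.path.exists(os.path.join(head, name.upper()))
    finally:
        os.remove(probe)
```

Each probe costs several file system calls, which is why caching the answers matters for a comparer built on top of it.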

In particular, let's imagine this scenario: you are on Windows, and have the following directory structure:

C:\ [case-insensitive, default]
C:\Path [case-sensitive]
C:\Path\Subpath [case-sensitive]
C:\Path\Subpath\Insensitive [case-insensitive, say it was manually restored after creating this dir]

And our comparer is asked a question: are paths C:\Path\SubPath\Insensitive and C:\Path\Subpath\Insensitive equal or not?

I imagine it should work like this:

So, as the result of comparing paths C:\Path\SubPath\Insensitive and C:\Path\Subpath\Insensitive, we get the result false, and the cache (that might be kept per comparer instance for now) gets information about C:\Path\ (that its children are stored in a case-sensitive way).
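The following Python sketch shows one way such a comparison with a trie-shaped cache could behave on the scenario above. Everything in it is an assumption made for illustration: the class name, the injected `probe` callback (directory in, "are its children case-sensitive?" out), the case-insensitive treatment of the drive letter, and the fact that it caches every directory it probes, including C:\. It also ignores non-existent paths and Unicode normalization.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class _Node:
    """One directory in the sensitivity trie."""
    sensitive: Optional[bool] = None  # None means "not probed yet"
    children: Dict[str, "_Node"] = field(default_factory=dict)


class FileSystemAwareComparer:
    """Hypothetical comparer; not part of TruePath's API."""

    def __init__(self, probe: Callable[[str], bool]) -> None:
        # probe(directory) -> True if names inside it are case-sensitive.
        self._probe = probe
        self._root = _Node()

    def reset(self) -> None:
        """Drop everything learned so far (manual cache reset)."""
        self._root = _Node()

    def equals(self, a: str, b: str) -> bool:
        parts_a = a.rstrip("\\").split("\\")
        parts_b = b.rstrip("\\").split("\\")
        # Assumption: drive letters always compare case-insensitively.
        if len(parts_a) != len(parts_b) or parts_a[0].casefold() != parts_b[0].casefold():
            return False
        directory = parts_a[0] + "\\"
        node = self._root.children.setdefault(parts_a[0].casefold(), _Node())
        for x, y in zip(parts_a[1:], parts_b[1:]):
            if node.sensitive is None:  # cache miss: ask the file system
                node.sensitive = self._probe(directory)
            if node.sensitive:
                if x != y:
                    return False
                key = x
            else:
                if x.casefold() != y.casefold():
                    return False
                key = x.casefold()
            # Descend into (or create) the trie node for this component.
            node = node.children.setdefault(key, _Node())
            directory += x + "\\"
        return True


# Simulated layout from the example: directory -> children are case-sensitive?
layout = {
    "c:\\": False,
    "c:\\path\\": True,
    "c:\\path\\subpath\\": True,
    "c:\\path\\subpath\\insensitive\\": False,
}
comparer = FileSystemAwareComparer(lambda d: layout[d.casefold()])
print(comparer.equals(r"C:\Path\SubPath\Insensitive", r"C:\Path\Subpath\Insensitive"))  # False
```

In this sketch the comparison stops at the first differing component, so only C:\ and C:\Path\ are probed, and repeating the same comparison on the same instance touches the file system zero times.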

Obviously, this will require quite a lot of work from us, and it will be quite slow in practice (orders of magnitude slower than the default comparers). But I believe it is a "must have" feature of a file system path library.