BurntSushi / walkdir

Rust library for walking directories recursively.
The Unlicense
1.21k stars 106 forks source link

feat: add `sort_by_file_name_case_insensitive` method #174

Open lucatrv opened 1 year ago

ChrisDenton commented 1 year ago

Not to be too nitpicky but maybe this should be sort_by_file_name_lowercase? Unicode make a distinction between case folding and lowercasing. From the Unicode specification (3.13 Default Case Algorithms):

Case folding is related to case conversion. However, the main purpose of case folding is to contribute to caseless matching of strings, whereas the main purpose of case conversion is to put strings into a particular cased form.

Lowercase_Mapping(C) is used for most characters, but there are instances in which the folding must be based on Uppercase_Mapping(C), instead. In particular, the addition of lowercase Cherokee letters as of Version 8.0 of the Unicode Standard, together with the stability guarantees for case folding, require that Cherokee letters be case folded to their uppercase counterparts. As a result, a case folded string is not necessarily lowercase.

BurntSushi commented 1 year ago

Hmmm right, I forgot that std doesn't have a case folding API.

I'm unfortunately inclined to say no to this and instead add some docs that shows how to do it with sort_by and something like the caseless crate.

lucatrv commented 1 year ago

How about renaming the method sort_by_file_name_lowercase to make it clear what it's doing? Or is there a better way to deal with this?

BurntSushi commented 1 year ago

I'm not a fan of that. It makes it too easy to do the wrong thing. If we're going to have a case insensitive API then we either do it right (which means either implementing case folding ourselves or introducing a dependency) or we don't do it at all and point folks in the right direction with docs. I'm not inclined to maintain the former at this time, so the latter is the path forward here.

lucatrv commented 1 year ago

IMHO since file names are typically at least alphanumeric (as opposite to numeric only, if not even Unicode), also the current sort_by_file_name method makes already easy to do the wrong thing. In my opinion to make this matter clear to the user, the best would be to add the sort_by_file_name_lowercase method, clarifying in its documentation what it does and relative limitations, and how to reach full case folding when needed. Otherwise the two options (just lowercase or full case folding) could be described in the documentation of the sort_by_file_name method (better with examples).

lucatrv commented 1 year ago

After reading the following interesting references:

https://stackoverflow.com/questions/40250988/how-can-i-case-fold-a-string-in-rust/75526819 https://stackoverflow.com/questions/45745661/lower-vs-casefold-in-string-matching-and-converting-to-lowercase/74702121#74702121 https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf https://www.w3.org/TR/charmod-norm/#definitionCaseFolding https://sts10.github.io/2023/01/29/sorting-words-alphabetically-rust.html https://icu4x.unicode.org/doc/icu_collator/index.html https://unicode-org.github.io/icu/userguide/collation/ https://www.unicode.org/reports/tr35/tr35-collation.html

now I agree with your point. So please let me know if you prefer to add the sort_by_file_name_lowercase method and describe in its documentation when it falls short, including a better example using icu_collator, or otherwise if you want all this described directly in the sort_by_file_name method's documentation.

lucatrv commented 1 year ago

Here is the example that I wanted to provide, using icu_collator:

use icu_collator::{Collator, CollatorOptions, Strength};

fn main() {
    let mut options = CollatorOptions::new();
    options.strength = Some(Strength::Primary);
    let collator: Collator =
        Collator::try_new_unstable(&icu_testdata::unstable(), &Default::default(), options)
            .unwrap();
    for entry in WalkDir::new("test").sort_by(|a, b| {
        collator.compare(
            &a.file_name().to_string_lossy(),
            &b.file_name().to_string_lossy(),
        )
    }) {
        println!("{}", entry.unwrap().into_path().display());
    }
}

Unfortunately it does not compile with error:

error[E0277]: `Rc<Box<[u8]>>` cannot be shared between threads safely
   --> src\main.rs:64:47
    |
64  |       for entry in WalkDir::new("test").sort_by(|a, b| {
    |  _______________________________________-------_^
    | |                                       |
    | |                                       required by a bound introduced by this call
65  | |         collator.compare(
66  | |             &a.file_name().to_string_lossy(),
67  | |             &b.file_name().to_string_lossy(),
68  | |         )
69  | |     }) {
    | |_____^ `Rc<Box<[u8]>>` cannot be shared between threads safely
    |
    = help: within `Collator`, the trait `Sync` is not implemented for `Rc<Box<[u8]>>`
    = note: required because it appears within the type `Cart`
    = note: required because it appears within the type `Option<Cart>`
    = note: required because it appears within the type `Yoke<CollationDataV1<'static>, Option<Cart>>`
    = note: required because it appears within the type `DataPayload<CollationDataV1Marker>`
    = note: required because it appears within the type `Collator`
    = note: required for `&Collator` to implement `Send`
note: required because it's used within this closure
   --> src\main.rs:64:47
    |
64  |     for entry in WalkDir::new("test").sort_by(|a, b| {
    |                                               ^^^^^^
note: required by a bound in `WalkDir::sort_by`
   --> C:\Users\trevisal\.cargo\registry\src\index.crates.io-6f17d22bba15001f\walkdir-2.3.3\src\lib.rs:396:54
    |
396 |         F: FnMut(&DirEntry, &DirEntry) -> Ordering + Send + Sync + 'static,
    |                                                      ^^^^ required by this bound in `WalkDir::sort_by`

For more information about this error, try `rustc --explain E0277`.

Can you please give me some advice on how to make it work?