WebAssembly / wasi-filesystem

Filesystem API for WASI

Deterministic/Reproducible Directory Listing #12

Closed: indolering closed this issue 2 years ago

indolering commented 3 years ago

The order in which files in a directory are listed is usually determined by when they were added to the filesystem b-tree, and it is a common source of non-determinism. A deterministic filesystem API would sort directory listings, and iterate over them, by filename.

Unicode provides a locale-invariant default collation (the Unicode Collation Algorithm with its Default Unicode Collation Element Table, DUCET), which can be made deterministic. But it is not suitable for use in WASI for several reasons, not least that collation weights change with every Unicode version.

The consensus on the ICU support mailing list was that code-point order is the only sane option: given the high bar of a portable, deterministic sort order, anything more complex is an implementation hazard.

Plain code-point order would be the simplest to implement and the only truly version-less sort order. However, subtle platform differences may alter the raw bag-of-bytes filename encoding: Å could be encoded as the Nordic letter U+00C5, as the angstrom sign U+212B, or as the ASCII letter A (U+0041) followed by COMBINING RING ABOVE (U+030A). A non-exhaustive list of things that can alter the filename encoding: the input method, filesystem-level normalization (as on HFS+), and conversion between UCS-2/UTF-16 and UTF-8.
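A quick illustration (a Python sketch using only the standard-library unicodedata module) of how those three encodings of Å compare unequal as raw code points yet collapse under NFC:

```python
import unicodedata

# Three plausible on-disk encodings of the same visible filename "Å".
nordic   = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"   # ANGSTROM SIGN (singleton canonical decomposition)
combined = "A\u030A"  # ASCII "A" followed by COMBINING RING ABOVE

# A raw code-point comparison sees three distinct strings...
print(len({nordic, angstrom, combined}))  # 3

# ...but NFC maps all three to the same string, U+00C5.
print(len({unicodedata.normalize("NFC", s) for s in (nordic, angstrom, combined)}))  # 1
```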

The "solution" is to perform the initial sort on some maximally compatible canonical normalization, using progressively less extreme normal forms (and eventually the raw byte-string) as tie-breakers. As the file store ultimately decides whether two strings are equivalent in a given namespace, this would result in an identical sort or raising an error. However, the idiot-in-a-hurry implementation of this "solution" (i.e. NFKx_casefold -> NFKx -> NFx -> WTF-8 byte-string -> UTF-8 byte-string) introduces its own set of stability, complexity, and performance issues.

TL;DR

  1. Top-level code-point sort based on a maximally compatible, lowest-common-denominator canonical normalization to remove platform-dependent differences.
  2. Straight NFC tie-breaker. This is roughly at the level of what most file stores use, except for edge cases like default ignorables and fractions.
  3. Raw binary tie-breaker. Again, the design intentionally minimizes the need for this, as differences due to UCS-2/WTF-8 make these comparisons nonsensical/non-portable. (A sketch of the full cascade follows below.)
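Here is a minimal sketch of that cascade in Python. Note the assumptions: NFKC plus str.casefold() only approximates NFKC_Casefold (the real form also strips default ignorables), and encoding with surrogatepass stands in for a WTF-8-style raw key:

```python
import unicodedata

def listing_key(name: str) -> tuple[str, str, bytes]:
    """Three-level sort key: compatibility form, then NFC, then raw bytes."""
    # Level 1: maximally compatible form. NFKC + casefold is only an
    # approximation of NFKC_Casefold, which also drops default ignorables.
    top = unicodedata.normalize("NFKC", name).casefold()
    # Level 2: straight NFC tie-breaker.
    nfc = unicodedata.normalize("NFC", name)
    # Level 3: raw byte-string tie-breaker; surrogatepass tolerates the lone
    # surrogates that WTF-8 filenames may contain.
    raw = name.encode("utf-8", "surrogatepass")
    return (top, nfc, raw)

# The three spellings of Å sort adjacently instead of scattering.
entries = ["\u212B.txt", "A\u030A.txt", "b.txt", "\u00C5.txt"]
print(sorted(entries, key=listing_key))
```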

I'm hesitant to suggest any further refinements without implementer feedback.

indolering commented 3 years ago

Casefold?

I don't think a case-folded comparison is necessary for a deterministic sort order, as only ancient filesystems fail to preserve case. It theoretically reduces entropy in some silly edge cases, but it is probably unhelpful in practice. I would only add a casefolding comparison level, as an option, if someone from the reproducible-builds community says it would be helpful or we have to store the NFKC_Casefold form anyway.

NFKC_Casefold

Edit: The Unicode standard pushes NFKC_Casefold for identifier comparison, going so far as to pre-compute an optimized character mapping in the UCD and to provide a dedicated NFKC_Casefold normalizer in ICU. It performs NFKC normalization, removes "default ignorables" (meta characters used for formatting), and has the same stability guarantees as the regular normalization forms. The fact that it casefolds doesn't really matter because of the NFC-level tie-breaker.

This may represent the greatest ROI for the least amount of effort and bike-shedding. The only thing that gives me serious pause about using NFKC_Casefold for the top-level sort is that it doesn't remove control characters.

indolering commented 3 years ago

NFD vs NFC

The choice between composed (Å) and decomposed (A + U+030A) canonical normal forms (NFC vs NFD) is somewhat arbitrary, as NFC(NFD(string)) == NFC(string), so choose whatever is most convenient.
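The identity is easy to check (Python sketch):

```python
import unicodedata

s = "A\u030A"  # decomposed "Å"
assert unicodedata.normalize("NFC", unicodedata.normalize("NFD", s)) \
       == unicodedata.normalize("NFC", s) == "\u00C5"
```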

Some 99.98% of text is already in NFC form, so NFC is both faster to apply and less disruptive than NFD.

NFD is the most commonly used form in filesystems, but I cannot find any good sources for the rationale. This i18n filesystem thread speculates that HFS+ chose (a variant of) NFD either because it was originally going to be the only stable normal form or because it is theoretically faster. The Linux implementation of case-insensitivity uses NFDi (NFD minus default ignorables, but not control codes!) because the developers thought NFD would be faster than NFC, which is a common misconception: most text is already in NFC form, and the fastest thing is to do nothing.

indolering commented 3 years ago

NFKC

I was hoping to skip the NFKx level and define a single ~NFC-superset comparison level that would ignore the differences in normal forms found in popular filesystems. However, this turned out to be fairly brittle, as each filesystem applies slightly different non-standard normalization for slightly different reasons. I also found out that different input methods may insert strings that won't be identical even in NFx form (e.g. Korean: "the same keypress will result in U+3131 on Windows but U+1100 on Mac").

When you factor in files that use compatibility code points to maintain round-trip compatibility with old code pages, NFKC seems necessary.
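The Korean jamo example above is easy to reproduce (Python sketch): NFC keeps the two spellings distinct, while NFKC unifies them through the compatibility decomposition of U+3131:

```python
import unicodedata

win, mac = "\u3131", "\u1100"  # HANGUL LETTER KIYEOK vs HANGUL CHOSEONG KIYEOK
print(unicodedata.normalize("NFC", win) == unicodedata.normalize("NFC", mac))    # False
print(unicodedata.normalize("NFKC", win) == unicodedata.normalize("NFKC", mac))  # True
```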

indolering commented 3 years ago

Lowest Common Denominator

This requires implementer feedback.

The simplest option to implement is NFKC_Casefold, which removes default ignorables and is available as an optimized function in ICU and as a character mapping file in the UCD. However, control codes are not default-ignorable, and it casefolds, which is unnecessary. It also removes the zero-width non-joiner (ZWNJ), which is necessary in some scripts.

We could do NFKC - default ignorables - C1 - C0 (the ASCII and Latin-1 control codes). This would be the least destructive option, but also a snowflake that no existing library implements.
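A sketch of that snowflake form in Python, with two loudly-labeled assumptions: the Cc general category stands in for the C0/C1 controls, and, since unicodedata does not expose Default_Ignorable_Code_Point, a handful of representative code points stand in for the full property:

```python
import unicodedata

# Stand-ins for Default_Ignorable_Code_Point; the real property covers far
# more code points (see DerivedCoreProperties.txt in the UCD).
DEFAULT_IGNORABLE_SAMPLE = {"\u00AD", "\u200B", "\u200E", "\u200F", "\uFEFF"}

def lcd_form(name: str) -> str:
    """Hypothetical 'NFKC - default ignorables - C1 - C0' top-level form."""
    return "".join(
        ch for ch in unicodedata.normalize("NFKC", name)
        if ch not in DEFAULT_IGNORABLE_SAMPLE
        and unicodedata.category(ch) != "Cc"  # drops C0 and C1 control codes
    )

print(lcd_form("Ang\u00ADstrom\u0007 \u212B"))  # 'Angstrom Å'
```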

We could also use the PRECIS NFKC Freeform profile. If we use PRECIS in other parts of the standard, then it would make sense to use it here. It has the benefit of additional controls around the ZWJ and bidirectional text. It would, however, remove unassigned code points. The advice from the Unicode mailing list was to leave unassigned code points alone, as new characters with non-zero ccc values are rarely assigned, and files from newer systems might use characters from minority-language communities. Instead, an implementation should emit a warning when encountering unassigned code points and, depending on a configuration option, raise an error.

sunfishcode commented 2 years ago

It would seem that anything that incorporates NFC, NFD, or Casefold at any part of the ordering would depend on a version of Unicode, and thus be nondeterministic.

Also, as I detailed here, I expect WASI-filesystem will contain a significant amount of nondeterminism in practice, so making this one aspect deterministic doesn't seem worth the significant complexity here.

indolering commented 2 years ago

It would seem that anything that incorporates NFC, NFD, or Casefold at any part of the ordering would depend on a version of Unicode, and thus be nondeterministic.

Reproducibility is the best any platform can hope to accomplish, as any Turing-complete program with access to non-deterministic inputs can be made to run non-deterministically. Sorting by code point would not be version-dependent, would cover ~95% of use cases, and would be a LOT less work than implementing a VFS or ad-hoc patching of individual programs.
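For what it's worth, a plain code-point sort needs no Unicode tables at all; in Python, for instance, str comparison is already code-point order:

```python
# No normalization, no Unicode version: a stable order on every platform,
# even though the three spellings of "Å" land in three different places.
names = ["b.txt", "\u00C5.txt", "A\u030A.txt", "\u212B.txt", "a.txt"]
print(sorted(names))  # "A\u030A.txt" first, then "a.txt", "b.txt", U+00C5, U+212B
```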

Point taken WRT complexity, performance, and implementation costs. I was hoping to propose an Evergreen Unicode release strategy and contribute a sorting implementation, but medical issues have delayed this work. Ideally, things would be architected to allow someone with the budget/time to contribute a maximally normalizing sort algorithm later on.

But feel free to close and tag with wontfix if that is deemed out-of-scope. We've already spent a lot of time in this rabbit hole.

indolering commented 2 years ago

Yeah, let's kill this. Without the FS maintaining the index, we will get weird performance edge cases. It's just the wrong level of abstraction to be doing this on.