WebAssembly / wasi-filesystem

Filesystem API for WASI

case-sensitivity of filesystem apis #5

Open programmerjake opened 5 years ago

programmerjake commented 5 years ago

I think it would be a good idea to standardise on whether the filesystem APIs should be case-sensitive or not, and to what extent implementations should enforce case-sensitive or case-insensitive filesystem access on common platforms such as Linux or Windows.

See https://github.com/CraneStation/wasmtime/pull/235#issuecomment-518897502 for some ideas.

tschneidereit commented 5 years ago

As mentioned in the linked Wasmtime PR, I don't think we can require either case-sensitivity or case-insensitivity, unfortunately.

@programmerjake mentioned a potential way to at least help with portability issues (see also the subsequent discussion). It seems like that might be viable, but would still incur some overhead. It also seems like it's more of a lint, and should perhaps be something an implementation can warn about, if it chooses to.

sunfishcode commented 5 years ago

I agree with @tschneidereit. The goal of the filesystem APIs is to enable access to filesystems, so we have to work within what they give us. So we can't depend on case sensitivity, or insensitivity, or even a specific version of Unicode's case tables being used.

The technique @programmerjake mentioned here is a good idea to help mitigate the portability issues. Another thing we can do is encourage engines to have developer modes that occasionally scan for files that differ only in case.
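A minimal sketch of what such a developer-mode scan could look like, in Python. The names `find_case_collisions` and `scan_tree` are hypothetical, and `str.casefold()` is only a stand-in for whatever case table a given host filesystem really uses, which may differ.

```python
import os
from collections import defaultdict

def find_case_collisions(names):
    """Group names that differ only by case, judged by str.casefold()."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Only groups with more than one spelling are a portability hazard.
    return [sorted(group) for group in groups.values() if len(group) > 1]

def scan_tree(root):
    """Yield (directory, colliding_names) for each directory under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        for group in find_case_collisions(dirnames + filenames):
            yield dirpath, group
```

Such collisions can only exist on a case-sensitive host, so the scan is mainly useful there, as a warning that the tree will not copy cleanly to a case-insensitive one.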

And as an aside, another possibility, which we could do independently of the above, is to create a concept of WASI-specific directories, which would be directories where we allow ourselves to assume that only WASI programs will access the files. That would allow us to mangle names to avoid case sensitivity, and potentially implement other custom filesystem semantics.
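One possible mangling scheme for such WASI-specific directories, as a rough sketch only. The scheme itself (lowercase ASCII kept as-is, everything else percent-escaped with lowercase hex) and the names `mangle` and `unmangle` are assumptions for illustration, not something proposed in this thread.

```python
# Characters stored unchanged on the host; everything else is escaped.
SAFE = set("abcdefghijklmnopqrstuvwxyz0123456789._-")

def mangle(name: str) -> str:
    """Map a guest name to a host name containing no case-variant characters."""
    out = []
    for ch in name:
        if ch in SAFE:
            out.append(ch)
        else:
            # Lowercase hex keeps the escaped form free of uppercase letters.
            out.extend(f"%{byte:02x}" for byte in ch.encode("utf-8"))
    return "".join(out)

def unmangle(mangled: str) -> str:
    """Invert mangle()."""
    raw = bytearray()
    i = 0
    while i < len(mangled):
        if mangled[i] == "%":
            raw.append(int(mangled[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(mangled[i]))
            i += 1
    return raw.decode("utf-8")
```

Because the mapping is injective and its output never contains an uppercase letter, `README.md` and `readme.md` get distinct host names even on a case-insensitive host. The sketch ignores length limits and special names such as `.` and `..`, and the resulting files are unreadable to non-WASI tools, which is exactly the trade-off described above.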

programmerjake commented 5 years ago

I agree with @tschneidereit. The goal of the filesystem APIs is to enable access to filesystems, so we have to work within what they give us. So we can't depend on case sensitivity, or insensitivity, or even a specific version of Unicode's case tables being used.

I'm not sure, but I think NTFS is case insensitive for just ASCII characters and it is case sensitive for non-ASCII characters.

Serentty commented 5 years ago

@programmerjake It's unfortunately a lot more complicated than that. From what I understand from my friends way more knowledgeable about Windows than I am, the way Windows handles case mapping is that it establishes a set of case mappings when you format a drive, based on what is current in Unicode at the time (or at least what Microsoft has gotten around to telling Windows about). So, take this with a grain of salt because this is second-hand information, but I think that two different hard drives might have different case mapping rules depending on when they were formatted, and two files with names considered equivalent on one may not be considered equivalent on the other. To make matters worse, for legacy reasons, Windows is only case insensitive for characters within the BMP; characters outside of it are never treated as case insensitive.

bjorn3 commented 5 years ago

What about only allowing a file to be created when there is no other file whose name differs only in case? That would prevent creating files that differ only in case, which cannot coexist on case-insensitive systems, while still being able to open differently cased files on case-sensitive systems.
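A sketch of that create-time check, assuming `str.casefold()` as the comparison and a hypothetical helper name `create_if_no_case_variant`. Opening existing files is left untouched.

```python
import os

def create_if_no_case_variant(directory, name):
    """Create `name` unless a differently cased spelling already exists."""
    wanted = name.casefold()
    for existing in os.listdir(directory):
        if existing != name and existing.casefold() == wanted:
            raise FileExistsError(
                f"{existing!r} already exists and differs from {name!r} only in case"
            )
    path = os.path.join(directory, name)
    # Append mode creates the file if missing without truncating an exact match.
    open(path, "a").close()
    return path
```

The check and the create are two separate steps, so another process can still slip a conflicting file in between them, and each create costs a full directory listing.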

indolering commented 4 years ago

I've done a deep dive on this and I think case-insensitivity would be the right path forward. Should I just write a proposal on this ticket or would y'all prefer an RFC?

sunfishcode commented 4 years ago

The difficulty we perceive with plain case-insensitivity is that it's inefficient to emulate on case-sensitive hosts. An implementation on a case-sensitive host would have to scan the parent directory on every access to find case-insensitive matches. So for example, creating N files within a directory would take O(N^2) time. Or they could try every possible case variation of a name, which is O(2^N) with the length of filenames. If anyone knows of a way to avoid this, we'd love to hear it :-).
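To make the two costs concrete, here is a small sketch of both emulation strategies. The function names are hypothetical, and Python's `casefold()`, `lower()` and `upper()` stand in for the host's real case rules.

```python
import itertools
import os

def lookup_by_scan(directory, name):
    """Strategy 1: list the parent directory on every access, O(entries) per call."""
    wanted = name.casefold()
    for entry in os.listdir(directory):
        if entry.casefold() == wanted:
            return entry
    return None

def case_variants(name):
    """Strategy 2: yield every upper/lower spelling of `name` to try in turn."""
    choices = [sorted({ch.lower(), ch.upper()}) for ch in name]
    for combo in itertools.product(*choices):
        yield "".join(combo)

# Eight cased letters in "readme.md" give 2**8 = 256 candidate names.
variant_count = sum(1 for _ in case_variants("readme.md"))
```

Creating N files with the first strategy means N listings of a directory that grows to N entries, hence the quadratic total. The second strategy doubles with every cased letter in the name.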

I've been imagining that the main WASI filesystem APIs will end up just saying that directories may or may not be case-insensitive, that case-insensitivity may be interpreted with arbitrary Unicode versions and/or optionally applied just to the ASCII or BMP subsets, and they may or may not apply any version of NFC or NFD normalization they choose. This unfortunately passes a lot of complexity up into applications, so it'd be great if anyone has better alternatives.
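Under that model an application that cares has to find out for itself. A minimal probe might look like the following sketch. The helper name `is_case_insensitive` is hypothetical, and the probe only tests ASCII letters, so it says nothing about Unicode version, BMP-only folding, or normalization.

```python
import os
import tempfile

def is_case_insensitive(directory):
    """Best-effort probe: does a case-swapped name reach the same file?"""
    with tempfile.NamedTemporaryFile(prefix="CaseProbe", dir=directory) as probe:
        head, tail = os.path.split(probe.name)
        # "CaseProbe..." becomes "cASEpROBE...", a different spelling of the same name.
        swapped = os.path.join(head, tail.swapcase())
        return os.path.exists(swapped)
```

The answer is per directory, not per system, since a single host can mix case-sensitive and case-insensitive mounts or directories.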

We might be able to do some things to help such as this, perhaps for normalization as well as case, which should hopefully help with portability, though that won't fix all the problems.

indolering commented 4 years ago

Or they could try every possible case variation of a name, which is O(2^N) with the length of filenames.

No, you just do a comparison of the casefolded filename. So it would be O(log files) on casefold-aware filesystems (because they store the casefolded filename in the b-tree) or O(files) otherwise (where the limiting factor will be listing the files of a directory and then performing a save operation).
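A sketch of the casefold-comparison approach on a host without native support, using an in-memory index so that only the first access pays for the directory listing. `CasefoldIndex` is a hypothetical name, and the cache is only valid under the assumption that nothing else modifies the directory behind its back.

```python
import os

class CasefoldIndex:
    """Map case-folded names to their on-disk spelling for one directory."""

    def __init__(self, directory):
        self.directory = directory
        # One O(entries) scan up front; later lookups are dictionary hits.
        self.by_folded = {entry.casefold(): entry for entry in os.listdir(directory)}

    def resolve(self, name):
        """Return the on-disk spelling matching `name` caselessly, or None."""
        return self.by_folded.get(name.casefold())

    def create(self, name):
        """Create an empty file unless a caseless match is already indexed."""
        folded = name.casefold()
        if folded in self.by_folded:
            raise FileExistsError(self.by_folded[folded])
        with open(os.path.join(self.directory, name), "x"):
            pass
        self.by_folded[folded] = name
```

If other programs can touch the same directory, the index goes stale and every access is back to a fresh listing, which is the O(files) case.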

I'm going to write up my thoughts and post in a minute : )

indolering commented 4 years ago

So hi! I'm indolering, a security focused usability engineer. There are a lot of points to address here, so I’ll leave a summary below and link to the full explainer.

Case-insensitivity is required by end users: Windows, OS X, and Android all enforce some level of case-insensitivity - even Linux supports it! They are all slightly different but, as Ted Ts'o put it, “the world is converging enough that the latest versions of Mac OS X’s APFS and Windows NTFS behave pretty much the same way.”

The feared complexity of case-insensitivity is unwarranted, as caseless matching is a pure function mapping from single codepoints to case-folded variant(s). This behavior is immutable for any assigned codepoint: any non-determinism caused by outdated Unicode tables can be caught at runtime, but would be vanishingly rare in practice.
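A short illustration of caseless matching as a pure function, using Python's built-in `str.casefold()` (which applies Unicode full case folding from whatever Unicode version the interpreter ships). The helpers `caseless_equal` and `has_unassigned` are hypothetical names, and the unassigned-code-point check is one assumed way to catch the outdated-table case at runtime.

```python
import unicodedata

def caseless_equal(a, b):
    """Compare two names by their case-folded forms."""
    return a.casefold() == b.casefold()

def has_unassigned(name):
    """True if any code point is unassigned (category Cn) in this table version."""
    return any(unicodedata.category(ch) == "Cn" for ch in name)

# Folding is more than lowercasing: the German sharp s folds to "ss".
sharp_s_matches = caseless_equal("Straße", "STRASSE")

# The table version this interpreter was built with.
table_version = unicodedata.unidata_version
```

A runtime that sees `has_unassigned(name)` return true knows its tables may be too old to fold that name the same way a newer host would, and can refuse or warn instead of guessing.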

The linked proposal for a versionless Unicode case-folding is modeled on Rust's overflow handling: allow for precisely defined implementation-dependent behavior that can be deterministic, but mostly “just works” when backed by commodity filesystems.

It requires additional complexity, but the payoff is happy developers and users.

indolering commented 4 years ago

Updates based on chat:

If we can't assume WASI-only filesystem access, then we should just fall back to whatever the filesystem does. Linux now supports case-insensitivity on a per-directory basis, so case-insensitive matching will be fast and simple most of the time.

programmerjake commented 4 years ago

oops... closed by mistake

Serentty commented 3 years ago

Case-insensitivity is required by end users: Windows, OS X, and Android all enforce some level of case-insensitivity - even Linux supports it!

The fact that so many systems do it doesn’t seem like good evidence of that to me. End users who would be confused by case sensitivity are mostly interacting with dialogue boxes which can present a case-insensitive search anyway, which most GUIs on Linux do. It doesn’t seem to me like this is a good argument for the filesystem itself to need this.

They are all slightly different but, as Ted Ts'o put it, “the world is converging enough that the latest versions of Mac OS X’s APFS and Windows NTFS behave pretty much the same way.” The feared complexity of case-insensitivity is unwarranted, as caseless matching is a pure function mapping from single codepoints to case-folded variant(s). This behavior is immutable for any assigned codepoint: any non-determinism caused by outdated Unicode tables can be caught at runtime, but would be vanishingly rare in practice.

The complexity here isn’t in matching the Unicode standard—it’s in being deterministic across different platforms. “Pretty much the same way” isn’t deterministic, and can be used for fingerprinting.

Distros that care about usability will eventually adopt case-insensitivity, even if it is just for the home directories.

I sincerely doubt this. I have seen no indication of interest in this from any of them.

indolering commented 3 years ago

Case-insensitivity is required by end users: Windows, OS X, and Android all enforce some level of case-insensitivity - even Linux supports it!

The fact that so many systems do it doesn’t seem like good evidence of that to me.

Programmers almost universally dislike this idea (and Unicode generally) but ... I'm struggling to find a kind way to respond without just flashing my UX credentials and saying "trust me" 😟. Apple and Windows built their environments based on user testing whereas Linux had to inherit the BoB (bag-of-bytes) model because the only thing that existed was ASCII.

At a minimum, the VAST majority of end-users will be on a system that forces case insensitivity natively. I think that's pretty good evidence of what user expectations will be, or at least they will get upset if they can't move files between the two systems.

End users who would be confused by case sensitivity are mostly interacting with dialogue boxes which can present a case-insensitive search anyway, which most GUIs on Linux do.

And it's a nightmare for developers. I don't have time to dig up the links right now and my memory might be incorrect on the details, but Apple didn't implement case insensitivity on the early iOS file-system (because APFS was first used on the iPod). So when things were opened up, developers were expected to enforce case insensitive matching manually. But most developers don't know this and eventually run into situations in which users can't back up files to iCloud or another computer and even experience data loss.

It doesn’t seem to me like this is a good argument for the filesystem itself to need this.

It can't be implemented anywhere else: Unixes have tried to do this at some other abstraction level for decades and it always blows up (read the comment section too). The filesystem is a namespace and if you don't enforce it at the namespace level you will run into collisions. That's why Linux introduced case insensitive matching over the objections of Linus himself.

The complexity here isn’t in matching the Unicode standard—it’s in being deterministic across different platforms. “Pretty much the same way” isn’t deterministic, and can be used for fingerprinting.

If you want immutable determinism, then you are stuck with UTF-8 and Korean users on a Mac won't be able to type the same filename as Korean users on a PC.

An attacker can fingerprint a machine based on what version of WASM it is running. I have an idea for an Evergreen Unicode proposal that would eliminate Unicode version as a side channel as long as the system has been updated in the past ~year. However, I'm going through medical treatments right now and had to drop that effort, but it could be adopted at any time in the future even if it isn't adopted by Unicode proper.

Distros that care about usability will eventually adopt case-insensitivity, even if it is just for the home directories.

I sincerely doubt this. I have seen no indication of interest in this from any of them.

You might be right in that there is too much legacy stuff on Linux to support this as migrating to a case insensitive scheme could risk data loss. That's partly why I'm advocating a maximally conservative approach from the outset with a per-directory on/off switch for compatibility.

And if we are talking about some hypothetical WASM user environment running on an SeL4 microkernel (or whatever) ... I'm a usability engineer and I'm advocating for what I believe is the best user experience. Unfortunately, that usually requires altering the way the backend is built, which is why it's best to bring us in early before bad decisions are baked in.

Serentty commented 3 years ago

@indolering I’m very familiar with the “everything should be ASCII” mindset and it’s something I try to fight against when I see it, although I don’t personally find “things should be case sensitive” to be in the same spirit—although depending on what angle you’re coming at it from, I can see how other people might see it that way.

You might be right in that there is too much legacy stuff on Linux to support this as migrating to a case insensitive scheme could risk data loss.

This is definitely what I think. You mentioned home directories as the most likely candidate for receiving case insensitivity, but given that’s where people tend to store stuff like code projects, and lots of those won’t even compile on a case insensitive filesystem, I really can’t see it happening on Linux. I think things like normalizing filenames will be much easier to achieve (and I think it should be done). Apple has done a very nice job with that.

And if we are talking about some hypothetical WASM user environment running on an SeL4 microkernel (or whatever) ...

Well, to be honest, the level of sandboxing and determinism that WASI seems to want seems unrealistic to me for all but the simplest programs. I’m arguing that the case insensitivity that you want couldn’t be done deterministically not because I think determinism is so important, but because of the extreme determinism that this project seems to want. I actually think that this issue is impossible to solve unless they make their own virtual filesystem on a disk image or something.

indolering commented 3 years ago

@indolering I’m very familiar with the “everything should be ASCII” mindset and it’s something I try to fight against when I see it

It's actually worse than that: the reason Unix used BoB for paths is that they had to support code pages. So the meaning of the bytes would change depending on user environmental settings. NTFS and HFS had the advantage of being designed after Unicode existed, but it's hopeless to try and get the Unixes to evolve beyond BoB.

although I don’t personally find “things should be case sensitive” to be in the same spirit—although depending on what angle you’re coming at it from, I can see how other people might see it that way.

Some 99% of the general public store their files in a FS that is case insensitive ... don't you want to interoperate with those consumers? Linux developers already have to enforce case insensitivity informally on any project that they want Mac and Windows developers to be able to work on....

And also ... we aren't talking variable names here. Who actually uses case differences to make meaningful and intelligent demarcations in files? Is README.md supposed to contain just the important stuff while readme.md is the extended version? Would anyone ever sort files based on the capitalization of the parent directory?

No one has ever given me a good end-user scenario for why filenames shouldn't be case insensitive, only that it violates their mental model of what a string is programmatically.

I’m arguing that the case insensitivity that you want couldn’t be done deterministically not because I think determinism is so important, but because of the extreme determinism that this project seems to want.

I should just write up an RFC, because there are other design considerations beyond those raised in this discussion. We are basically trying to ensure that the semantics of path name resolution are portable and safe across WASI runtimes. A WASI module shouldn't need to worry about how Mac and Windows differ in their input handling or clever tricks hackers can use to confuse path resolution on different platforms.

At a minimum, that means we have to normalize to some NFxy form, which also means we can only be deterministic for already-assigned Unicode code points. Many more newly introduced codepoints change under NFxy normalization than under toCasefold. We are basically talking about First Nations alphabets that are just being created or archaic alphabets that are being added as part of cultural preservation efforts.
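The Korean example can be made concrete with Python's `unicodedata.normalize`. The helper `same_name` is a hypothetical name, and comparing NFC forms is a simplification of full canonical caseless matching.

```python
import unicodedata

def same_name(a, b):
    """Treat two names as equal if their NFC forms match."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# "한글" as precomposed syllables (NFC) and as conjoining jamo (NFD).
composed = unicodedata.normalize("NFC", "한글")
decomposed = unicodedata.normalize("NFD", "한글")

# Same text on screen, different code points and different bytes on disk.
composed_bytes = composed.encode("utf-8")      # 2 code points, 6 bytes
decomposed_bytes = decomposed.encode("utf-8")  # 6 code points, 18 bytes
```

A byte-for-byte filesystem treats these as two unrelated filenames, so a name typed through an input path that produces one form will not open a file stored under the other unless something normalizes along the way.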

I actually think that this issue is impossible to solve unless they make their own virtual filesystem on a disk image or something.

I agree to an extent, but I'm short on time! I assure you, however, that this was considered and has little to do with case sensitivity.

sunfishcode commented 3 years ago

I'm expecting that one of the most important use cases for wasi-filesystem is giving users access to their existing files.

Accessing existing files on existing filesystems in a portable way is a complex problem. I've done a lot of research into filesystem behavior in an attempt to see if we could hide platform-specific path differences such as Unicode normalization, Unicode version, non-Unicode paths, case sensitivity, path length limits, path component length limits, number-of-component limits, supported characters, special names like "NUL" on Windows, trailing whitespace, trailing slashes (where the behavior differs even just between Unix platforms), whether ".." is resolved before or during the filesystem traversal, whether or not repeated slashes are coalesced before or after path length limits are enforced, and more, that are different between platforms. I've also looked into semantics, such as what happens if you rename or delete a file while it's opened, if you seek past the end of a file, if you rename a file over a directory, if you call fsync and care about what it actually does, if you write to a file and there's not enough storage space left, if you try to read a directory as a file, and a long list of other things that are different between filesystem implementations. It would take a small novel to describe how symlinks actually work. And then there's the whole universe of error-code differences.

There are things we can do, and some things we should do. But even if we do all the things, we're not going to be able to fix all of the problems, without being very slow, lossy, unreliable, or perhaps all of the above, when accessing existing files on existing filesystems.

There are both case-sensitive and case-insensitive filesystems in wide use, on systems that people want to run WASI programs on. Existing portability abstractions (language standard libraries, VMs, etc.) that I'm aware of don't hide these sensitivity differences, so existing portable applications are already expected to gracefully handle these differences.

There are use cases that want a high level of determinism or portability; however, I expect these applications will be better served by our focusing on other kinds of APIs anyway, such as database APIs, rather than trying to make filesystem APIs do everything.

As such, I propose that wasi-filesystem simply exposes the platform case sensitivity differences to applications as-is.

Serentty commented 3 years ago

I think that this is the most realistic option, really. I don’t think it can be truly deterministic across platforms. I think it could be useful to implement a flag to ask for case sensitivity or not when opening a file (with the default being whatever is “native”), but on a best-effort basis to ease porting in common cases, as opposed to matching the semantics of any platform, or providing completely deterministic results.
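A best-effort version of such a flag could be sketched like this. `CaseMode` and `resolve_name` are hypothetical names, not part of any WASI API, and the comparison uses `str.casefold()` rather than any particular platform's rules.

```python
import os
from enum import Enum

class CaseMode(Enum):
    NATIVE = "native"            # default: let the host filesystem decide
    SENSITIVE = "sensitive"      # require the exact on-disk spelling
    INSENSITIVE = "insensitive"  # accept any spelling that folds the same

def resolve_name(directory, name, mode=CaseMode.NATIVE):
    """Return the on-disk name to open, or raise FileNotFoundError."""
    if mode is CaseMode.NATIVE:
        return name  # no extra work; the host applies its own rules on open
    entries = os.listdir(directory)
    if name in entries:
        return name  # an exact spelling satisfies both remaining modes
    if mode is CaseMode.INSENSITIVE:
        wanted = name.casefold()
        for entry in entries:
            if entry.casefold() == wanted:
                return entry
    raise FileNotFoundError(name)
```

Only the two non-default modes pay for a directory listing. If several entries fold to the same name, the insensitive mode returns whichever the listing yields first, which is the kind of non-determinism a best-effort flag has to accept.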

indolering commented 2 years ago

IIRC my work on this was in response to a suggestion of erroring if the filename returned wasn't an exact match. As long as we aren't going to do that, I think the option to try and smooth over platform differences should be left to those willing to do the work. Indeed, catching Unicode filename security fuckery is best left to a filesystem monitor, not the runtime.

It would be nice if case insensitive lookups were implemented, but medical issues have prevented me from contributing actual code here. Are there any blockers if someone showed up and wanted to add that feature? I think it would just be ensuring that FS lookup commands can be parameterized to allow non-default behavior?

sunfishcode commented 2 years ago

I think these kinds of changes are best left to individual implementations, as it's too complex to try to specify what the behavior should be at the spec level here. If an implementation is able to have a mode where it warns or logs or something if there's a case conflict between two paths, or a case difference causing a path to not match, that may be useful to users, but it's difficult to see how we might standardize such things here.
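As an illustration of the second kind of diagnostic, an implementation might run something like the following after a lookup fails. The logger name and the helper `diagnose_missing` are hypothetical, and this is a sketch of one possible implementation-level lint, not proposed spec behavior.

```python
import logging
import os

log = logging.getLogger("wasi_fs_lint")

def diagnose_missing(directory, name):
    """After a failed lookup, report entries that differ from `name` only in case."""
    wanted = name.casefold()
    near_misses = [
        entry for entry in os.listdir(directory)
        if entry != name and entry.casefold() == wanted
    ]
    for entry in near_misses:
        log.warning(
            "%r not found in %s, but %r differs only in case",
            name, directory, entry,
        )
    return near_misses
```

Because it only runs on the failure path, the extra directory listing does not slow down successful opens.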

indolering commented 2 years ago

I think these kinds of changes are best left to individual implementations, as it's too complex to try to specify what the behavior should be at the spec level here.

I somewhat disagree, in that I think there is a right way to do things at this level of abstraction and it would be fairly easy to eliminate problems encountered by up to 90% of the population (Korean between Linux <-> Windows).

But I also don't think it's worth diverting resources from other features that could make WASM more competitive in other areas. IMHO it's also not an area WASM should expect to get right the first time: programmers-in-a-hurry at Apple, Microsoft, and Linux all tried to get this right and failed. Without on-boarding significantly more i18n expertise, it's probably best for this to get hashed out at Unicode, the W3C, or the IETF.

If I had had time to work on an implementation, I might be arguing differently. But until someone comes along with a working implementation that we can run past people at the IETF and Unicode, I vote to just do whatever the underlying filesystem dictates.

sunfishcode commented 2 years ago

I unfortunately don't think the IETF or Unicode will be able to help this particular issue for the foreseeable future. The big challenge for wasi-filesystem is that we want users to run programs on their existing files. Existing filesystems in the wild already have all kinds of normalization and case-sensitivity rules, and that won't change even if the standards bodies come out with new guidance on the right way to implement filesystems.