API Proposal: Add Path API that validates that path cross-plat safe

JeremyKuhne commented 6 years ago

Essentially this is similar to the Windows Path.GetInvalidFileNameChars().

namespace System.IO
{
    public static class Path
    {
        public static bool IsPortableFileName(ReadonlySpan<char> name);
    }
}

This API will not accept

invalid Windows filename characters (control chars, <, >, :, ", /, \, |)
starting or ending spaces
ending periods .
starting dashes -
normal globbing characters [, ], *, ?

See related dotnet/runtime#25010

jkotas commented 6 years ago

What is the scenario where one would want to use this API?

JeremyKuhne commented 6 years ago

What is the scenario where one would want to use this API?

It is intended to provide a way to validate user input for names.

iSazonov commented 6 years ago

We could use this in PowerShell Core to support writing portable scripts.

jnm2 commented 6 years ago

What if I'm generating the filename automatically? What would I do if this method returned false?

iSazonov commented 6 years ago

@jnm2 This means that your automatic algorithm is wrong. You must ensure that your algorithm creates portable file names.

jnm2 commented 6 years ago

You must ensure that your algorithm creates portable file names.

Suppose the text was coming from a URL, a saved title, or some existing list of files on the file system? I'd have to adapt the names. Looks like the best option would be to loop over the chars and check IsPortableFileName for each char in the string.

iSazonov commented 6 years ago

In the case we could have ConvertToPortableFileName() and/or ConvertToPortablePath() - it is exclude duplicated operations.

jkotas commented 6 years ago

The Posix portable filenames defined in the related issue make sense. It is the truly portable set. I think that the defining .NET-specific concept of portable filenames is doomed to failure.

I happened to run a case study on this over the long weekend by accident: I was showing my son how to build and publish ASP.NET pages. He named his ASP.NET project Webová stránka. It is portable filename according to the definition above, with non-ascii characters and space. He was hitting weird errors, searching for workarounds for them, very frustrating experience. Once we changed the project name to ascii only and no spaces (ie the Posix portable), everything worked like a charm.

iSazonov commented 6 years ago

We can look at this from the other side. Does any OS enable Posix-only-names mode? No. And it is rather said about what exactly posix-only API is doomed to failure. Users will want to use locale names. And it is rather said the success of this API.

In general, I believe .Net team could take a closer look how PowerShell works with paths. Now it has been ported. And it has a lot of problems because of .Net API limitations. We make globbing, do normalization, do parsing, use the rules for converting paths so that they work on all platforms, we are forced to use p/invokes to work with soft/hard links, do enumerations with detecting cycles and so on. We pay dearly for these things consuming too many resources. PowerShell could work many times faster if .Net API would be more advanced.

All that useful for PowerShell will be useful for other applications too.

danmoseley commented 6 years ago

We make globbing, do normalization, do parsing, use the rules for converting paths so that they work on all platforms, we are forced to use p/invokes to work with soft/hard links, do enumerations with detecting cycles and so on. We pay dearly for these things consuming too many resources. PowerShell could work many times faster if .Net API would be more advanced.

@isazonov are you willing to open API proposals for these? Or point to existing ones. It sounds like it would be worth doing given the pain you are having. We have added plenty of new IO API and can add more.

iSazonov commented 6 years ago

I am afraid that my experience (in PowerShell and .Net internals) is not enough so I appeal to MSFT experts in .Net and PowerShell teams. MSFT has no other shell than PowerShell in the development of which MSFT would be interested more. As for me I'll definitely be working on this at least in small steps.

JeremyKuhne commented 6 years ago

@iSazonov any time you run across code in PowerShell (or anywhere for that matter) that you think there might be something we can do to help in Core, don't hesitate to open an issue and tag me. I'm always happy to take a look- having practical examples is super helpful for me to identify issues and prioritize work.

@jkotas I assume the problems were with exposing the names on the web? Or were you hitting filesystem issues?

Perhaps the right answer is to provide some sort of enum options that make it easier to identify what the target is and what it means. Something like:

namespace System.IO
{
    public enum PortabilityTarget
    {
        // Very limited, most portable
        Posix,

        // When you want Unicode and want to save files
        FileSystem,

        // Web specific restraints?
    }

    public static class Path
    {
        public static bool IsPortableFileName(ReadOnlySpan<char> name, PortabilityTarget target);

        // Replaces all invalid chars with underscore
        public static string MakePortableFileName(ReadOnlySpan<char> name, PortabilityTarget target);
    }
}

jkotas commented 6 years ago

I assume the problems were with exposing the names on the web?

The website URL got mangled and registered in the DNS correctly. The app behind the website failed to startup very early with 502 error. I think it was most likely file system access problem. I am not able to tell for sure since I was not able to diagnose it. We have clicked "report a problem" link , so hopefully somebody takes a look at this.

jkotas commented 6 years ago

It would be useful to define what portable filename means in context of this API. The proposal says how the API is going to be implemented, but it does not say what portable filename is.

My definition of portable filename would be "a filename that is not going to hit bugs or limitations in programs out there, across different operating systems".

iSazonov commented 6 years ago

In PowerShell we implement some providers to work with different targets, in particular, two similar provider FileSystem and Registry. Someone could implement other providers like WebDAV or FTP. We discussed normalization problem and we came to the conclusion that each provider must implement its normalization method. In context of the issue we could consider Path more generic then file path. In the case I believe we should consider System.Uri schemes as PortabilityTargets. The only problem with System.Uri that I found when analyzing using this in PowerShell is that System.Uri don't support \\.\ (dosdevice namespace) and \\?\ formats - why? - while in PowerShell we have to parse the schemes. And possibly second problem is that System.Uri is not extendable for new schemas.

AnthonyMastrean commented 3 years ago

We are extremely interested in this API proposal since we're creating files on a Linux application server which may be downloaded on all client platforms (Windows, macOS, Linux, etc.) via a website or HTTP API. I would like to see an optimized "make portable file name" API, as well.

danmoseley commented 3 years ago

@AnthonyMastrean thanks for datapoint. how would you define portable here -- given OS may have one of many file systems -- would it be the strict intersection of all?

If the implementation was essentially requiring something like [A-Za-z_!#,....] and a few other symbols (ie., it was a conservative list) would that meet your needs, or is this customer input and they would expect eg non ASCII chars?

Clockwork-Muse commented 3 years ago

or is this customer input and they would expect eg non ASCII chars?

To make this point more clear, remember that various East Asian countries are quite attached to their non-ASCII character sets, given that's what they use in everyday life. You might have a /home/わたし/ ("watashi" -> I/me) directory, for example...

danmoseley commented 3 years ago

@Clockwork-Muse and understandably so. I guess I was wondering whether the intersection of what every file system and relevant OS supports might be that small, and if so would it even be useful (eg., if the name was internal implementation details rather than arbitrary input).

I guess there's too much speculation wrapped up together there.

iSazonov commented 3 years ago

There is io.path.InvalidPathChars for every platform - if we combine all them we could get a rule enough to get cross-platform paths.

AnthonyMastrean commented 3 years ago

@danmoseley We're thinking of the strictest intersection... We provide single file downloads that must have portable file names. We also provide multiple files in a ZIP archive download. So, the ZIP archive file name and the archive entry names must be portable.

Some archive clients are "smarter" than others and may automatically fix file names for the target OS/filesystem, but we need to support the dumbest clients. We are avoiding any folder structures inside the ZIP archive, for now, to delay handling portable file paths.

The file names include parts of the domain model that are sourced from external systems that may be created by user input. The application domain is networking and network devices. Some examples of domain model properties that would appear in file names are:

IP addresses will be used in single file and ZIP archive file names. IP address without any CIDR notation (like /32) is safe in the strictest interpretation. The IPAddress type's ToString is already safe to use here.
```
var fileName = $"{hostAddress}.zip";
```
Logical system names may be used in single file and ZIP archive entry names. The naming spec for these differs by vendor and even OS version. Organizations may apply their own naming scheme on top of this. Both the spec and scheme are likely to be less restrictive than the strictest file name interpretation.
```
foreach (var logicalSystem in logicalSystems)
{
    var zipEntryName = $"{logicalSystem}.txt";
    ...
}
```

We're fine fixing file names by replacing disallowed characters with, say, an underscore _.

MakePortableFileName($"{logicalSystem}.txt");

Clockwork-Muse commented 3 years ago

Both the spec and scheme are likely to be less restrictive than the strictest file name interpretation.

... I can guarantee you there's at least one customer with non-ASCII characters in a system name somewhere.

We're fine fixing file names by replacing disallowed characters with, say, an underscore _.

You are checking for name collisions due to this, right? Especially if you're replacing more than just a few oddball characters.

AnthonyMastrean commented 3 years ago

@Clockwork-Muse

I can guarantee you there's at least one customer with non-ASCII characters in a system name somewhere.

Yeah, that's why we want to fix these names when included in cross-platform file names. We understand that the portable OS/filesystem specification is likely to be the most restrictive specification.

You are checking for name collisions due to this, right?

Nope, but that's a good point! 😬 (maybe we will leverage the archive client support to handle duplicates in this case)

Clockwork-Muse commented 3 years ago

(maybe we will leverage the archive client support to handle duplicates in this case)

That very well may be too late - the collisions I'm referring to are likely to occur when writing the zipfile itself (assuming that an identical zip entry overwrites the other - I'm not sure about in C#, but I believe this occurs in several zip libraries).

The other problem is that even if there aren't actual collisions, replacement of too many "reasonable" characters is likely to result in indistinguishable/mysterious filenames. You may need to require your consumers to include safe names that can be used, if those are the constraints you're operating under.

AnthonyMastrean commented 3 years ago

@Clockwork-Muse The .NET ZipArchive doesn't mind entries with duplicate names. The general point stands, though... it's difficult to fully specify a "MakePortableFileName" method, especially when you consider creating multiple portable output names from an input list.

I'm willing to compromise in my specific application. If the user input (however indirect) differs only on "restricted" characters in the same positions, we'll create mysterious duplicates and pray the client software offers features to at least sequence the files and not overwrite (or offers the user a prompt to choose between the two options)

We could pre- or postprocess the entry name list and provide sufficient warnings that this situation is occurring and offer domain-specific advice for correcting the situation.

Model:

10.0.0.1
├───logical-system-name!
└───logical-system-name+

Portable archive:

10.0.0.1.zip
├───logical-system-name_.txt
└───logical-system-name_.txt

Filesystem after extraction:

10.0.0.1/
├───logical-system-name_.txt
└───logical-system-name_ (1).txt

dotnet / runtime

API Proposal: Add Path API that validates that path cross-plat safe #25011