dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: string System.IO.Path.ToPosixPortable(string) #83718

Open pkar70 opened 1 year ago

pkar70 commented 1 year ago

Background and motivation

Using standard POSIX portable filenames (ISO 9945, XBD, clause 3.282) can be very useful, especially when files need to be truly portable. For example, why use "błąd" or "łódź" if we can use "blad" or "lodz", and the name is still readable and correctly understood? If I want to create an *.m3u file, or something similar, I have to make the change described above by hand. When only POSIX portable filenames are used for resources accessible by URI, no URI encoding is necessary. And a standard is a standard, so it would be nice to have a simple way to use it.

But filenames can be constructed from other writing systems, e.g. Cyrillic or Greek. So we can first transliterate them to the Latin alphabet (ISO 843, ISO 9, and similar).

I've created such methods in my own library, but I think they would be useful to other people too. I can create a PR with these methods, but first I would like to hear some feedback on the proposal.

API Proposal

namespace System
{
    // Proposed additions to the existing String class (signatures only).
    public sealed partial class String
    {
        public string TransliterateCyrilicToLatin(); // or, maybe, TransliterateFromCyrilic; ISO 9
        public string TransliterateGreekToLatin();   // or, maybe, TransliterateFromGreek; ISO 843
        // ... other similar ISO standards
        public string DropAccents();
        public string ToPOSIXportableCharacters(string replacement = "_"); // ISO 9945 XBD 6.1
    }
}

namespace System.IO
{
    // Path is a static class, so the filename is passed as a parameter.
    public static partial class Path
    {
        public static string ToPOSIXportableFilename(string filename, string replacement = "_"); // ISO 9945 XBD 3.282
    }
}

API Usage


// contents is the text to be written; ToPOSIXportableFilename is the proposed method.
File.WriteAllText(Path.Combine(folderPath, Path.ToPOSIXportableFilename(userProposedFilename)), contents);
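To make the intended behavior concrete, here is a minimal sketch of the character-replacement step, assuming the POSIX portable filename character set is A-Z, a-z, 0-9, period, underscore, and hyphen-minus. The class and method names are hypothetical and are not an existing .NET API. The sketch works on UTF-16 code units, so a character outside the Basic Multilingual Plane yields two replacements.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of .NET.
public static class PortableNameSketch
{
    // True for characters in the POSIX portable filename character set.
    private static bool IsPortableChar(char c) =>
        (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
        (c >= '0' && c <= '9') || c == '.' || c == '_' || c == '-';

    // Replaces every character outside the portable set with the replacement string.
    public static string ToPortableFilename(string name, string replacement = "_")
    {
        if (name is null) throw new ArgumentNullException(nameof(name));

        var sb = new StringBuilder(name.Length);
        foreach (char c in name)
        {
            if (IsPortableChar(c))
                sb.Append(c);
            else
                sb.Append(replacement);
        }
        return sb.ToString();
    }
}
```

With this sketch, `PortableNameSketch.ToPortableFilename("my file;1.txt")` returns `"my_file_1.txt"`. Accented letters are replaced rather than simplified, which is why the proposal pairs this step with accent dropping and transliteration.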

Alternative Designs

No response

Risks

No risk of breaking changes, as these are new methods. The only risk is keeping these methods consistent with changes to ISO standards (revisions of existing standards, or new standards with new transliterations).

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.

Author: pkar70
Assignees: -
Labels: `api-suggestion`, `area-System.IO`, `untriaged`
Milestone: -
MichalPetryka commented 1 year ago

ICU exposes APIs for this, so the main question here is whether it's worth having managed APIs for it, and what their exact shape should be.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-globalization See info in area-owners.md if you want to be subscribed.

Author: pkar70
Assignees: -
Labels: `api-suggestion`, `area-System.Globalization`, `untriaged`
Milestone: -
pkar70 commented 1 year ago

> ICU exposes APIs for this, so the main question here is whether it's worth having managed APIs for it, and what their exact shape should be.

How exactly is it exposed? And what is exposed: transliteration, or creating a POSIX portable filename? I think there are two aspects here: transliteration belongs to System.Globalization, but the POSIX portable filename belongs to System.IO.Path.

MichalPetryka commented 1 year ago

> > ICU exposes APIs for this, so the main question here is whether it's worth having managed APIs for it, and what their exact shape should be.
>
> How exactly is it exposed? And what is exposed: transliteration, or creating a POSIX portable filename? I think there are two aspects here: transliteration belongs to System.Globalization, but the POSIX portable filename belongs to System.IO.Path.

It exposes transliteration, and to create a portable filename you'd need to transliterate the name first.

pkar70 commented 1 year ago

How can I use it in C# code?

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.

Author: pkar70
Assignees: -
Labels: `api-suggestion`, `area-System.IO`, `untriaged`
Milestone: -
MichalPetryka commented 1 year ago

Currently you'd need to P/Invoke into ICU, the native library that .NET uses for handling globalization. You'd need to consult its documentation to see what the exact usage looks like.

tarekgh commented 1 year ago

> Currently you'd need to P/Invoke into ICU, the native library that .NET uses for handling globalization. You'd need to consult its documentation to see what the exact usage looks like.

I am not sure this is a good idea. The native globalization layer is limited to a specific set of ICU APIs. I don't think we use any transliteration APIs.

> "błąd" or "łódź" if we can use "blad" or "lodz", and the name is still readable and correctly understood?

This looks suspicious to me. What happens when you have two files like błąd and blad? How would you handle that?
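One possible way to handle such collisions, shown here only as a sketch and not something proposed in the thread, is for the caller to track the names already in use and add a numeric suffix when a converted name is taken. `MakeUnique` and `usedNames` are hypothetical names introduced for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration only; not part of .NET.
public static class CollisionSketch
{
    // Returns candidate unchanged if it is unused; otherwise inserts "_2", "_3", ...
    // before the extension until an unused name is found. The chosen name is
    // recorded in usedNames.
    public static string MakeUnique(string candidate, ISet<string> usedNames)
    {
        string stem = Path.GetFileNameWithoutExtension(candidate);
        string ext = Path.GetExtension(candidate);
        string result = candidate;

        // ISet<T>.Add returns false when the item is already present.
        for (int i = 2; !usedNames.Add(result); i++)
        {
            result = $"{stem}_{i}{ext}";
        }
        return result;
    }
}
```

If both "błąd.txt" and "blad.txt" convert to "blad.txt", the first call returns "blad.txt" and the second returns "blad_2.txt". Passing a `HashSet<string>` built with `StringComparer.OrdinalIgnoreCase` would also cover case-insensitive file systems.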

MichalPetryka commented 1 year ago

> I am not sure this is a good idea. The native globalization layer is limited to a specific set of ICU APIs. I don't think we use any transliteration APIs.

Well the whole point of the proposal is to expose transliteration too.

Clockwork-Muse commented 1 year ago

... and are we sure that all characters produced by transliteration are valid in POSIX filenames? ICU lists that ~ will be used in some parts of Japanese-to-Latin transliteration, and that character isn't valid on at least z/OS.

MichalPetryka commented 1 year ago

> ... and are we sure that all characters produced by transliteration are valid in POSIX filenames? ICU lists that ~ will be used in some parts of Japanese-to-Latin transliteration, and that character isn't valid on at least z/OS.

Well, I assume that in this case ICU would be used to transliterate to ASCII only, and then any unsupported characters would simply be replaced with the replacement string.

tarekgh commented 1 year ago

> Well the whole point of the proposal is to expose transliteration too.

This makes sense if the request is to expose a managed API. @pkar70 was asking how to work around the issue by writing C# code that calls the transliteration APIs, since you mentioned that ICU supports that.

MichalPetryka commented 1 year ago

> > Well the whole point of the proposal is to expose transliteration too.
>
> This makes sense if the request is to expose a managed API. @pkar70 was asking how to work around the issue by writing C# code that calls the transliteration APIs, since you mentioned that ICU supports that.

>     public string TransliterateCyrilicToLatin(); // or, maybe, TransliterateFromCyrilic, ISO 9
>     public string TransliterateGreekToLatin(); // or, maybe, TransliterateFromGreek, ISO 843
>     // ... other similar ISO standards
>     public string DropAccents();

The proposal does seem to express that intent.

pkar70 commented 1 year ago

I mean something like this: https://github.com/pkar70/MyLibs/blob/19451e4f430303d7573c5eb837293596bf9baf77/Nuget-Extensions/dotnetextensions.vb#L233

(I prefer writing code in VB, so my own library is in VB; but I think it is understandable :) )

I haven't resolved one problem: should transliteration be applied when making a filename compliant with the 'POSIX portable filename character set'? If yes, then it should first 'drop accents', then 'transliterate', and then apply my current implementation of ToPOSIX...
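The first step of that order, dropping accents, can be sketched with Unicode normalization. This is an assumption about how such a method could work, not the linked VB implementation. `AccentSketch.DropAccents` is a hypothetical name. Note that "ł" has no canonical decomposition, so decomposition alone would leave it untouched and an explicit mapping is needed.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration only; not part of .NET.
public static class AccentSketch
{
    public static string DropAccents(string text)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));

        // Canonical decomposition splits e.g. "ą" into "a" plus a combining ogonek.
        string decomposed = text.Normalize(NormalizationForm.FormD);

        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Skip the combining marks produced by the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // "ł" and "Ł" do not decompose, so they are mapped explicitly.
            // A real implementation would need a larger table of such letters.
            sb.Append(c switch { '\u0142' => 'l', '\u0141' => 'L', _ => c });
        }

        // Recompose whatever is left into the usual precomposed form.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Under these assumptions, `AccentSketch.DropAccents("błąd")` gives "blad" and `AccentSketch.DropAccents("łódź")` gives "lodz". The result would then go through transliteration and the final character-set replacement.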

jozkee commented 1 year ago

I think the transliteration APIs are out of IO scope, but the System.IO.Path.ToPOSIXportableFilename proposal seems reasonable. We should consider addressing it together with https://github.com/dotnet/runtime/issues/25010 and https://github.com/dotnet/runtime/issues/25011.

pkar70 commented 1 year ago

I didn't know about 25010 and 25011 :) Some other observations: ";" in filenames is known to separate the filename from the file version (and some CD mastering programs add ";1" after each filename). Spaces in filenames are valid in many file systems, but they make command-line utilities cumbersome to use (Windows requires quotes; POSIX systems such as Linux require escaping the space with a backslash). Using characters from outside the standard leads to many errors with non-Unicode OS calls or non-Unicode terminals (like "telnet", cmd.exe, etc.).

Transliteration is out of IO scope, but using it when converting a filename to a POSIX-compliant one would be beneficial.