Open pkar70 opened 1 year ago
Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.
Author: | pkar70 |
---|---|
Assignees: | - |
Labels: | `api-suggestion`, `area-System.IO`, `untriaged` |
Milestone: | - |
ICU exposes APIs for this, so the main question here is whether it's worth having managed APIs for it, and what their exact shape should be.
Tagging subscribers to this area: @dotnet/area-system-globalization See info in area-owners.md if you want to be subscribed.
Author: | pkar70 |
---|---|
Assignees: | - |
Labels: | `api-suggestion`, `area-System.Globalization`, `untriaged` |
Milestone: | - |
How exactly is it exposed? And what does it expose: transliteration, or creating a POSIX portable filename? I think we have two aspects here: transliteration belongs in System.Globalization, but a POSIX portable filename belongs in System.IO.Path.
It exposes transliteration and for creating a portable filename, you'd need to transliterate it first.
How can I use it in C# code?
Currently you'd need to P/Invoke into ICU, the native library that .NET uses for globalization. You'd need to consult its documentation for the exact usage.
I am not sure if this is a good idea. Native globalization layer is limited to a specific set of ICU APIs. I don't think we used any transliteration APIs.
E.g. why use "błąd" or "łódź" if we can use "blad" or "lodz", and everything is still readable and can be understood correctly?
This looks suspicious to me. What happens when you have two files like `błąd` and `blad`? How can you handle that?
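One possible way to handle such a collision, offered only as a sketch and not as something proposed in this thread, is to track the names already produced and append a numeric suffix when a converted name is taken. `CollisionSketch` and `MakeUnique` are hypothetical names, and the `_2`, `_3` suffix scheme is an assumption.

```csharp
using System.Collections.Generic;
using System.IO;

public static class CollisionSketch
{
    // Hypothetical helper: returns the candidate if it is unused, otherwise
    // the first free "stem_N.ext" variant. The chosen name is recorded in 'taken'.
    public static string MakeUnique(string candidate, ISet<string> taken)
    {
        if (taken.Add(candidate))
            return candidate;

        string stem = Path.GetFileNameWithoutExtension(candidate);
        string ext = Path.GetExtension(candidate);

        for (int i = 2; ; i++)
        {
            // Underscore and digits are both in the portable character set.
            string next = $"{stem}_{i}{ext}";
            if (taken.Add(next))
                return next;
        }
    }
}
```

With an empty set, the first `blad.txt` is returned unchanged and a second request for `blad.txt` yields `blad_2.txt`. Whether the comparison should be case-sensitive depends on the target file system, so the caller chooses the comparer of the set.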
Well the whole point of the proposal is to expose transliteration too.
.... and, are we sure that all transliteration characters are valid for POSIX filenames? ICU lists that `~` will be used for some parts of transliteration for Japanese->Latin alphabets, which isn't valid for at least z/OS.
Well I assume that in this case ICU would be used to transliterate to ASCII only and then unsupported characters would just be replaced with `replacement`.
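The replacement step described above could look roughly like the following. This is a minimal sketch, not the proposed API: `PortableFileNameSketch` and `ReplaceNonPortable` are hypothetical names, and the default `'_'` replacement is an assumption. It keeps only the POSIX portable filename character set (A-Z, a-z, 0-9, period, underscore, hyphen).

```csharp
using System;
using System.Text;

public static class PortableFileNameSketch
{
    // POSIX portable filename character set: A-Z, a-z, 0-9, '.', '_', '-'.
    private static bool IsPortable(char c) =>
        (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
        (c >= '0' && c <= '9') || c == '.' || c == '_' || c == '-';

    // Hypothetical helper: replaces every UTF-16 code unit outside the
    // portable set, so a surrogate pair becomes two replacement characters.
    public static string ReplaceNonPortable(string name, char replacement = '_')
    {
        if (name is null)
            throw new ArgumentNullException(nameof(name));
        if (!IsPortable(replacement))
            throw new ArgumentException("Replacement must itself be portable.", nameof(replacement));

        var sb = new StringBuilder(name.Length);
        foreach (char c in name)
            sb.Append(IsPortable(c) ? c : replacement);
        return sb.ToString();
    }
}
```

For example, `ReplaceNonPortable("a~b")` gives `a_b`, which covers the `~` case raised for Japanese->Latin transliteration.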
Well the whole point of the proposal is to expose transliteration too.
This makes sense if the request is to expose a managed API. @pkar70 was asking how to work around the issue by writing C# code that calls the transliteration APIs, after you mentioned that ICU supports it.
public string TransliterateCyrilicToLatin(); // or, maybe, TransliterateFromCyrilic, ISO 9
public string TransliterateGreekToLatin(); // or, maybe, TransliterateFromGreek, ISO 843
// ... other similar ISO standards
public string DropAccents();
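As an illustration of what `DropAccents` might do, here is a minimal sketch that uses only existing .NET APIs (`string.Normalize` and `CharUnicodeInfo.GetUnicodeCategory`). It is an assumption about the intended behavior, not the author's implementation. Letters such as "ł" have no canonical decomposition, so the sketch maps them explicitly; the `AccentSketch` class name and that small extra mapping are hypothetical.

```csharp
using System.Globalization;
using System.Text;

public static class AccentSketch
{
    public static string DropAccents(string text)
    {
        // Canonical decomposition splits "ą" into "a" + U+0328 (combining ogonek).
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Skip the combining marks produced by the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // "ł" / "Ł" do not decompose, so they need an explicit mapping.
            sb.Append(c switch
            {
                'ł' => 'l',
                'Ł' => 'L',
                _ => c
            });
        }

        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this sketch, "błąd" becomes "blad" and "łódź" becomes "lodz". Other non-decomposing letters (for example "ø" or "đ") would need similar entries.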
The proposal does seem to express that intent.
I mean something like that: https://github.com/pkar70/MyLibs/blob/19451e4f430303d7573c5eb837293596bf9baf77/Nuget-Extensions/dotnetextensions.vb#L233
(I prefer writing code in VB, so my own library is in VB; but I think it is understandable :) )
I haven't resolved one problem: should transliteration be called while making a filename 'POSIX portable filename character set' compliant? If yes, then it should first 'drop accents', then 'transliterate', and then apply my current implementation of ToPOSIX...
I think the transliteration APIs are out of IO scope, but the `System.IO.Path.ToPOSIXportableFilename` proposal seems reasonable.
We should consider addressing it together with https://github.com/dotnet/runtime/issues/25010 and https://github.com/dotnet/runtime/issues/25011.
I didn't know about 25010 and 25011 :) Some other observations: ";" in filenames is known to separate the filename from the file version (and some CD mastering programs add ";1" after each file). Spaces in filenames are valid in many file systems, but they make command line utilities cumbersome to use (Windows requires quotes; POSIX systems such as Linux require prefixing the space with a backslash). Using characters from outside the standard leads to many errors with non-Unicode OS calls or non-Unicode terminals (like "telnet", cmd.exe, etc.).
Transliteration is out of IO scope, but using it when converting a filename to a POSIX-compliant one would be beneficial.
Background and motivation
Using standard POSIX portable filenames (ISO 9945, XBD, clause 3.282) can be very useful, especially when files need to be truly portable. E.g. why use "błąd" or "łódź" if we can use "blad" or "lodz", and everything is still readable and can be understood correctly? If I want to create an *.m3u file, or something like that, I have to make the change described above. When one uses only POSIX portable filenames for resources accessible by URI, no URI encoding is necessary. And a standard is a standard, so it would be nice to have a simple way to use it.
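The URI claim can be checked with the existing `Uri.EscapeDataString` method, which leaves letters, digits, `-`, `.` and `_` untouched. The `UriSketch` class and `NeedsEscaping` helper below are hypothetical names used only for illustration.

```csharp
using System;

public static class UriSketch
{
    // True when the name changes under URI data-string escaping,
    // i.e. when it contains characters that must be percent-encoded.
    public static bool NeedsEscaping(string fileName) =>
        Uri.EscapeDataString(fileName) != fileName;
}
```

A portable name such as `blad.m3u` passes through unchanged, while `błąd.m3u` or `my file.m3u` would need percent-encoding.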
But filenames can be constructed from other writing systems, e.g. Cyrillic, Greek... So we can first transliterate them to the Latin alphabet (ISO 843, ISO 9, and similar).
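A table-driven transliteration step might look like the sketch below. It covers only the Russian alphabet with the ISO 9:1995 correspondences and is not the author's library code; the `CyrillicSketch` class is a hypothetical name, and the uppercase handling (uppercasing the mapped Latin letter) is an assumption. Note that ISO 9 output still contains diacritics such as "ž", so an accent-dropping step would have to follow to reach the portable character set.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CyrillicSketch
{
    // Lowercase Russian letters mapped per ISO 9:1995 (other Cyrillic letters omitted).
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        ['а'] = "a", ['б'] = "b", ['в'] = "v", ['г'] = "g", ['д'] = "d", ['е'] = "e",
        ['ё'] = "ë", ['ж'] = "ž", ['з'] = "z", ['и'] = "i", ['й'] = "j", ['к'] = "k",
        ['л'] = "l", ['м'] = "m", ['н'] = "n", ['о'] = "o", ['п'] = "p", ['р'] = "r",
        ['с'] = "s", ['т'] = "t", ['у'] = "u", ['ф'] = "f", ['х'] = "h", ['ц'] = "c",
        ['ч'] = "č", ['ш'] = "š", ['щ'] = "ŝ", ['ъ'] = "ʺ", ['ы'] = "y", ['ь'] = "ʹ",
        ['э'] = "è", ['ю'] = "û", ['я'] = "â"
    };

    public static string TransliterateCyrillicToLatin(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            char lower = char.ToLowerInvariant(c);
            if (Map.TryGetValue(lower, out var latin))
            {
                // Preserve case: an uppercase source letter gets an uppercase result.
                sb.Append(c == lower ? latin : latin.ToUpperInvariant());
            }
            else
            {
                // Characters outside the table are passed through unchanged.
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

For example, "Москва" becomes "Moskva" and "Жук" becomes "Žuk".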
I've created such methods in my own library, but I think they would be useful to other people too. I can create a PR with these methods, but first I'd like to hear some feedback on the proposal.
API Proposal
API Usage
Alternative Designs
No response
Risks
There is no risk of breaking changes, as these are new methods. The only risk is keeping such methods consistent with changes in ISO standards (changes to existing standards, or new standards with new transliterations).