Open apaz-cli opened 2 years ago
For minor clarification, the conversions are still easy (just transcode), but we do not appear to have an equivalent of readbytes! for anything except UInt8, which means that even the other primitive bits types (Int8, UInt16, etc.) have no simple way to slurp in data from a file until EOF. And if we do read it in as bytes, it is quite awkward (and feels unnecessary) to copy it over into a UInt16 buffer in preparation for calling transcode.
Where, and for what, do you need to read UTF-16 (or UTF-32?!) in Julia itself? Note that there are also packages for those encodings, if the need is outside of Julia itself.
There is an operating system called Windows unfortunately :)
Yes, I mean where exactly in Julia itself (which file) you need to read such a UTF-16 file. :) I mean its contents (file names are always UTF-16 in the file system, contents need not be).
On Windows, Julia opens files with: https://github.com/JuliaLang/julia/blob/8a9589d5a5d9f6bbc3dd8cbcdfb93fa03527c796/src/support/ios.c#L950
deprecated because more-secure versions are available; see _sopen_s, _wsopen_s
It (and the replacement functions) has options that are not used:
_O_U16TEXT | Opens a file in Unicode UTF-16 mode.
_O_U8TEXT | Opens a file in Unicode UTF-8 mode.
_O_WTEXT | Opens a file in Unicode mode.
I'm not sure whether that applies, or whether this does instead: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170
ccs=encoding | Specifies the encoded character set to use (one of UTF-8, UTF-16LE, or UNICODE) for this file.
I'm not sure, but some of these options may only be available on recent (i.e. Windows 10+) versions.
Do all non-Windows platforms that we support have iconv()? It was standardized in POSIX 2001. If yes, it should be relatively simple to implement, depending on what the API should be. But grepping through the repo, I see that there's a libiconv jll, and libcurl doesn't assume its presence either.
Based on @vtjnash's comment, maybe the simplest solution is to support reading data in as Vector{UInt16} and Vector{UInt32} until EOF? Of course then there's the question of what to do if there are dangling bytes.
yeah, it is unclear if we should error or drop them, but either would probably be okay
Erroring may be the safest option, since the error can later be changed to something else if a more sensible alternative emerges?
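To make the "error on dangling bytes" option concrete, here is a minimal sketch of what such a read could look like in user code today. The name read_units is hypothetical and not part of Base, and the code units are taken in the host's native byte order, which is an assumption of this sketch rather than anything decided in this thread.

```julia
# Hypothetical helper, not part of Base: read `io` until EOF as a Vector{T},
# throwing if the byte count is not a whole number of T-sized units.
function read_units(io::IO, ::Type{T}) where {T<:Union{UInt16,UInt32}}
    bytes = read(io)  # Vector{UInt8}, read until EOF
    if rem(length(bytes), sizeof(T)) != 0
        throw(ArgumentError(
            "got $(length(bytes)) bytes, which is not a multiple of $(sizeof(T))"))
    end
    # Native byte order is assumed; collect copies into a plain Vector{T}.
    return collect(reinterpret(T, bytes))
end

# Usage: build the input from UInt16 values so the example does not depend
# on the host's endianness.
io = IOBuffer(collect(reinterpret(UInt8, UInt16[0x0068, 0x0069])))
units = read_units(io, UInt16)
```

Dropping the dangling bytes instead would only mean truncating bytes to the largest multiple of sizeof(T) before the reinterpret, so the two behaviours differ by a single line.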
We could add a method to readbytes!, though the name doesn't seem ideal, but it is still reading bytes from the file...
On Slack (#triage) there was some agreement that transcode(String, reinterpret(UInt16, read(io))) may be the current best strategy for reading a UTF-16 file into a String, though it currently lacks a measure of discoverability (i.e. it does not seem to be an obvious way to get this data from the file).
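For discoverability, the strategy above can be wrapped in a small function. This is only a sketch: the name read_utf16_string is hypothetical, the file is assumed to hold UTF-16 in the host's native byte order, and stripping a leading byte-order mark is an addition of this sketch, not something proposed in the thread.

```julia
# Hypothetical helper, not part of Base. Assumes native-byte-order UTF-16.
function read_utf16_string(path::AbstractString)
    bytes = read(path)                  # Vector{UInt8}, read until EOF
    units = reinterpret(UInt16, bytes)  # reinterpret rejects an odd byte count
    # Skip a leading byte-order mark (U+FEFF) if the file starts with one.
    start = (!isempty(units) && units[1] == 0xfeff) ? 2 : 1
    return transcode(String, units[start:end])
end

# Usage: write a small UTF-16 file (with a BOM) and read it back as a String.
path = tempname()
write(path, vcat(UInt16[0xfeff], transcode(UInt16, "héllo")))
s = read_utf16_string(path)
```

A file written on a machine with the opposite byte order would need its code units swapped (for example with bswap) before transcoding; that case is not handled here.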
A secondary note: write does not share this problem, since the ambiguous operation "write as much of this array as possible" never arises. Thus write(io, rand(UInt16, 4)) is already the complement of both read!(stdin, zeros(UInt16, 4)) and readbytes!.
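The write/read! symmetry can be seen with an in-memory buffer. This is a small illustration only; the IOBuffer and the sample values are chosen for the example and do not come from the thread.

```julia
# Round trip of a fixed-size UInt16 array through an in-memory stream.
io = IOBuffer()
data = UInt16[0x0068, 0x0069, 0x263a, 0x0021]
write(io, data)         # writes the whole array; there is no partial-write case
seekstart(io)

buf = zeros(UInt16, 4)  # read! needs the element count known up front
read!(io, buf)          # fills buf completely, or throws if the stream is short
```

The missing piece discussed here is the variant where the element count is not known in advance and the read simply continues until EOF.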
In trying to fix #47404, we noticed that it's really hard to read UTF16 data from files. Long ago there used to be utf16() and utf32() functions to do the conversion, but they were deprecated and removed. The prevailing narrative is "UTF8 everywhere," and I agree wholeheartedly. But reading UTF16 and UTF32 data, especially from a file, and converting to UTF8 ought to be easier.