JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

`readbytes!` method for element types other than UInt8 #47413

Open apaz-cli opened 2 years ago

apaz-cli commented 2 years ago

While trying to fix #47404, we noticed that it's really hard to read UTF-16 data from files. Long ago there were utf16() and utf32() functions to do the conversion, but they were deprecated and removed.

The prevailing narrative is "UTF8 everywhere," and I agree wholeheartedly. But reading UTF-16 and UTF-32 data, especially from a file, and converting it to UTF-8 ought to be easier.

vtjnash commented 2 years ago

As a minor clarification, the conversions are still easy (just transcode), but we do not appear to have an equivalent of readbytes! for anything except UInt8, which means that even the other primitive bits types (Int8, UInt16, etc.) have no simple way to slurp in data from a file until EOF. And if we do read it in as bytes, it is quite awkward (and feels unnecessary) to copy it over into a UInt16 buffer in preparation for calling transcode.
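To make the awkwardness concrete, here is a minimal sketch of the byte-level workaround described above. It uses an IOBuffer as a stand-in for a file and assumes the data is UTF-16LE on a little-endian host; the variable names are illustrative only.

```julia
# "hi" encoded as UTF-16LE bytes, standing in for the contents of a file.
io = IOBuffer(UInt8[0x68, 0x00, 0x69, 0x00])

# readbytes! only accepts a UInt8 buffer; a huge `nb` lets it grow the buffer until EOF.
bytes = UInt8[]
nread = readbytes!(io, bytes, typemax(Int))

# Copy the raw bytes into a separate UInt16 buffer before transcoding.
# Assumption: the host byte order matches the data (little-endian here).
units = Vector{UInt16}(undef, nread ÷ 2)
copyto!(reinterpret(UInt8, units), 1, bytes, 1, 2 * length(units))

text = transcode(String, units)
```

The extra allocation and copy are exactly the steps that a readbytes!-like method for UInt16 would make unnecessary.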

PallHaraldsson commented 2 years ago

Where in Julia itself do you need to read UTF-16 (or UTF-32?!), and what for? Note that there are also packages for those encodings, if the need is outside of Julia itself.

gbaraldi commented 2 years ago

There is an operating system called Windows unfortunately :)

PallHaraldsson commented 2 years ago

Yes, I mean where exactly in Julia itself (which file) you need to read such a UTF-16 file. :) I mean its contents (file names are always UTF-16 in the file system; contents need not be).

PallHaraldsson commented 2 years ago

Julia opens files (on Windows) with: https://github.com/JuliaLang/julia/blob/8a9589d5a5d9f6bbc3dd8cbcdfb93fa03527c796/src/support/ios.c#L950

deprecated because more-secure versions are available; see _sopen_s, _wsopen_s

It (and the replacement functions) has options that are not used:

_O_U16TEXT | Opens a file in Unicode UTF-16 mode.
_O_U8TEXT | Opens a file in Unicode UTF-8 mode.
_O_WTEXT | Opens a file in Unicode mode.

I'm not sure whether that applies, or whether this does instead: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170

ccs=encoding | Specifies the encoded character set to use (one of UTF-8, UTF-16LE, or UNICODE) for this file.

I'm not sure, but some of these options may only be available on recent (i.e. Windows 10+) versions.

apaz-cli commented 2 years ago

Do all non-Windows platforms that we support have iconv()? It was standardized in POSIX 2001. If so, it should be relatively simple to implement, depending on what the API should be. But grepping through the repo, I see that there's a libiconv jll, and libcurl doesn't assume its presence either.

StefanKarpinski commented 2 years ago

Based on @vtjnash's comment, maybe the simplest solution is to support reading data in as Vector{UInt16} and Vector{UInt32} until EOF? Of course then there's the question of what to do if there are dangling bytes.

vtjnash commented 2 years ago

yeah, it is unclear if we should error or drop them, but either would probably be okay

giordano commented 2 years ago

Throwing an error may be the safest option, since an error can later be changed to some other behavior if a more sensible alternative emerges?
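As a rough illustration of the "read until EOF, error on dangling bytes" option, here is a minimal sketch. The helper name read_units is hypothetical and not part of Base, and the sketch assumes the data is in the host's byte order.

```julia
# Hypothetical helper (not in Base): read `io` until EOF as a Vector{T},
# throwing if the byte count is not a multiple of sizeof(T).
function read_units(io::IO, ::Type{T}) where {T<:Union{UInt16,UInt32}}
    bytes = read(io)                                   # all remaining bytes
    n, leftover = divrem(length(bytes), sizeof(T))
    if leftover != 0
        throw(ArgumentError("$leftover dangling byte(s): input length is not a multiple of $(sizeof(T))"))
    end
    units = Vector{T}(undef, n)
    copyto!(reinterpret(UInt8, units), bytes)          # assumes host byte order
    return units
end

# Usage: four bytes become two UInt16 code units (little-endian host assumed).
units = read_units(IOBuffer(UInt8[0x68, 0x00, 0x69, 0x00]), UInt16)
```

Dropping the dangling bytes instead would only require replacing the throw with a truncation to n * sizeof(T) bytes, which is why erroring first leaves room to relax the behavior later.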

JeffBezanson commented 2 years ago

We could add a method to readbytes! --- the name doesn't seem ideal but it is still reading bytes from the file...

vtjnash commented 2 years ago

On Slack (#triage) there was some agreement that transcode(String, reinterpret(UInt16, read(io))) may currently be the best strategy for reading a UTF-16 file into a String, though it lacks discoverability (i.e. it is not an obvious way to get this data from the file).
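For reference, a small self-contained sketch of that strategy, using a temporary file. The sample string and variable names are illustrative; the file is written and read in the host's byte order, and no byte-order mark is handled.

```julia
original = "héllo, wörld"

# Write the string as UTF-16 code units (host byte order) to a temporary file.
path, out = mktemp()
write(out, transcode(UInt16, original))
close(out)

# Read all bytes, view them as UInt16 without copying, and transcode to a String.
decoded = open(path) do io
    transcode(String, reinterpret(UInt16, read(io)))
end

rm(path)
```

Note that reinterpret throws if the file has an odd number of bytes, so this approach already behaves like the "error on dangling bytes" option.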

A secondary note: write does not share this problem, because the ambiguous operation "write as much of this array as possible" does not arise. Thus write(io, rand(UInt16, 4)) is already the complement of both read!(stdin, zeros(UInt16, 4)) and readbytes!.
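A brief sketch of that asymmetry, using an IOBuffer instead of stdin (the data values are arbitrary): write and read! round-trip a UInt16 array of known length, while read! has no "as much as possible" mode and throws when the input runs short.

```julia
data = UInt16[0x0068, 0x0069, 0x0021, 0x000a]

io = IOBuffer()
nbytes = write(io, data)      # writes every element; returns the byte count
seekstart(io)

buf = zeros(UInt16, 4)
read!(io, buf)                # fills exactly length(buf) elements

# With only three bytes available, read! cannot fill two UInt16 elements.
short = IOBuffer(UInt8[0x01, 0x02, 0x03])
hit_eof = try
    read!(short, zeros(UInt16, 2))
    false
catch err
    err isa EOFError
end
```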