Open khwilliamson opened 2 weeks ago
My only issue with this is bytes_from_utf8() always pays the cost of an allocation, even when all of the characters are invariants, while the original code only pays that cost if an invariant is found.
I don't thing these hv functions have any business re-implementing utf8 translations. It's unfortunate the API is the way it is. I first tried the equivalent function utf8_to_bytes
which is destructive of the input. We know that the translated version will be no longer, and likely shorter than the original. So that should work. But I kept getting segfaults with what appeared to be valid writes. I came to the conclusion it must be writing to some shared or read-only memory.
So what to do. What strikes me immediately is to change the len parameter from STRLEN to SSize_t. I think we can get away with that without affecting existing code; but if I'm wrong this won't fly. If that is feasible, then a negative length would signal the function to just return the input if there are no variants. The caller would check if the return pointer is the same as the passed one.
Another option is to convert bool *is_utf8p
to an enum with an extra value that says turn off this flag and return the original source if everything is an invariant
That's not going to work either. How about a new function Size_t utf8_to_bytes_flags(U8 ** s, int type)
I have worked on this, and come with the following API proposal. Feedback welcome
"utf8_to_bytes_type"
"utf8_to_bytes"
"bytes_from_utf8"
NOTE: "utf8_to_bytes" is experimental and may change or be removed
without notice.
These each convert a string encoded as UTF-8 into the equivalent
native byte representation, if possible.
"utf8_to_bytes_type" has the more modern API, and is the easiest to
use. On input, "s_ptr" is a pointer to the string to be converted
(so that the first byte will be at *sptr[0]), and *lenp is its
length.
It returns non-zero if the end result is native bytes; zero if there
are code points in the string not expressible in native byte
encoding. In many situations, treating the return as a boolean is
sufficient, but it actually returns an enum "xxx" so you can tease
apart the reason it succeeded. "noop" means that the input already
was in bytes, so no action was necessary. "converted" means that the
conversion was successful. "cant_convert" is a synonym you can use
for 0 or "false".
In all cases, *s_ptr and *lenp will have correct and consistent
values, unchanged if the return is either 0 or "noop"; otherwise
updated as necessary.
If the return is "converted" and "type" is 0 (or the equivalent
"overwrite"), the converted value overwrote the input string. *s_ptr
will be unchanged, but *lenp will be updated to its new (shortened)
length.
If the return is "converted" and "type" is "preserve", the original
string is never changed. Instead a new "NUL"-terminated string is
allocated, and *s_ptr is changed to point to that new memory. The
caller is responsible for freeing that memory. "type" can also be
set to "use_temp". In this case the original string is also
preserved, and the converted string placed in newly allocated
memory, but that will have been marked for automatic destruction (by
using "SAVEFREEPV").
Note that the caller has to free the memory at *s_ptr if and only if
it called this function with "type" set to "preserve", and the
function returned "converted".
Upon successful return, the number of variants in the string can be
computed by having saved the value of *lenp before the call, and
subtracting the after-call value of *lenp from it. This is also true
for the other two functions described below.
Plain "utf8_to_bytes" also converts a UTF-8 encoded string to bytes,
but there are more glitches that the caller has to be prepared to
handle.
The input string is passed with one less indirection level, "s".
If the conversion was successful or a noop
The function returns "s" (unchanged) and *lenp will contain the
correct length.
If the conversion failed
The function returns NULL and sets *lenp to -1, cast to
"STRLEN". This means that you will have to use a temporary
containing the string length to pass to the function if you will
need the value afterwards.
"bytes_from_utf8" also converts a potentially UTF-8 encoded string
"s" to bytes. It preserves "s", allocating new memory for the
converted string.
In contrast to the other two functions, the input string to this one
need not be UTF-8. If not, the caller has set *is_utf8p to be
"false", and the function does nothing, returning the original "s".
Also do nothing if there are code points in the string not
expressible in native byte encoding, returning the original "s".
Otherwise, *is_utf8p is set to 0, and the return value is a pointer
to a newly created string containing the native byte equivalent of
"s", and whose length is returned in *lenp, updated. The new string
is "NUL"-terminated. The caller is responsible for arranging for the
memory used by this string to get freed.
The major problem with this function is that memory is allocated
and filled even when no conversion was necessary..
U8 utf8_to_bytes_type( U8 **s_ptr, STRLEN *lenp,
U32 type)
U8 * utf8_to_bytes ( U8 *s, STRLEN *lenp)
U8 * Perl_utf8_to_bytes (pTHX_ U8 *s, STRLEN *lenp)
U8 * bytes_from_utf8 ( const U8 *s, STRLEN *lenp,
bool *is_utf8p)
U8 * Perl_bytes_from_utf8 (pTHX_ const U8 *s, STRLEN *lenp,
bool *is_utf8p)
but that will have been marked for automatic destruction (by
using "SAVEFREEPV").
I didnt read the full commit, but the api looks flexible and efficient.
It covers 3 states. -input is output, array of U8s not changed -input was converted in place to output, and input contents destroyed (but alloc was reused/in place) -input was converted, input preserved, output is new malloc
One question, is there any way to undo in perlapi SAVEFREEPV
? like for sv_usepvn_flags
?
The function bytes_from_utf8() already does what what these two instances of duplicated code do.