Open liquidaty opened 1 year ago
Here is an alternative implementation of utf8proc_map_custom
that provides a means to address this issue via an additional parameter specifying the number of bytes (if any) that will be cleared at the address specified by custom_data
prior to the second call of utf8proc_decompose_custom
. That way, the custom function can assume that the specified number of bytes will always be zero the first time it is called on the second invocation of utf8proc_decompose_custom
/*
* This is an exact duplicate of the original `utf8proc_map_custom`
* other than the lines relating to `custom_data_reset_bytes`
*/
UTF8PROC_DLLEXPORT utf8proc_ssize_t
utf8proc_map_custom_mutable(
const utf8proc_uint8_t *str, utf8proc_ssize_t strlen,
utf8proc_uint8_t **dstptr, utf8proc_option_t options,
utf8proc_custom_func custom_func, void *custom_data,
size_t custom_data_reset_bytes
) {
utf8proc_int32_t *buffer;
utf8proc_ssize_t result;
*dstptr = NULL;
result = utf8proc_decompose_custom(str, strlen, NULL, 0, options, custom_func, custom_data);
if (result < 0) return result;
buffer = (utf8proc_int32_t *) malloc(((utf8proc_size_t)result) * sizeof(utf8proc_int32_t) + 1);
if (!buffer) return UTF8PROC_ERROR_NOMEM;
if(custom_data_reset_bytes > 0 && custom_data != NULL)
memset(custom_data, 0, custom_data_reset_bytes);
result = utf8proc_decompose_custom(str, strlen, buffer, result, options, custom_func, custom_data);
if (result < 0) {
free(buffer);
return result;
}
result = utf8proc_reencode(buffer, result, options);
if (result < 0) {
free(buffer);
return result;
}
{
utf8proc_int32_t *newptr;
newptr = (utf8proc_int32_t *) realloc(buffer, (size_t)result+1);
if (newptr) buffer = newptr;
}
*dstptr = (utf8proc_uint8_t *)buffer;
return result;
}
I don't think the utf8proc_map_custom
should assume that it is called in any particular order on the string. The documentation should probably be clearer on this point — it is simply for character substitution, whereas for anything context-dependent you should do your own mapping.
However, I'm reluctant to make the pointer const
. For example, what if it performs some expensive calculation and simply wants to memo-ize/cache the results in the state?
I think the core of the issue is that this "map function might call your callback once, probably will call it twice per element" behavior is just plain weird. Is there any other context in the world where a map
function does that? In any case, it seems it should be either eliminated, or at least heavily documented and offering an alternative that is more transparent.
what if it performs some expensive calculation and simply wants to memo-ize/cache the results in the state?
OK, point taken. But, see above
I don't think the utf8proc_map_custom should assume that it is called in any particular order on the string
fair enough-- but in that case, I would think it also conventional to provide an alternative where by the callback accepts (and is passed) an index parameter indicating the element position
If it can't be possible to only call the custom function once, instead of twice, per element, then maybe a solution would be offer a more granular form that accepts two custom funcs. Note however that this would not address the problem of the order assumption.
utf8proc_ssize_t utf8proc_map_custom_2(
const utf8proc_uint8_t *str, utf8proc_ssize_t strlen,
utf8proc_uint8_t **dstptr, utf8proc_option_t options,
utf8proc_custom_func_with_index custom_func1, void *custom_data1,
utf8proc_custom_func_with_index custom_func2, void *custom_data2
) {
...
result = utf8proc_decompose_custom(str, strlen, NULL, 0, options, custom_func1, custom_data1);
...
result = utf8proc_decompose_custom(str, strlen, buffer, result, options, custom_func2, custom_data2);
}
utf8proc_ssize_t utf8proc_map_custom(
const utf8proc_uint8_t *str, utf8proc_ssize_t strlen,
utf8proc_uint8_t **dstptr, utf8proc_option_t options,
utf8proc_custom_func custom_func, void *custom_data
) {
return utf8proc_map_custom_2(str, strlen, dstptr, options, custom_func, custom_data, custom_func, custom_data);
}
Hi,
utf8proc_map_custom
takes avoid *custom_data
parameter. However, if thecustom_data
is modified by the custom function, utf8proc_map_custom might not work as expected-- and possibly corrupt memory-- becauseutf8proc_map_custom
callsutf8proc_decompose_custom
twice, and only the second time's results are kept, but there is no way to "reset" the custom data to its initial state before the second call.As an example, imagine I use a custom transformation to replace the first character with 'A', but keep the rest of the string. Using
utf8proc_map_custom
would seem easy enough:However, this will not actually work because
utf8proc_map_custom
only keeps the results of its secondutf8proc_decompose_custom
call, at which timectx->start_of_string
will already be set to 0.I believe this could also lead to memory corruption if the above example was run with an input string that had a multi-byte first character (in which case the first run of
utf8proc_decompose_custom
would receive a length assuming a single-byte first char, but the second run would write a multi-byte first char).It would be nice if there was a way to fix this issue inside
utf8proc_map_custom
without changing its signature, but that does not seem possible. But at a minimum, it would seem safer and more accurate ifcustom_data
was aconst void *
instead of avoid *
(and arguably, a bug in the latter form as it currently stands)