JuliaStrings / utf8proc

a clean C library for processing UTF-8 Unicode data
http://juliastrings.github.io/utf8proc/
Other
1.05k stars 140 forks source link

utf8proc_map_custom custom_data should be `const void *` (not `void *`) and could corrupt memory #249

Open liquidaty opened 1 year ago

liquidaty commented 1 year ago

Hi,

utf8proc_map_custom takes a void *custom_data parameter. However, if the custom_data is modified by the custom function, utf8proc_map_custom might not work as expected-- and possibly corrupt memory-- because utf8proc_map_custom calls utf8proc_decompose_custom twice, and only the second time's results are kept, but there is no way to "reset" the custom data to its initial state before the second call.

As an example, imagine I use a custom transformation to replace the first character with 'A', but keep the rest of the string. Using utf8proc_map_custom would seem easy enough:

struct ctx {
  char start_of_string
};

static utf8proc_int32_t replaceFirstCharWithA(utf8proc_int32_t codepoint, struct ctx *ctx) {
  if(ctx->start_of_string)
    codepoint = 'A';
  ctx->start_of_string = 0;
  return codepoint;
}

void test() {
   struct ctx ctx;
   ctx.start_of_string = 1;
   utf8proc_map_custom(..., replaceFirstCharWithA, &ctx);
}

However, this will not actually work because utf8proc_map_custom only keeps the results of its second utf8proc_decompose_custom call, at which time ctx->start_of_string will already be set to 0.

I believe this could also lead to memory corruption if the above example was run with an input string that had a multi-byte first character (in which case the first run of utf8proc_decompose_custom would receive a length assuming a single-byte first char, but the second run would write a multi-byte first char).

It would be nice if there was a way to fix this issue inside utf8proc_map_custom without changing its signature, but that does not seem possible. But at a minimum, it would seem safer and more accurate if custom_data was a const void * instead of a void * (and arguably, a bug in the latter form as it currently stands)

liquidaty commented 1 year ago

Here is an alternative implementation of utf8proc_map_custom that provides a means to address this issue via an additional parameter specifying the number of bytes (if any) that will be cleared at the address specified by custom_data prior to the second call of utf8proc_decompose_custom. That way, the custom function can assume that the specified number of bytes will always be zero the first time it is called on the second invocation of utf8proc_decompose_custom

/*
 * This is an exact duplicate of the original `utf8proc_map_custom`
 * other than the lines relating to `custom_data_reset_bytes`
 */

UTF8PROC_DLLEXPORT utf8proc_ssize_t
utf8proc_map_custom_mutable(
                            const utf8proc_uint8_t *str, utf8proc_ssize_t strlen,
                            utf8proc_uint8_t **dstptr, utf8proc_option_t options,
                            utf8proc_custom_func custom_func, void *custom_data,
                            size_t custom_data_reset_bytes
                            ) {
  utf8proc_int32_t *buffer;
  utf8proc_ssize_t result;
  *dstptr = NULL;
  result = utf8proc_decompose_custom(str, strlen, NULL, 0, options, custom_func, custom_data);
  if (result < 0) return result;
  buffer = (utf8proc_int32_t *) malloc(((utf8proc_size_t)result) * sizeof(utf8proc_int32_t) + 1);
  if (!buffer) return UTF8PROC_ERROR_NOMEM;
  if(custom_data_reset_bytes > 0 && custom_data != NULL)
    memset(custom_data, 0, custom_data_reset_bytes);
  result = utf8proc_decompose_custom(str, strlen, buffer, result, options, custom_func, custom_data);
  if (result < 0) {
    free(buffer);
    return result;
  }
  result = utf8proc_reencode(buffer, result, options);
  if (result < 0) {
    free(buffer);
    return result;
  }
  {
    utf8proc_int32_t *newptr;
    newptr = (utf8proc_int32_t *) realloc(buffer, (size_t)result+1);
    if (newptr) buffer = newptr;
  }
  *dstptr = (utf8proc_uint8_t *)buffer;
  return result;
}
stevengj commented 1 year ago

I don't think the utf8proc_map_custom should assume that it is called in any particular order on the string. The documentation should probably be clearer on this point — it is simply for character substitution, whereas for anything context-dependent you should do your own mapping.

However, I'm reluctant to make the pointer const. For example, what if it performs some expensive calculation and simply wants to memo-ize/cache the results in the state?

liquidaty commented 1 year ago

I think the core of the issue is that this "map function might call your callback once, probably will call it twice per element" behavior is just plain weird. Is there any other context in the world where a map function does that? In any case, it seems it should be either eliminated, or at least heavily documented and offering an alternative that is more transparent.


what if it performs some expensive calculation and simply wants to memo-ize/cache the results in the state?

OK, point taken. But, see above

I don't think the utf8proc_map_custom should assume that it is called in any particular order on the string

fair enough-- but in that case, I would think it also conventional to provide an alternative where by the callback accepts (and is passed) an index parameter indicating the element position


If it can't be possible to only call the custom function once, instead of twice, per element, then maybe a solution would be offer a more granular form that accepts two custom funcs. Note however that this would not address the problem of the order assumption.


utf8proc_ssize_t utf8proc_map_custom_2(
                            const utf8proc_uint8_t *str, utf8proc_ssize_t strlen,
                            utf8proc_uint8_t **dstptr, utf8proc_option_t options,
                            utf8proc_custom_func_with_index custom_func1, void *custom_data1,
                             utf8proc_custom_func_with_index custom_func2, void *custom_data2
                            ) {
...
  result = utf8proc_decompose_custom(str, strlen, NULL, 0, options, custom_func1, custom_data1);
...
  result = utf8proc_decompose_custom(str, strlen, buffer, result, options, custom_func2, custom_data2);
}

utf8proc_ssize_t utf8proc_map_custom(
                            const utf8proc_uint8_t *str, utf8proc_ssize_t strlen,
                            utf8proc_uint8_t **dstptr, utf8proc_option_t options,
                            utf8proc_custom_func custom_func, void *custom_data
                            ) {
  return utf8proc_map_custom_2(str, strlen, dstptr, options, custom_func, custom_data, custom_func, custom_data);
}