Closed DavisVaughan closed 4 years ago
Wow, I appreciate your being so determined to gaze down into the abyss of stringi :)
For UTF-16 input, which is needed here, we have the StriContainerUTF16
collator class:
StriContainerUTF16 str_cont(str, vectorize_length);
// extracting the i-th string:
const UnicodeString* str_cur_data = &(str_cont.get(i));
const UChar* str_cur_s = str_cur_data->getBuffer();
const int str_cur_n = str_cur_data->length();
For a use case, see stri_startswith_coll()
in src/stri_search_coll_startsendswith.cpp
.
I have a class that serves as a (expanding) buffer for utf-8 data, see stri_trans_casemap()
in src/stri_trans_casemap.cpp
R_len_t bufsize = str_cont.getMaxNumBytes();
bufsize += 10; // a small margin
String8buf buf(bufsize);
// ...
bufneed = ucol_getSortKey(...)
if (bufneed > bufsize) {
bufsize = bufneed+1;
buf.resize(bufneed, false);
bufneed = ucol_getSortKey(...)
}
// ...
SET_STRING_ELT(ret, i, Rf_mkCharLenCE(buf.data(), buf_need, CE_UTF8));
should do the trick?
Alright, I've used the StriContainerUTF16
and the String8buf
, thanks!
Unfortunately I still get a crash, which still seems to come from Rf_mkCharLenCE()
😞 . Does anything else look suspicious?
The advice I've gotten from others is to ensure that the buffer is null terminated before calling Rf_mkCharLenCE()
, but I've checked if key_char_size == strlen(key_buffer.data())
and it always seems to be true, so I don't think that is the problem.
Wow, okay, I think I have found the issue.
# I was doing:
p_ret[i] = Rf_mkCharLenCE(key_buffer.data(), key_char_size, CE_UTF8);
# Rather than:
SET_STRING_ELT(ret, i, Rf_mkCharLenCE(key_buffer.data(), key_char_size, CE_UTF8));
Changing to the second method no longer results in any crashes. I thought they were the same thing, but after looking at SET_STRING_ELT()
s definition, and at a message from Luke Tierney about this, it does seem that SET_STRING_ELT()
is safer
https://github.com/wch/r-source/blob/abb550c99b3927e5fc03d12f1a8e7593fddc04d2/src/main/memory.c#L4025-L4043
http://homepage.stat.uiowa.edu/~luke/R/barrier.html
Alright @gagolews, I'm done and am ready for a review. I've added a few tests, documentation, and updated NEWS. Let me know if you need anything else, or feel free to take over and polish it off. Thanks for the help so far!
Hi @gagolews,
I'd like to request that this new function,
stri_sort_key()
, be added to stringi. It generates a character sort key for a character vector, which can then be used as an ordering proxy by other sorting algorithms to order in a specific locale.Specifically, it would be neat to use this in vctrs with
vec_order()
, where we use a character radix sort.I've done the best I can with this PR, but it still requires some changes, and I was hoping you could help me out with that. Specifically:
I use
malloc()
andfree()
to generate growable buffers. It seems safe, but you might have a better way.I assume my usage of
u_uastrcpy()
is correct to copy aconst char*
into aUChar*
, but I'm not really sure.The biggest issue is that, while this function generally seems to work, it crashes R with large character vectors. I've tried to debug this for a day or two now, but can't seem to figure out what it could be. The crash seems to come from
Rf_mkCharLenCE()
, but I can't figure out why. This is what I am hoping you could provide some insight on, since you are the stringi expert. See below.This code crashes for me pretty consistently:
Compiling with
CXX11FLAGS=-O0 -g
and then running in lldb gives me:A few resources I used for learning about sort keys: http://userguide.icu-project.org/collation/examples http://userguide.icu-project.org/collation/architecture#TOC-Sort-Keys http://userguide.icu-project.org/collation/api https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucol_8h.html#a1f83f9c96a0950e2c22bd5c5c31ff6bf