Implement `stri_sort_key()`

DavisVaughan commented 4 years ago

Hi @gagolews,

I'd like to request that this new function, stri_sort_key(), be added to stringi. It generates a character sort key for a character vector, which can then be used as an ordering proxy by other sorting algorithms to order in a specific locale.

Specifically, it would be neat to use this in vctrs with vec_order(), where we use a character radix sort.

I've done the best I can with this PR, but it still requires some changes, and I was hoping you could help me out with that. Specifically:

I use malloc() and free() to generate growable buffers. It seems safe, but you might have a better way.
I assume my usage of u_uastrcpy() is correct to copy a const char* into a UChar*, but I'm not really sure.
The biggest issue is that, while this function generally seems to work, it crashes R with large character vectors. I've tried to debug this for a day or two now, but can't seem to figure out what it could be. The crash seems to come from Rf_mkCharLenCE(), but I can't figure out why. This is what I am hoping you could provide some insight on, since you are the stringi expert. See below.

This code crashes for me pretty consistently:

library(stringi)
#devtools::load_all(".")
set.seed(123)
x <- stringi::stri_rand_strings(1e6, sample(100, 1e6, TRUE))
y <- stringi:::stri_sort_key(x)

Compiling with CXX11FLAGS=-O0 -g and then running in lldb gives me:

libR.dylib was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 18971 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xf0002821)
    frame #0: 0x00000001001e0bfc libR.dylib`RunGenCollect(size_needed=16) at memory.c:1748:5 [opt]
Target 0: (R) stopped.
(lldb) up
frame #1: 0x00000001001d9ae5 libR.dylib`R_gc_internal(size_needed=16) at memory.c:3086:2 [opt]
(lldb) up
frame #2: 0x00000001001dae67 libR.dylib`Rf_allocVector3(type=9, length=86, allocator=0x0000000000000000) at memory.c:2738:2 [opt]
(lldb) up
(lldb) up
frame #4: 0x000000013289729e stringi.so`stri_sort_key(str=0x0000000108fd0000, opts_collator=0x00000001010052e0) at stri_sort.cpp:583:18
   580        // which we don't want to copy into the R CHARSXP
   581        R_len_t key_char_size = key_size - 1;
   582 
-> 583        p_ret[i] = Rf_mkCharLenCE((char*) key_buffer, key_char_size, CE_UTF8);
   584     }
   585 
   586     // TODO: Use some C++ alternative to malloc/free?

A few resources I used for learning about sort keys: http://userguide.icu-project.org/collation/examples http://userguide.icu-project.org/collation/architecture#TOC-Sort-Keys http://userguide.icu-project.org/collation/api https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucol_8h.html#a1f83f9c96a0950e2c22bd5c5c31ff6bf

gagolews commented 4 years ago

Wow, I appreciate your being so determined to gaze down into the abyss of stringi :)

For UTF-16 input, which is needed here, we have the StriContainerUTF16 collator class:

StriContainerUTF16 str_cont(str, vectorize_length); 

// extracting the i-th string:
const UnicodeString* str_cur_data = &(str_cont.get(i));
const UChar* str_cur_s = str_cur_data->getBuffer();
const int str_cur_n = str_cur_data->length();

For a use case, see stri_startswith_coll() in src/stri_search_coll_startsendswith.cpp.

gagolews commented 4 years ago

I have a class that serves as a (expanding) buffer for utf-8 data, see stri_trans_casemap() in src/stri_trans_casemap.cpp

R_len_t bufsize = str_cont.getMaxNumBytes();
bufsize += 10; // a small margin
String8buf buf(bufsize);

// ...
bufneed =  ucol_getSortKey(...)
if (bufneed > bufsize) {
bufsize = bufneed+1;
buf.resize(bufneed, false);
bufneed =  ucol_getSortKey(...)
}

// ...
SET_STRING_ELT(ret, i, Rf_mkCharLenCE(buf.data(), buf_need, CE_UTF8));

should do the trick?

DavisVaughan commented 4 years ago

Alright, I've used the StriContainerUTF16 and the String8buf, thanks!

Unfortunately I still get a crash, which still seems to come from Rf_mkCharLenCE() 😞 . Does anything else look suspicious?

The advice I've gotten from others is to ensure that the buffer is null terminated before calling Rf_mkCharLenCE(), but I've checked if key_char_size == strlen(key_buffer.data()) and it always seems to be true, so I don't think that is the problem.

DavisVaughan commented 4 years ago

Wow, okay, I think I have found the issue.

# I was doing:
p_ret[i] = Rf_mkCharLenCE(key_buffer.data(), key_char_size, CE_UTF8);

# Rather than:
SET_STRING_ELT(ret, i, Rf_mkCharLenCE(key_buffer.data(), key_char_size, CE_UTF8));

Changing to the second method no longer results in any crashes. I thought they were the same thing, but after looking at SET_STRING_ELT()s definition, and at a message from Luke Tierney about this, it does seem that SET_STRING_ELT() is safer https://github.com/wch/r-source/blob/abb550c99b3927e5fc03d12f1a8e7593fddc04d2/src/main/memory.c#L4025-L4043 http://homepage.stat.uiowa.edu/~luke/R/barrier.html

DavisVaughan commented 4 years ago

Alright @gagolews, I'm done and am ready for a review. I've added a few tests, documentation, and updated NEWS. Let me know if you need anything else, or feel free to take over and polish it off. Thanks for the help so far!

gagolews / stringi

Implement `stri_sort_key()` #386