gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/
Other
304 stars 44 forks source link

Implement `stri_sort_key()` #386

Closed DavisVaughan closed 4 years ago

DavisVaughan commented 4 years ago

Hi @gagolews,

I'd like to request that this new function, stri_sort_key(), be added to stringi. It generates a character sort key for a character vector, which can then be used as an ordering proxy by other sorting algorithms to order in a specific locale.

Specifically, it would be neat to use this in vctrs with vec_order(), where we use a character radix sort.

I've done the best I can with this PR, but it still requires some changes, and I was hoping you could help me out with that. Specifically:

This code crashes for me pretty consistently:

library(stringi)
#devtools::load_all(".")
set.seed(123)
x <- stringi::stri_rand_strings(1e6, sample(100, 1e6, TRUE))
y <- stringi:::stri_sort_key(x)

Compiling with CXX11FLAGS=-O0 -g and then running in lldb gives me:

libR.dylib was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 18971 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xf0002821)
    frame #0: 0x00000001001e0bfc libR.dylib`RunGenCollect(size_needed=16) at memory.c:1748:5 [opt]
Target 0: (R) stopped.
(lldb) up
frame #1: 0x00000001001d9ae5 libR.dylib`R_gc_internal(size_needed=16) at memory.c:3086:2 [opt]
(lldb) up
frame #2: 0x00000001001dae67 libR.dylib`Rf_allocVector3(type=9, length=86, allocator=0x0000000000000000) at memory.c:2738:2 [opt]
(lldb) up
(lldb) up
frame #4: 0x000000013289729e stringi.so`stri_sort_key(str=0x0000000108fd0000, opts_collator=0x00000001010052e0) at stri_sort.cpp:583:18
   580        // which we don't want to copy into the R CHARSXP
   581        R_len_t key_char_size = key_size - 1;
   582 
-> 583        p_ret[i] = Rf_mkCharLenCE((char*) key_buffer, key_char_size, CE_UTF8);
   584     }
   585 
   586     // TODO: Use some C++ alternative to malloc/free?

A few resources I used for learning about sort keys: http://userguide.icu-project.org/collation/examples http://userguide.icu-project.org/collation/architecture#TOC-Sort-Keys http://userguide.icu-project.org/collation/api https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucol_8h.html#a1f83f9c96a0950e2c22bd5c5c31ff6bf

gagolews commented 4 years ago

Wow, I appreciate your being so determined to gaze down into the abyss of stringi :)

For UTF-16 input, which is needed here, we have the StriContainerUTF16 collator class:

StriContainerUTF16 str_cont(str, vectorize_length); 

// extracting the i-th string:
const UnicodeString* str_cur_data = &(str_cont.get(i));
const UChar* str_cur_s = str_cur_data->getBuffer();
const int str_cur_n = str_cur_data->length();

For a use case, see stri_startswith_coll() in src/stri_search_coll_startsendswith.cpp.

gagolews commented 4 years ago

I have a class that serves as a (expanding) buffer for utf-8 data, see stri_trans_casemap() in src/stri_trans_casemap.cpp

R_len_t bufsize = str_cont.getMaxNumBytes();
bufsize += 10; // a small margin
String8buf buf(bufsize);

// ...
bufneed =  ucol_getSortKey(...)
if (bufneed > bufsize) {
bufsize = bufneed+1;
buf.resize(bufneed, false);
bufneed =  ucol_getSortKey(...)
}

// ...
SET_STRING_ELT(ret, i, Rf_mkCharLenCE(buf.data(), buf_need, CE_UTF8));

should do the trick?

DavisVaughan commented 4 years ago

Alright, I've used the StriContainerUTF16 and the String8buf, thanks!

Unfortunately I still get a crash, which still seems to come from Rf_mkCharLenCE() 😞 . Does anything else look suspicious?

The advice I've gotten from others is to ensure that the buffer is null terminated before calling Rf_mkCharLenCE(), but I've checked if key_char_size == strlen(key_buffer.data()) and it always seems to be true, so I don't think that is the problem.

DavisVaughan commented 4 years ago

Wow, okay, I think I have found the issue.

# I was doing:
p_ret[i] = Rf_mkCharLenCE(key_buffer.data(), key_char_size, CE_UTF8);

# Rather than:
SET_STRING_ELT(ret, i, Rf_mkCharLenCE(key_buffer.data(), key_char_size, CE_UTF8));

Changing to the second method no longer results in any crashes. I thought they were the same thing, but after looking at SET_STRING_ELT()s definition, and at a message from Luke Tierney about this, it does seem that SET_STRING_ELT() is safer https://github.com/wch/r-source/blob/abb550c99b3927e5fc03d12f1a8e7593fddc04d2/src/main/memory.c#L4025-L4043 http://homepage.stat.uiowa.edu/~luke/R/barrier.html

DavisVaughan commented 4 years ago

Alright @gagolews, I'm done and am ready for a review. I've added a few tests, documentation, and updated NEWS. Let me know if you need anything else, or feel free to take over and polish it off. Thanks for the help so far!