gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/

Can't generate obscure Unicode code points #360

Closed by amanka 4 years ago

amanka commented 4 years ago

I think I found an error in stringi::stri_rand_strings.

I'm running this: stringi::stri_rand_strings(n = n, length = 8, pattern = "[\\u0000-\\U0010ffff]")

?`stringi-search-charclass`

`[\u0000-\U0010ffff]` Range – match all characters.

Note: I'm using two backslashes (\\), not one. If I use a single backslash, I get this:

Browse[3]> .Call(C_stri_rand_strings, n=1, length, pattern = "[\u0000-\U0010ffff]")
Error: nul character not allowed (line 1)
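
The difference between the two spellings can be reproduced in base R without stringi. The following is a minimal sketch: with doubled backslashes the string keeps the escape sequences as literal text for the regex engine to interpret, while with single backslashes the R parser tries to decode `\u0000` itself and rejects it. The variable names are illustrative only, and the exact wording of the parse error may differ between R versions.

```r
# Doubled backslashes: R stores the escapes as literal text, so the
# pattern engine (not the R parser) gets to interpret \u0000.
pattern <- "[\\u0000-\\U0010ffff]"
cat(pattern, "\n")            # shows the pattern with single backslashes
n_chars <- nchar(pattern)     # counts the literal characters in the pattern

# Single backslashes: build the source text "[\u0000-\U0010ffff]" and ask
# the R parser to read it. The parser decodes \u0000 to NUL and refuses.
src <- '"[\\u0000-\\U0010ffff]"'
parse_msg <- tryCatch(
  {
    parse(text = src)
    "parsed without error"
  },
  error = function(e) conditionMessage(e)
)
print(parse_msg)
```

So the single-backslash error comes from the R parser before stringi is ever called; it is separate from the intermittent "internal error" reported further down.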

When debugging it, I can follow it to here:

function (n, length, pattern = "[A-Za-z0-9]") 
{
  .Call(C_stri_rand_strings, n, length, pattern)
}

Which produces:

Error in debug(myfunc()) : could not find function "<U+0005F8EB><U+00066763><U+5724><U+000AC2C4><U+000F9977><U+000C9951><U+00095750><U+0003C371>"

The error doesn't always happen; the probability of it occurring seems to increase with n. Here is some sample output:

> stringi::stri_rand_strings(n = 10, length = 8, pattern = "[\\u0000-\\U0010ffff]")
 [1] "\U000ccc2d\U00045987\U0010fdcb\U0009235d\U0005cc91\U000d7cb2\U0010a752\U00057d38"                   "\U0006557f\U0008cdf2\U00076510\U00065bf2\U0003dc10\U00031d1e\U00046cbc\U000e9fff"         
 [3] "\U0008d080\U0010101f\U000fabb9\U000f238c\U000fbc7a\U0002ed1f\U0003f756\U000ff3a1"                            "\U0006e26c\U000a0f8c\U00023992\U0002dd35\U0002f3ec\U0002d179\U00012dad\U000dde3f"
 [5] "\U000c7a79\U000f4a89\U0010a02f\U000a58de\U00084349\U000ac462\U00055825\U000f53d5"                                     "\U00092d1c\ua9d0\U0002bc01\U000dd62d\U0007b2b6\U0005d209\U000891b3\U0010e143"                      
 [7] "\U0008e133\U000bed88뾚\U000d2a8b\U0005c1d2\U000be6e8\U00086831\U00091bc8"                           "\U0002b339\U0010761c\U0002119f\U000d7194\U000ede7e\U00065e5c\U00043c74\U0008975a"                           
 [9] "ᆽ\U0001d01e\U00024c64\U000108c5·\U0009c023\U000d8d63"                             "\U00034ab7\U000d365c\U0002ce97\U00045b41\U00075692\U000aa7f4\U000e2052\U000dc52c"                           
> stringi::stri_rand_strings(n = 10, length = 8, pattern = "[\\u0000-\\U0010ffff]")
 [1] "\U00036fb0\U0004e44a\U000f2380\U00106957\U000d71c0\U000273ac\U00025974\U000156d5"               "\U0008d280\U000830a4\U0007be13\U0009725b\U0005f0ec\U00046fbb\U000409dd\U0010376b"               "\U000a9a1f\U000df1ae\U000eb891\U000fc733\U0006e4a3\U000968ee\U00029c1e\U000e2c52"      "\U0001392e\U0007d13d\U000cb0bd\ua6e0\U000d4138\U000b2feb\U001043d4\U0002e8a4"
 [5] "\U000e9bf7\U000fb9f3\U00067437\U0009af75\U00106503\U000f7f8b\U000afb8c\U00068497"               "\U000738ef\U000f440e\U00073924\U000d043b\U000dd515\U000da1a9\U000715b3\U00049c32"               "\U000dadf4\U00020232\U0008f94e\U000aae7a\U0008ccb7\U000f6e7d\U00068aae\U000da4c8"      "\U000160a8\U0001eb9d\U00021761\U00103e55\U0002f389\U00048c1b\U00046a4b\U0004181e"     
 [9] "\U000593df\U000aa0b0\U000842d0\U000c0156\U00019372\U0010b765찋\U00066a6d"              "\U000d651d\U0001435a\U0009c447\U000791eb\U000d3112\U001006d7\U000b5f2e\U00105b7d"              
> stringi::stri_rand_strings(n = 10, length = 8, pattern = "[\\u0000-\\U0010ffff]")
Error in stringi::stri_rand_strings(n = 10, length = 8, pattern = "[\\u0000-\\U0010ffff]") : 
  internal error
> stringi::stri_rand_strings(n = 10, length = 8, pattern = "[\\u0000-\\U0010ffff]")
Error in stringi::stri_rand_strings(n = 10, length = 8, pattern = "[\\u0000-\\U0010ffff]") : 
  internal error

I can't find C_stri_rand_strings (presumably C code) in this repo, so I can't investigate further.

gagolews commented 4 years ago

In R (not just in stringi), you can't insert a NUL character (code point 0 in both ASCII and UTF-8) into a string, because NUL marks the end of a string internally.

See https://en.wikipedia.org/wiki/Null_character and https://en.wikipedia.org/wiki/Null-terminated_string
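
As a rough illustration of the point above, the sketch below first shows base R refusing to build a string with an embedded NUL, then tries a character class that starts at U+0001 instead of U+0000. Skipping the surrogate block (U+D800 to U+DFFF) as well is an extra precaution and an assumption on my part; nothing in this thread confirms that surrogates are involved. `safe_pattern` is a made-up name, and the stringi part only runs if the package is installed.

```r
# Base R: a character string cannot hold an embedded NUL byte.
bytes <- as.raw(c(0x61, 0x00, 0x62))   # 'a', NUL, 'b'
nul_msg <- tryCatch(rawToChar(bytes), error = function(e) conditionMessage(e))
print(nul_msg)

# Assumed workaround: start the range at U+0001 so NUL can never be drawn,
# and (as a precaution only) leave out the surrogate code points.
safe_pattern <- "[\\u0001-\\ud7ff\\ue000-\\U0010ffff]"

if (requireNamespace("stringi", quietly = TRUE)) {
  x <- stringi::stri_rand_strings(n = 10, length = 8, pattern = safe_pattern)
  # Decode to code points and check that none of them is 0.
  code_points <- unlist(stringi::stri_enc_toutf32(x))
  print(any(code_points == 0L))
}
```

Whether this removes the intermittent "internal error" entirely is untested here; it only takes the NUL code point out of the set being sampled.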

HTH