MichaelChirico / r-bugs

A ⚠️read-only⚠️mirror of https://bugs.r-project.org/

[BUGZILLA #15952] Problem with sort? inconsistent results. #5428

Open MichaelChirico opened 4 years ago

MichaelChirico commented 4 years ago

Created attachment 1652: unsortable character vectors

Getting unexpected behavior with sort and order.

I have two character vectors, char_1 and char_2, both of length 329.

This is what I get when I try to sort them the same way (output is given as comments):

length(unique(char_1)) # 329
length(unique(char_2)) # 329

identical(char_1, char_2) # FALSE
sum(char_1 %in% char_2) # 329
sum(char_2 %in% char_1) # 329

s1 <- sort(char_1)
### These vectors come from a (supposed) previous sorting but:
identical(char_1, s1) # FALSE

s2 <- sort(char_2)
identical(char_2, s2) # FALSE

identical(s1, s2) # FALSE (!)

# Moreover ...
identical(sort(s1), s1) ## FALSE (!!)
identical(sort(sort(s1)), sort(s1)) ## FALSE
identical(sort(sort(sort(s1))), sort(s1)) ## TRUE

So it seems sort() can't unequivocally sort them. Is this in any way expected? Using order() gives similar results.

Attached is an .RData file with the character vectors in question.

Output of sessionInfo():

R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Spanish_Mexico.1252 LC_CTYPE=Spanish_Mexico.1252 LC_MONETARY=Spanish_Mexico.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Mexico.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] tools_3.1.0



MichaelChirico commented 4 years ago

Equality in the collation order and string identity are not necessarily the same thing, and the collation order depends on your locale. Since you're not using a stable sort, elements that compare equal in the collation order of your locale, but are not identical strings, will end up in (semi)random order relative to each other.
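As an illustration of the point about ties, here is a minimal sketch. It does not use the reporter's data or locale: `collation_key()` is a hypothetical stand-in that ignores case, playing the role of a collation in which some distinct strings compare equal. `order(..., method = "radix")` is used because it is stable and compares in the C locale, which keeps the example deterministic.

```r
# Hypothetical collation key (an assumption for illustration only):
# case is ignored, so "Codul" and "codul" tie under this key.
collation_key <- function(x) tolower(x)

# Stable ordering by the key alone: tied elements keep their input order.
sort_by_key <- function(x) x[order(collation_key(x), method = "radix")]

# Same ordering, but ties are broken by the string itself (C-locale order),
# which makes the result independent of the input order.
sort_by_key_tiebreak <- function(x) {
  x[order(collation_key(x), x, method = "radix")]
}

a <- c("codul", "Codul", "abc")
b <- c("Codul", "codul", "abc")

sa <- sort_by_key(a)
sb <- sort_by_key(b)

setequal(a, b)     # the two vectors hold the same strings
identical(sa, sb)  # FALSE: the tied pair comes out in input order

identical(sort_by_key_tiebreak(a), sort_by_key_tiebreak(b))  # TRUE
```

With a sort that is not stable, the relative order of the tied pair would not even be tied to the input order, which matches the (semi)random behaviour described above.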



MichaelChirico commented 4 years ago

It is expected if collation is ambiguous, as it is in some locales. However, I do not have Spanish_Mexico.1252 to hand and I can't reproduce it with es_ES.UTF-8.
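For comparison, a sketch of how one might check the collation locale in effect and sort independently of it. The vector `x` is a small ASCII stand-in (an assumption, not the attached data); `method = "radix"` is documented to sort character vectors in the C locale, so the result should not depend on LC_COLLATE.

```r
# Report the collation locale currently in effect (the value is platform-specific).
Sys.getlocale("LC_COLLATE")

# Small ASCII stand-in for the attached vectors, which are not reproduced here.
x <- c("codul", "Codul", "abc", "Abc")

# method = "radix" sorts character vectors in the C locale.
s_c <- sort(x, method = "radix")

# Sorting an already sorted vector should change nothing.
identical(sort(s_c, method = "radix"), s_c)  # TRUE
```

Whether a C-locale ordering is acceptable depends on the use case, since it orders by code point rather than by the language's conventions.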

You need to dig deeper and find which strings are changing place, then see how they compare. Something along the lines of:

s1 <- sort(char_1)
all(s1[-1] > s1[-329])
which(!(s1[-1] > s1[-329]))

ss1 <- sort(s1)
identical(s1, ss1)
which(s1 != sort(s1))
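The same check can be wrapped in a small helper that does not hard-code the length 329. This is only a sketch: `adjacent_ties()` is a hypothetical name, and it assumes (as I understand R's comparison operators) that `>` follows the locale's collation while `!=` tests whether the strings themselves differ.

```r
# Hypothetical helper: list adjacent pairs of a supposedly sorted vector that
# are not strictly increasing under the current collation.
adjacent_ties <- function(s) {
  n <- length(s)
  if (n < 2L) {
    return(data.frame(index = integer(0), lower = character(0),
                      upper = character(0), distinct = logical(0),
                      stringsAsFactors = FALSE))
  }
  lower <- s[-n]   # elements 1 .. n-1
  upper <- s[-1L]  # elements 2 .. n
  idx <- which(!(upper > lower))
  data.frame(index = idx,
             lower = lower[idx],
             upper = upper[idx],
             # TRUE when the two strings differ although neither sorts first
             distinct = lower[idx] != upper[idx],
             stringsAsFactors = FALSE)
}

# ASCII stand-in data; the attached vectors are not reproduced here.
s <- c("a", "b", "b", "c")
res <- adjacent_ties(s)
res
```

Rows with `distinct` equal to TRUE are the interesting ones: pairs of different strings that the collation does not place in a strict order.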



MichaelChirico commented 4 years ago

Strings that change place:

Following:

which(!(s1[-1] > s1[-329])) -> ind
dput(s1[-1][ind])

c("adelante”", "paliativos", "subsecuentedoñasdoñ")

And after:

which(s1 != sort(s1)) -> ind2
dput(s1[ind2])

c("caas", "caastañedacrismatt", "cçdej", "ccde", "çodul", "codul", "conmanej", "conmapñai", "conm", "conmal", "estre", "estrech", "estrechabirads", "estrechamnet", "estrechap", "estrector", "estregen", "estreimient", "estreiñ", "näuse", "nause", "paliativosactual", "paliativosconsult", "paliativosh", "paliativoshay", "paliativosp", "paliativossdolor", "paliativosseñal", "paliativos", "paliativosasintomatica", "paliativoscuers", "paliativosna", "pequ", "pequen", "pequeñ", "pequeñacon", "pequeñans", "pequeñit", "pequeñom", "pequeñomedian", "pequeñosmedian", "pequieñ", "tanmañ", "tanm", "ulcer", "ülcer" )

