Open MichaelChirico opened 4 years ago
Collation order and uniqueness are not necessarily identical, and the result will depend on your locale. Since you're not using a stable sort, elements that compare equal in your locale's collation order but are not identical values will end up in (semi)random order.
This is expected if collation is ambiguous, as it is in some locales. However, I do not have Spanish_Mexico.1252 to hand, and I can't reproduce it with es_ES.UTF-8.
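One way to take the locale out of the picture is to sort under the C collation, where strings are compared without locale-specific tie rules. The following is only a sketch: `sort_c` is a hypothetical helper name, and the sample strings are a handful taken from the vectors discussed in this report, written with `\u` escapes so the snippet does not depend on the file encoding.

```r
# Hypothetical helper: sort a character vector under the "C" collation,
# restoring the previous LC_COLLATE setting afterwards.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

# A few accented/unaccented pairs of the kind that swap places in this report.
x <- c("n\u00e4use", "nause", "\u00fclcer", "ulcer", "codul", "\u00e7odul")

r1 <- sort_c(x)
r2 <- sort_c(r1)

# Sorting an already sorted vector should leave it unchanged.
identical(r1, r2)
```

The resulting order is not the linguistically natural one for Spanish, so this is a diagnostic or a workaround rather than a fix. Newer R versions also offer `sort(x, method = "radix")`, which is documented to use C-locale ordering for character vectors.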
You need to dig deeper and find which strings are changing place, then see how they compare. Something along the lines of:
s1 <- sort(char_1)
all(s1[-1] > s1[-329])
which(!(s1[-1] > s1[-329]))
ss1 <- sort(s1)
identical(s1, ss1)
which(s1 != sort(s1))
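The adjacent-pair check above hard-codes the length 329. A small generalisation, offered as a sketch, is shown below; `find_unordered` is a hypothetical helper name, not something from base R. It returns the positions `i` where element `i + 1` does not compare strictly greater than element `i` under the current collation.

```r
# Hypothetical helper: positions i such that s[i + 1] > s[i] is not TRUE,
# i.e. adjacent elements that are tied or out of order in the current locale.
find_unordered <- function(s) {
  n <- length(s)
  if (n < 2L) return(integer(0))
  which(!(s[-1L] > s[-n]))
}

# Toy input with one exact duplicate, so that one tie is reported.
s <- c("a", "b", "b", "c")
ind <- find_unordered(s)

# Show each offending pair side by side.
data.frame(left = s[ind], right = s[ind + 1L])
```

Applied to `s1`, the same call would list the neighbouring strings that the locale cannot separate, which are the candidates for changing place between successive sorts.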
Strings that change place:
Following:
which(!(s1[-1] > s1[-329])) -> ind
dput(s1[-1][ind])
c("adelante”", "paliativos", "subsecuentedoñasdoñ")
which(s1 != sort(s1)) -> ind2
dput(s1[ind2])
c("caas", "caastañedacrismatt", "cçdej", "ccde", "çodul", "codul", "conmanej", "conmapñai", "conm", "conmal", "estre", "estrech", "estrechabirads", "estrechamnet", "estrechap", "estrector", "estregen", "estreimient", "estreiñ", "näuse", "nause", "paliativosactual", "paliativosconsult", "paliativosh", "paliativoshay", "paliativosp", "paliativossdolor", "paliativosseñal", "paliativos", "paliativosasintomatica", "paliativoscuers", "paliativosna", "pequ", "pequen", "pequeñ", "pequeñacon", "pequeñans", "pequeñit", "pequeñom", "pequeñomedian", "pequeñosmedian", "pequieñ", "tanmañ", "tanm", "ulcer", "ülcer" )
Created attachment 1652: unsortable character vectors
Getting unexpected behavior with sort and order.
I have 2 character vectors: char_1 and char_2. Both have length 329.
I get the following when I try to sort them the same way; the output is given as a comment:
length(unique(char_1)) # 329
length(unique(char_2)) # 329
identical(char_1, char_2) # FALSE
sum(char_1 %in% char_2) # 329
sum(char_2 %in% char_1) # 329
s1 <- sort(char_1)
### These vectors come from a (supposed) previous sorting but:
identical(char_1, s1) # FALSE
s2 <- sort(char_2)
identical(char_2, s2) # FALSE
identical(s1, s2) # FALSE (!)
# Moreover ...
identical(sort(s1), s1) ## FALSE (!!)
identical(sort(sort(s1)), sort(s1)) ## FALSE
identical(sort(sort(sort(s1))), sort(s1)) ## TRUE
So it seems sort() can't unequivocally sort them. Is this in any way expected? Using order() gives similar results.
Attached is an .RData file with the character vectors in question.
Output of sessionInfo():
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Spanish_Mexico.1252  LC_CTYPE=Spanish_Mexico.1252  LC_MONETARY=Spanish_Mexico.1252
[4] LC_NUMERIC=C  LC_TIME=Spanish_Mexico.1252

attached base packages:
[1] stats  graphics  grDevices  utils  datasets  methods  base

loaded via a namespace (and not attached):
[1] tools_3.1.0