cynkra / constructive

Display Idiomatic Code to Construct Most R Objects
https://cynkra.github.io/constructive

General approach to recognise non-canonical memory structures? #362

Open moodymudskipper opened 6 months ago

moodymudskipper commented 6 months ago

R can create negative zeros, and non-canonical NAs and NaNs, that most R functions do not distinguish from the canonical values.

https://twitter.com/antoine_fabri/status/1778467270819213778

Should we take care of those? This part is not too hard; the only drawbacks are that it might confuse the user, and that we won't compress c(0, -0, 0, 0) into rep(0, 4), for instance.

However, the following shows that the sign does matter:

sign(1/(-0))
#> [1] -1
sign(1/0)
#> [1] 1
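
As a minimal sketch of how such values could be detected in base R: identical() has a num.eq argument which, when set to FALSE, compares doubles bitwise and so distinguishes -0 from 0. The helper is_neg_zero() is a hypothetical name used for illustration, not a constructive function.

```r
x <- c(0, -0, 0, 0)

# Default comparison uses ==, so the negative zero is invisible
identical(x, rep(0, 4))
#> [1] TRUE

# Bitwise comparison of doubles sees the sign bit
identical(x, rep(0, 4), num.eq = FALSE)
#> [1] FALSE

# Hypothetical helper: flag negative zeros element-wise (NA-safe)
is_neg_zero <- function(x) !is.na(x) & x == 0 & 1 / x < 0
is_neg_zero(x)
#> [1] FALSE  TRUE FALSE FALSE
```

The related single.NA = FALSE argument of identical() does the same for the bit patterns of NA and NaN.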

The same byte-level issue also comes up with bit64 integers: 0 and NA are considered identical, and all negative values are considered identical, because the package does some bit hacking.

Defining row.names as c(NA, -n) rather than 1:n also creates "identical" objects with a different serialisation.

We could also have other types of corruption, like below:

serialize(TRUE, NULL)
#>  [1] 58 0a 00 00 00 03 00 04 02 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
#> [26] 00 0a 00 00 00 01 00 00 00 01
serialize(FALSE, NULL)
#>  [1] 58 0a 00 00 00 03 00 04 02 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
#> [26] 00 0a 00 00 00 01 00 00 00 00
serialize(NA, NULL)
#>  [1] 58 0a 00 00 00 03 00 04 02 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
#> [26] 00 0a 00 00 00 01 80 00 00 00
true_s <- serialize(TRUE, NULL)

true_s2 <- true_s
true_s2[35] <- as.raw(2)
true_s2
#>  [1] 58 0a 00 00 00 03 00 04 02 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
#> [26] 00 0a 00 00 00 01 00 00 00 02
TRUE2 <- unserialize(true_s2)
TRUE2
#> [1] TRUE
identical(TRUE, TRUE2)
#> [1] FALSE
isTRUE(TRUE2)
#> [1] TRUE
rlang::is_true(TRUE2)
#> [1] FALSE

Created on 2024-04-12 with reprex v2.0.2

In that case it's interesting that identical() actually sees the difference, so we have 2 different TRUE values.

Encoding hell is another issue.

x <- "É"
y <- iconv(x, from="UTF-8", to="latin1")
identical(x, y)
#> [1] TRUE
serialize(x, NULL)
#>  [1] 58 0a 00 00 00 03 00 04 02 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
#> [26] 00 10 00 00 00 01 00 00 80 09 00 00 00 02 c3 89
serialize(y, NULL)
#>  [1] 58 0a 00 00 00 03 00 04 02 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
#> [26] 00 10 00 00 00 01 00 00 40 09 00 00 00 01 c9
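
A cheaper check than serialising could look at the encoding mark and the raw bytes directly, since Encoding() and charToRaw() expose the difference that identical() ignores. This is only a sketch; same_string_bytes() is a hypothetical helper name, and the "\u00c9" escape is used so the example does not depend on the encoding of the script file.

```r
x <- "\u00c9"                                  # "É", marked as UTF-8
y <- iconv(x, from = "UTF-8", to = "latin1")   # same character, latin1 bytes

Encoding(c(x, y))
#> [1] "UTF-8"  "latin1"
charToRaw(x)
#> [1] c3 89
charToRaw(y)
#> [1] c9

# Hypothetical helper: strings match only if both the encoding marks
# and the underlying bytes match
same_string_bytes <- function(a, b) {
  identical(Encoding(a), Encoding(b)) &&
    identical(lapply(a, charToRaw), lapply(b, charToRaw))
}
same_string_bytes(x, y)
#> [1] FALSE
```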

I'm afraid that if we're too aggressive about serialising everything it will slow down the package, but I also really want this package to be helpful in these difficult corner cases. Maybe we can have an argument for deep checks, and solve some specific cases with special casing.

moodymudskipper commented 6 months ago

Also waldo doesn't see those. Ultimately we really need our own waldo, with:

I suppose the output of construct_issues() is not used in snapshots so this should not be a breaking change in practice.

moodymudskipper commented 6 months ago

was closed by mistake

moodymudskipper commented 6 months ago

Maybe we test whether the serialisation is correct, and if it's not, we rerun more carefully?
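
A rough sketch of what that two-pass idea could look like, using base R only. All names here (same_bytes(), construct_two_pass(), fast(), careful()) are hypothetical stand-ins, not constructive functions; fast() is plain deparse(), and careful() is a toy fallback that only handles negative zeros in double vectors.

```r
# Hypothetical helper: TRUE when two objects serialise to the same bytes
same_bytes <- function(x, y) identical(serialize(x, NULL), serialize(y, NULL))

# Hypothetical driver: try the cheap generator first, and fall back to the
# careful one only when the rebuilt object does not serialise identically
construct_two_pass <- function(x, fast, careful) {
  code <- fast(x)
  rebuilt <- eval(parse(text = code), envir = baseenv())
  if (same_bytes(x, rebuilt)) code else careful(x)
}

# Cheap pass: ordinary deparse, which is expected to drop the sign of -0
fast <- function(x) paste(deparse(x), collapse = "\n")

# Careful pass (toy version, double vectors only): spell out negative zeros
careful <- function(x) {
  stopifnot(is.double(x))
  neg_zero <- !is.na(x) & x == 0 & 1 / x < 0
  elts <- vapply(x, deparse, character(1))
  elts[neg_zero] <- "-0"
  sprintf("c(%s)", paste(elts, collapse = ", "))
}

construct_two_pass(c(1, 2), fast, careful)
#> [1] "c(1, 2)"
construct_two_pass(c(0, -0, 0, 0), fast, careful)
#> [1] "c(0, -0, 0, 0)"
```

With this layout the serialisation cost is paid once per top-level object, and the slower path only runs for objects that fail the check.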