brodieG / diffobj

Compare R Objects with a Diff
229 stars 12 forks source link

Reached theoretically unreachable branch 2 #152

Closed hadley closed 3 years ago

hadley commented 3 years ago
a <- 1:20
b <- 10:500

diffobj::ses(a, b, max.diffs = 100)
#> Error in diff_myers(args[["a"]], args[["b"]], max.diffs = args[["max.diffs"]], : Internal Error: reached theoretically unreachable branch 2, contact maintainer.

Created on 2021-01-11 by the reprex package (v0.3.0.9001)

(Originally reported in https://github.com/r-lib/waldo/issues/62)

brodieG commented 3 years ago

Thanks for reporting. I'll look at this next week.

brodieG commented 3 years ago

This specific issue is fixed in the "development" branch. I'm going to wait and see if the other one can be reproduced + I just submitted to CRAN so I'm going to wait a bit to resubmit.

IndrajeetPatil commented 3 years ago

@brodieG I can confirm that the development branch indeed solves this problem.

I ran into this error message ("reached theoretically unreachable branch 2, contact maintainer.") while writing a snapshot test for HTML table output. But after updating to diffobj from the development branch, I no longer get this error and the snapshot test works perfectly! 😄

Reprex:

a <- 1:20
b <- 10:500

diffobj::ses(a, b, max.diffs = 100)
#> Warning in diff_myers(args[["a"]], args[["b"]], max.diffs =
#> args[["max.diffs"]], : Exceeded `max.diffs`: 511 vs 100 allowed. Diff is
#> probably suboptimal.
#> [1] "1,20c1,491"

Created on 2021-02-08 by the reprex package (v1.0.0)

brodieG commented 3 years ago

Thanks for confirming. At some point in the near future I'm going to sit down and stare really hard at the code to see if I can figure out how I can get the other "unreachable" branch to be reached, and (whether I figure it out or not), resubmit to CRAN.

gadenbuie commented 3 years ago

Here's another set of test cases for you, based around comparing tibbles. I tried to reduce the reprex but I couldn't locate the problematic behavior beyond isolating the problematic columns.

I'm seeing these two errors

#> Error in diff_myers(args[["a"]], args[["b"]], max.diffs = args[["max.diffs"]], : Logic Error: Exceeded buffer for finding fake snake; contact maintainer.

#> Error in diff_myers(args[["a"]], args[["b"]], max.diffs = args[["max.diffs"]], : Internal Error: reached theoretically unreachable branch 2, contact maintainer.

and the second one goes away when using the development branch, while the other is reproducible in both.

CRAN ``` r library(palmerpenguins) library(dplyr) comparison <- filter(penguins, year == 2007 | flipper_length_mm > 220) test <- penguins %>% filter(year == 2008) diffobj::ses(test$bill_length_mm, comparison$bill_length_mm, max.diffs = 100) #> Error in diff_myers(args[["a"]], args[["b"]], max.diffs = args[["max.diffs"]], : Logic Error: Exceeded buffer for finding fake snake; contact maintainer. test <- penguins %>% filter(year == 2007, flipper_length_mm > 200) diffobj::ses(test$sex, comparison$sex, max.diffs = 100) #> Error in diff_myers(args[["a"]], args[["b"]], max.diffs = args[["max.diffs"]], : Internal Error: reached theoretically unreachable branch 2, contact maintainer. ``` Created on 2021-03-01 by the [reprex package](https://reprex.tidyverse.org) (v1.0.0)
Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 3.6.3 Patched (2020-04-28 r79534) #> os macOS 10.16 #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz America/New_York #> date 2021-03-01 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> ! package * version date lib #> P assertthat 0.2.1 2019-03-21 [?] #> P backports 1.2.0 2020-11-02 [?] #> P cli 2.3.0.9000 2021-02-17 [?] #> P crayon 1.4.1 2021-02-08 [?] #> P diffobj 0.3.3 2021-01-07 [?] #> P digest 0.6.27 2020-10-24 [?] #> P dplyr * 1.0.2 2020-08-18 [?] #> P ellipsis 0.3.1 2020-05-15 [?] #> P evaluate 0.14 2019-05-28 [?] #> P fansi 0.4.2 2021-01-15 [?] #> P fs 1.5.0 2020-07-31 [?] #> P generics 0.1.0 2020-10-31 [?] #> P glue 1.4.2 2020-08-27 [?] #> P highr 0.8 2019-03-20 [?] #> P htmltools 0.5.1.9000 2021-02-25 [?] #> P knitr 1.30 2020-09-22 [?] #> P lifecycle 1.0.0 2021-02-15 [?] #> P magrittr 2.0.1 2020-11-17 [?] #> P palmerpenguins * 0.1.0 2020-07-23 [?] #> P pillar 1.5.0 2021-02-22 [?] #> P pkgconfig 2.0.3 2019-09-22 [?] #> P purrr 0.3.4 2020-04-17 [?] #> P R6 2.5.0 2020-10-28 [?] #> renv 0.12.3 2020-11-25 [1] #> P reprex 1.0.0 2021-01-27 [?] #> P rlang 0.4.10 2020-12-30 [?] #> P rmarkdown 2.7 2021-02-19 [?] #> P sessioninfo 1.1.1 2018-11-05 [?] #> P stringi 1.5.3 2020-09-09 [?] #> P stringr 1.4.0 2019-02-10 [?] #> P styler 1.3.2 2020-02-23 [?] #> P tibble 3.0.6 2021-01-29 [?] #> P tidyselect 1.1.0 2020-05-11 [?] #> P utf8 1.1.4 2018-05-24 [?] #> P vctrs 0.3.6 2020-12-17 [?] #> P withr 2.4.1 2021-01-26 [?] #> P xfun 0.19 2020-10-30 [?] #> P yaml 2.2.1 2020-02-01 [?] #> source #> CRAN (R 3.6.0) #> standard (@1.2.0) #> Github (r-lib/cli@9ef875e) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.0) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.0) #> Github (rstudio/htmltools@ac43afe) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.3) #> CRAN (R 3.6.0) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> standard (@0.12.3) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.3) #> CRAN (R 3.6.0) #> CRAN (R 3.6.2) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> standard (@3.0.6) #> CRAN (R 3.6.2) #> CRAN (R 3.6.0) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.0) #> #> P ── Loaded and on-disk path mismatch. ```
development branch ``` r library(palmerpenguins) library(dplyr) comparison <- filter(penguins, year == 2007 | flipper_length_mm > 220) test <- penguins %>% filter(year == 2008) diffobj::ses(test$bill_length_mm, comparison$bill_length_mm, max.diffs = 100) #> Error in diff_myers(args[["a"]], args[["b"]], max.diffs = args[["max.diffs"]], : Logic Error: Exceeded buffer for finding fake snake; contact maintainer. test <- penguins %>% filter(year == 2007, flipper_length_mm > 200) diffobj::ses(test$sex, comparison$sex, max.diffs = 100) #> Warning in diff_myers(args[["a"]], args[["b"]], max.diffs = #> args[["max.diffs"]], : Exceeded `max.diffs`: 130 vs 100 allowed. Diff is #> probably suboptimal. #> [1] "1d0" "4,5c3,4" "6a6" "9,12c9,12" "15d14" "17a17" #> [7] "21d20" "22a22" "25d24" "27c26" "29a29" "31d30" #> [13] "32a32" "35c35" "36a37,140" ``` Created on 2021-03-01 by the [reprex package](https://reprex.tidyverse.org) (v1.0.0)
Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 3.6.3 Patched (2020-04-28 r79534) #> os macOS 10.16 #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz America/New_York #> date 2021-03-01 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> ! package * version date lib #> P assertthat 0.2.1 2019-03-21 [?] #> P backports 1.2.0 2020-11-02 [?] #> P cli 2.3.0.9000 2021-02-17 [?] #> P crayon 1.4.1 2021-02-08 [?] #> diffobj 0.3.3.9000 2021-03-01 [1] #> P digest 0.6.27 2020-10-24 [?] #> P dplyr * 1.0.2 2020-08-18 [?] #> P ellipsis 0.3.1 2020-05-15 [?] #> P evaluate 0.14 2019-05-28 [?] #> P fansi 0.4.2 2021-01-15 [?] #> P fs 1.5.0 2020-07-31 [?] #> P generics 0.1.0 2020-10-31 [?] #> P glue 1.4.2 2020-08-27 [?] #> P highr 0.8 2019-03-20 [?] #> P htmltools 0.5.1.9000 2021-02-25 [?] #> P knitr 1.30 2020-09-22 [?] #> P lifecycle 1.0.0 2021-02-15 [?] #> P magrittr 2.0.1 2020-11-17 [?] #> P palmerpenguins * 0.1.0 2020-07-23 [?] #> P pillar 1.5.0 2021-02-22 [?] #> P pkgconfig 2.0.3 2019-09-22 [?] #> P purrr 0.3.4 2020-04-17 [?] #> P R6 2.5.0 2020-10-28 [?] #> renv 0.12.3 2020-11-25 [1] #> P reprex 1.0.0 2021-01-27 [?] #> P rlang 0.4.10 2020-12-30 [?] #> P rmarkdown 2.7 2021-02-19 [?] #> P sessioninfo 1.1.1 2018-11-05 [?] #> P stringi 1.5.3 2020-09-09 [?] #> P stringr 1.4.0 2019-02-10 [?] #> P styler 1.3.2 2020-02-23 [?] #> P tibble 3.0.6 2021-01-29 [?] #> P tidyselect 1.1.0 2020-05-11 [?] #> P utf8 1.1.4 2018-05-24 [?] #> P vctrs 0.3.6 2020-12-17 [?] #> P withr 2.4.1 2021-01-26 [?] #> P xfun 0.19 2020-10-30 [?] #> P yaml 2.2.1 2020-02-01 [?] #> source #> CRAN (R 3.6.0) #> standard (@1.2.0) #> Github (r-lib/cli@9ef875e) #> CRAN (R 3.6.2) #> Github (brodieg/diffobj@e718369) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.0) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.0) #> Github (rstudio/htmltools@ac43afe) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.3) #> CRAN (R 3.6.0) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> standard (@0.12.3) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.3) #> CRAN (R 3.6.0) #> CRAN (R 3.6.2) #> CRAN (R 3.6.0) #> CRAN (R 3.6.0) #> standard (@3.0.6) #> CRAN (R 3.6.2) #> CRAN (R 3.6.0) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.2) #> CRAN (R 3.6.0) #> #> P ── Loaded and on-disk path mismatch. ```
brodieG commented 3 years ago

Perfect, thanks. I'll look in the next couple of days.

gadenbuie commented 3 years ago

Oh and one more data point for the fake snake error message. In my reprex, the error depends on max.diff and appears only when that value is between 72 and 203.

brodieG commented 3 years ago

The development branch now addresses @gadenbuie additional case. I want to look at this more closely as it's clear that the cases with low max.diffs are not well covered, so I'm not going to submit to CRAN just yet. I've found a couple of other issues already.

Aside, while it's great that you all are testing these low max.diffs cases, I'm wonder a little bit why. I imagine performance should be acceptable for the likely use case (e.g. waldo output in up to ~15ish failed tests). For example, on my very slow system I get:

> a <- sample(letters[1:5], 1000, replace=TRUE)
> b <- sample(letters[1:5], 1000, replace=TRUE)
> system.time(ses_dat(a, b))
   user  system elapsed 
  0.020   0.000   0.021 

These could run with max.diffs=1e3.

hadley commented 3 years ago

I'm setting the number low not because of performance, but because for most of the waldo cases, once you beyond a small number of diffs you're probably comparing two unrelated vectors, and it's simpler to just print them side by side, rather than displaying a complex diff (e.g. https://github.com/r-lib/waldo/issues/51). Is there a better approach?

brodieG commented 3 years ago

That makes sense. Maybe the thing to do there is to compare the match percentage, and if it is low report the vectors side by side instead of the diff'ed version? Using Garrick's example, we have:

> a <- test$bill_length_mm
> b <- comparison$bill_length_mm
> table(diffobj::ses_dat(a, b)[['op']])

 Match Insert Delete 
    25    115     89 

This requires some additional post-processing in waldo, but should be better than having a half-baked diff from hitting max.diffs. Whether or not that's worth the effort is another question.

hadley commented 3 years ago

Ah good idea, I've made a note to look into that next time I'm working on waldo.