JanMarvin / openxlsx2

openxlsx2 - read, write and modify xlsx files
https://janmarvin.github.io/openxlsx2/
Other
111 stars 11 forks source link

return shared formulas #1072

Closed d-lukasz closed 5 days ago

d-lukasz commented 3 weeks ago

Hi! I really like the package, helps me a lot when working with .xlsb files. Now I wanted to use the function wb_to_df to compare the formulas between the two files. I can see than whenever the formula was dragged down in the column, the formula is displayed only twice then nothing. I understand this is an approach to compress the information. When doing wb_load I can see an object being created table$worksheets[[5]]$sheet_data$cc that stores information on formula and if it "shared". Would it be possible to add the feature that allows to display all the formulas, even if they are dragged? Or maybe there is some easy way to override this that I can implement on my side?

JanMarvin commented 3 weeks ago

Hi @d-lukasz , thanks for the kind words! I’ve been working a lot on xlsb lately and I’m always happy to get some feedback. Especially since I don’t use it a lot.

regarding shared formulas: unfortunately I don’t have the answer you are looking for. We’d have to modify the formulas from the original cell. Unfortunately this is not trivial for all formulas. Especially if there are fixed columns or rows. If you want to look into this, that would be a very welcome addition.

The shared cells are in the cc table f_si column. Iirc it’s a zero based index.

JanMarvin commented 3 weeks ago

If you do not care about the formulas being resolved, you can try something like this:

library(openxlsx2)
library(dplyr)

packageVersion("openxlsx2")
#> [1] '1.7.0.9000'

# load an example file
wb <- wb_load("https://github.com/JanMarvin/openxlsx-data/raw/main/formula_checks.xlsb")

# shared formulas are on Sheet4 in range A9:D12
wb_to_df(wb, sheet = "Sheet4", dims = "A9:D12", col_names = FALSE)
#>             A          B          C          D
#> 9  2024-06-27 2024-06-27 2024-06-27 2024-06-27
#> 10 2024-06-27 2024-06-27 2024-06-27 2024-06-27
#> 11 2024-06-27 2024-06-27 2024-06-27 2024-06-27
#> 12 2024-06-27 2024-06-27 2024-06-27 2024-06-27

# only the formula that is shared returns a formula
wb_to_df(wb, sheet = "Sheet4", dims = "A9:D12", col_names = FALSE, show_formula = TRUE)
#>             A          B          C          D
#> 9     TODAY() 2024-06-27 2024-06-27 2024-06-27
#> 10 2024-06-27 2024-06-27 2024-06-27 2024-06-27
#> 11 2024-06-27 2024-06-27 2024-06-27 2024-06-27
#> 12 2024-06-27 2024-06-27 2024-06-27 2024-06-27

# open heart surgery
cc <- wb$worksheets[[4]]$sheet_data$cc

# fill the shared cells with formulas
cc_shared <- cc %>%
  filter(f_t == "shared") %>%
  group_by(f_si) %>%
  mutate(f = na_if(f, "")) %>%
  tidyr::fill(f, .direction = "down") %>%
  ungroup() %>%
  mutate(f = tidyr::replace_na(f, ""))

cc[match(cc_shared$r, cc$r), ] <- cc_shared

# put the heart back into the body
wb$worksheets[[4]]$sheet_data$cc <- cc

# now all formulas are shown
wb_to_df(wb, sheet = "Sheet4", dims = "A9:D12", col_names = FALSE, show_formula = TRUE)
#>          A       B       C       D
#> 9  TODAY() TODAY() TODAY() TODAY()
#> 10 TODAY() TODAY() TODAY() TODAY()
#> 11 TODAY() TODAY() TODAY() TODAY()
#> 12 TODAY() TODAY() TODAY() TODAY()

Otherwise we'd have to do some more formula evaluation:

So for a single cell formula =A1, if moving right we have to create =B1, =C1 etc. If moving to the bottom, =A2, =A3. For fixed columns and rows and 2D dims A1:B2 -> B1:C2 - > C1:D2 etc.

Please feel free to pick this up and write a function that solves this. A base R only PR is welcome!

JanMarvin commented 3 weeks ago

Changed the title, because it’s the same for xlsx/xlsm files

JanMarvin commented 6 days ago

Something like this might work

find_a1_notation <- function(string) {
  pattern <- "\\$?[A-Z]\\$?[0-9]+(:\\$?[A-Z]\\$?[0-9]+)?"
  stringi::stri_extract_all_regex(string, pattern)[[1]]
}

# Function to find the next cell or range
next_cell <- function(cell, cols = 0L, rows = 0L) {
  z <- NULL
  for (i in cell) {
    # Match individual cells and ranges
    match <- stringi::stri_match_first_regex(i, "^(\\$?)([A-Z]+)(\\$?)(\\d+)(:(\\$?)([A-Z]+)(\\$?)(\\d+))?$")

    if (is.na(match[1, 1])) stop("Invalid A1 notation")

    # Extract parts of the cell
    fixed_col1 <- match[2]
    col1 <- match[3]
    fixed_row1 <- match[4]
    row1 <- as.numeric(match[5])

    fixed_col2 <- match[7]
    col2 <- match[8]
    fixed_row2 <- match[9]
    row2 <- as.numeric(match[10])

    if (is.na(col2)) {

      # Handle individual cell
      next_col <- if (fixed_col1 == "") int2col(col2int(col1) + cols) else col1
      next_row <- if (fixed_row1 == "") row1 + rows else row1
      z <- c(z, paste0(fixed_col1, next_col, fixed_row1, next_row))

    } else {

      # Handle range
      next_col1 <- if (fixed_col1 == "") int2col(col2int(col1) + cols) else col1
      next_row1 <- if (fixed_row1 == "") row1 + rows else row1
      next_col2 <- if (fixed_col2 == "") int2col(col2int(col2) + cols) else col2
      next_row2 <- if (fixed_col2 == "") row2 + rows else row2
      z <- c(z, paste0(fixed_col1, next_col1, fixed_row1, next_row1, ":", fixed_col2, next_col2, fixed_row2, next_row2))

    }
  }

  as.data.frame(matrix(z, ncol = 2))
}

# Define the replacement function
replace_a1_notation <- function(string, matches, replacements) {

  strings <- data.frame(rep(string, NROW(replacements)))

  matches <- as.data.frame(
    matrix(matches, nrow = NROW(replacements), ncol = length(matches), byrow = TRUE)
  )

  repl_fun <- function(str, x, y) {
    for (i in seq_along(x)) {
      str <- stringi::stri_replace_first_fixed(str, x[i], y[i])
    }
    str
  }

  z <- NULL
  for (i in seq_len(NROW(strings))) {
    z <- c(z, repl_fun(strings[i, ], matches[i, ], replacements[i, ]))
  }

  z
}

library(openxlsx2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

packageVersion("openxlsx2")
#> [1] '1.8.0.9000'
wb <- wb_workbook()$
  add_worksheet()$
  add_data(x = matrix(rnorm(5*5), ncol = 5, nrow = 5))$
  add_formula(x = "SUM($A2:A2) + B$1", dims = "A8:E12", shared = TRUE)

cc <- wb$worksheets[[1]]$sheet_data$cc

# Rows where condition_var is not empty
cc_shared <- cc %>%
  filter(f_t == "shared") %>%
  group_by(f_si) %>%
  mutate(f = na_if(f, "")) %>%
  tidyr::fill(f, .direction = "down") %>%
  ungroup() %>%
  mutate(f = tidyr::replace_na(f, ""))

string <- cc_shared$f[1]

# calculate differences
cc_shared$cols <- col2int(cc_shared$c_r)      - col2int(cc_shared$c_r[1])
cc_shared$rows <- as.integer(cc_shared$row_r) - as.integer(cc_shared$row_r[1])

matches       <- find_a1_notation(string)
replacements  <- next_cell(matches, cc_shared$cols, cc_shared$rows)
new_string    <- replace_a1_notation(string, matches, replacements)

cc_shared$f <- new_string
cc_shared$cols <- NULL
cc_shared$rows <- NULL

cc[match(cc_shared$r, cc$r), ] <- cc_shared

wb$worksheets[[1]]$sheet_data$cc <- cc

wb_to_df(wb, dims = "A8:C12", col_names = FALSE, show_formula = TRUE)
#>                    A                 B                 C
#> 8  SUM($A2:A2) + B$1 SUM($A2:B2) + C$1 SUM($A2:C2) + D$1
#> 9  SUM($A3:A3) + B$1 SUM($A3:B3) + C$1 SUM($A3:C3) + D$1
#> 10 SUM($A4:A4) + B$1 SUM($A4:B4) + C$1 SUM($A4:C4) + D$1
#> 11 SUM($A5:A5) + B$1 SUM($A5:B5) + C$1 SUM($A5:C5) + D$1
#> 12 SUM($A6:A6) + B$1 SUM($A6:B6) + C$1 SUM($A6:C6) + D$1
JanMarvin commented 5 days ago

Hi @d-lukasz , I have pushed a pr that should handle shared formulas. At least my basic tests show promising results. Since you were the one that requested this, it would be great if you had a moment to test this. Especially how it behaves with real files.