awalker89 / openxlsx

R package for .xlsx file reading and writing.
Other
364 stars 79 forks source link

read.xlsx() returns wrong sheet if sheet is specified by name and workbook contains veryHidden sheets #434

Open heseber opened 5 years ago

heseber commented 5 years ago

Expected Behavior

The read.xlsx() call should return the correct sheet if the sheet is specified by the sheet name, irrespective of whether or not the workbook contains veryHidden sheets.

Actual Behavior

If the workbook contains veryHidden sheets, the names of such sheets are included in the list of sheet names returned by getSheetNames(). If read.xlsx() is used to request a sheet by name and if this sheet occurs after a veryHidden sheet in the workbook, then the sheet index is off by the number of state="veryHidden" sheets preceding the requested sheet in the workbook, and the wrong sheet is returned.

Steps to Reproduce the Problem

Use any file with a veryHidden sheet and request a sheet (by name) following that veryHidden sheet.

Example:

<sheet name="modMain" sheetId="18" state="veryHidden" r:id=""/>

For proprietary reasons, I cannot provide the file that triggered this issue.

Versions

Patch

diff --git a/R/wrappers.R b/R/wrappers.R
index 2aa7c0a..e4a6917 100644
--- a/R/wrappers.R
+++ b/R/wrappers.R
@@ -3061,7 +3061,9 @@ getSheetNames <- function(file){
   workbook <- xmlFiles[grepl("workbook.xml$", xmlFiles, perl = TRUE)]
   workbook <- readLines(workbook, warn=FALSE, encoding="UTF-8")
   workbook <-  removeHeadTag(workbook)
-  sheets <- unlist(regmatches(workbook, gregexpr("<sheet .*/sheets>", workbook, perl = TRUE)))
+  sheets <- unlist(regmatches(workbook, gregexpr("(?<=<sheets>).*(?=</sheets>)", workbook, perl = TRUE)))
+  sheets <- unlist(regmatches(sheets, gregexpr("<sheet[^>]*>", sheets, perl=TRUE)))
+  sheets <- grep("state=\"veryHidden\"", sheets, invert = TRUE, value = TRUE)
   sheetNames <- unlist(regmatches(sheets, gregexpr('(?<=name=")[^"]+', sheets, perl = TRUE)))
   sheetNames <- replaceXMLEntities(sheetNames)
heseber commented 5 years ago

It seems that the patch above fixes getSheetNames(), but it does not fix read.xlsx(), so another fix is needed:

diff --git a/R/readWorkbook.R b/R/readWorkbook.R
index 9f0dbde..bfd04b3 100644
--- a/R/readWorkbook.R
+++ b/R/readWorkbook.R
@@ -153,12 +153,15 @@ read.xlsx.default <- function(xlsxFile,

   workbook <- unlist(readLines(workbook, warn = FALSE, encoding = "UTF-8"))
   workbook <- removeHeadTag(workbook)
-  sheets <- unlist(regmatches(workbook, gregexpr("<sheet .*/sheets>", workbook, perl = TRUE)))
+  sheets <- unlist(regmatches(workbook, gregexpr("(?<=<sheets>).*(?=</sheets>)", workbook, perl = TRUE)))
+  sheets <- unlist(regmatches(sheets, gregexpr("<sheet[^>]*>", sheets, perl=TRUE)))
+  sheets <- grep("state=\"veryHidden\"", sheets, invert = TRUE, value = TRUE)

   ## make sure sheetId is 1 based
   sheetrId <- unlist(getRId(sheets))
   sheetNames <- unlist(regmatches(sheets, gregexpr('(?<=name=")[^"]+', sheets, perl = TRUE)))
-
+  sheetNames <- replaceXMLEntities(sheetNames)
+
   nSheets <- length(sheetrId)
   if(nSheets == 0)
     stop("Workbook has no worksheets")
heseber commented 5 years ago

I created a pull request for a fix of this issue (#436). The fix works slightly differently than the patches above - it does not remove "veryHidden" sheets but instead sheets with an empty "r:id".