colearendt / xlsx

An R package to interact with Excel files using the Apache POI java library
https://colearendt.github.io/xlsx/
85 stars 32 forks source link

Questions About xls objects #160

Open colearendt opened 4 years ago

colearendt commented 4 years ago

@RE-Chief opening a separate thread to discuss since #159 is unrelated

Hi Cole,

I have older xls files that I am trying to read into R. I cannot read them into R using any other means so I am trying to use your read.xlsx2 to read them.

I get the following error which I do not understand are you able to help me out please?

Regards Charles

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 stream, nor an OOXML stream

colearendt commented 4 years ago

@RE-Chief I am not super familiar with .xls objects (as they are quite old), however, I believe read.xlsx, write.xlsx and friends are not going to be super helpful (as is indicated by their being named after the xlsx format). However, I know that Apache POI does support the XLS format (which it calls HSSF for "Horrible Spreadsheet Format" 😂 ).

There may be other libraries to work with this file format too.

You can interact with the documents directly using something like this (note I need to create one from scratch since I do not have any handy):

library(xlsx)

new_wb <- createWorkbook(type="xls")
new_wb
#> [1] "Java-Object{org.apache.poi.hssf.usermodel.HSSFWorkbook@65e98b1c}"

s1 <- createSheet(new_wb)
addDataFrame(iris, s1)

new_wb_path <- tempfile(fileext = ".xls")
saveWorkbook(new_wb, new_wb_path)

read_wb <- loadWorkbook(new_wb_path)
all_sheets <- getSheets(read_wb)

head(xlsx::readColumns(all_sheets[[1]], 1, 5, 1))
#>   X. Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1  1          5.1         3.5          1.4         0.2
#> 2  2          4.9         3.0          1.4         0.2
#> 3  3          4.7         3.2          1.3         0.2
#> 4  4          4.6         3.1          1.5         0.2
#> 5  5          5.0         3.6          1.4         0.2
#> 6  6          5.4         3.9          1.7         0.4

Created on 2020-04-19 by the reprex package (v0.3.0)

RE-Chief commented 4 years ago

Many thanks for your reply Cole. My xls files must be pre-97 as I ran the code you kindly provided and I was able to open the created xls file with either read.xlsx or read.xlsx2. I cannot open my xls files with either, yet they will open with excel so I know they are valid files. I can pay for some software to do this for me but the money being asked for as crazy dollars. Thanks again for your efforts, stay well.

Kind regards Charles

colearendt commented 4 years ago

@RE-Chief Did you try using loadWorkbook() as I showed above? Can you share the results of your output, the errors, etc.? My expectation is that loadWorkbook() along with readRows(), readColumns() or readRange() should work for .xls files. read.xlsx or read.xlsx2 will not work.

colearendt commented 4 years ago

Also, you might try readxl, which seems to support xls as well: https://readxl.tidyverse.org/

RE-Chief commented 4 years ago

Thanks Cole, I have tried readxl previously.

read_wb <- loadWorkbook("C:/Users/Aquatic/Desktop/EP Snowy Model/OutputMS1/1.xls") Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 stream, nor an OOXML stream

Kind regards Charles

colearendt commented 4 years ago

Interesting!! Well that is a different one! Any chance these files are misnamed and their encoding is actually some other type of file?

I.e. you could try opening the file in a text editor like Notepad or Wordpad. If it is a CSV file (a text file with common separated values), opening it with xlsx or readxl in this manner will fail, but it will open just fine in Excel.

RE-Chief commented 4 years ago

Hi Cole, I have attached an example file. It opens in notepad but as gobbledigook. It definitely opens in excel and I can save it as an xlsx file which will then open in R. But the whole purpose of this is to be able write a loop in R to open the xls files and save them as xlsx files.

1.zip

Kind regards Charles

AmmarAli92 commented 2 years ago

Hi Cole, I have attached an example file. It opens in notepad but as gobbledigook. It definitely opens in excel and I can save it as an xlsx file which will then open in R. But the whole purpose of this is to be able write a loop in R to open the xls files and save them as xlsx files.

1.zip

Kind regards Charles

Hi Chief, I'm facing the same problem with .XLS files that I'm trying to process in R, I also tried readxl, XLConnect, xlsx and got no chance to open them in R. As you said, I can open them manually in excel and save them in xlsx format which will open in R but I have too many files that I need to process. any chances you solved this problem?

Thank you.

RE-Chief commented 2 years ago

@AmmarAli92 this issue was not resolved through R. I did manage to find a solution using libreoffice. This problem has reared its head again for me and I am currently working on a better solution. It may be a couple of weeks before it is resolved but I will let you know if and when that happens.

Kind regards Charles

colearendt commented 2 years ago

Finally got around to digging into this a smidge. It looks like this particular .xls file example is "invalid" according to the (newer, at least) Apache POI library, so will not be parsable without some modifications.

I tested this by digging in a little bit lower level, and seeing that the binary data format is not what Apache POI is expecting. Docs here and here

library(xlsx)
library(rJava)
myf <- .jnew("java/io/File", normalizePath("~/Downloads/bad.xls"))
poif <- .jnew("org/apache/poi/poifs/filesystem/POIFSFileSystem", myf)
#> Error in .jnew("org/apache/poi/poifs/filesystem/POIFSFileSystem", myf): org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x000A000200040009, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document

Created on 2022-04-29 by the reprex package (v0.3.0)

Unfortunately, it looks like the mechanism that is producing your .xls files is either invalid or quite outdated (perhaps using OLE1?)

If the .xls file is produced by Excel, for instance, things seem to work just fine. (Looking at the binary diff, though - things do differ quite a bit)

library(xlsx)
loadWorkbook("~/Documents/bad2.xls")
#> [1] "Java-Object{org.apache.poi.hssf.usermodel.HSSFWorkbook@1951b871}"

Created on 2022-04-29 by the reprex package (v0.3.0)

Some possible solutions I unfortunately don't have the time to dig into directly:

I hope that helps!! Sorry for the trouble!

jiangfenglyu commented 1 year ago

Hi RE-Chief I'm working on xls file which is not produced by EXCEL, and also have the same problem, same error code, and try readxl Package which isn't working.

And I ressolve this problem, here is my analysis: Issue Analysis: when xls file is produced by other software which is not by EXCEL, it will occur message box and tell you file format and extension does not match, file maybe corrupted, and so on I think when R package,like readxl, xlsx , processing the xls file, it can't processing this message box. Solution: So I try another way First, you need install RDCOMClient, here is link https://stackoverflow.com/questions/73190036/how-to-convert-xls-to-xlsx-using-r and use function convert_XLS_File_To_XLSX as literally Attention: two parameters, the former one is old file,like "C:\oldfile.xls", the latter one like "C\newfile.xlsx" Second use read_xlsx or read.xlsx or readxl functions and it worked.

Thks Emmanuel Hamel

RE-Chief commented 1 year ago

Many thanks Emmanuel, I will check it out.

Kind regards Charles

On Thu, Nov 16, 2023 at 2:13 PM jiangfenglyu @.***> wrote:

Hi RE-Chief https://github.com/RE-Chief I'm working on xls file which is not produced by EXCEL, and also have the same problem, same error code, and try readxl Package which isn't working.

And I ressolve this problem, here is my analysis: Issue Analysis: when xls file is produced by other software which is not by EXCEL, it will occur message box and tell you file format and extension does not match, file maybe corrupted, and so on I think when R package,like readxl, xlsx , processing the xls file, it can't processing this message box. Solution: So I try another way First, you need install RDCOMClient, here is link https://stackoverflow.com/questions/73190036/how-to-convert-xls-to-xlsx-using-r and use function convert_XLS_File_To_XLSX as literally Attention: two parameters, the former one is old file,like "C:\oldfile.xls", the latter one like "C\newfile.xlsx" Second use read_xlsx or read.xlsx or readxl functions and it worked.

Thks Emmanuel Hamel https://stackoverflow.com/users/9245623/emmanuel-hamel

— Reply to this email directly, view it on GitHub https://github.com/colearendt/xlsx/issues/160#issuecomment-1813724033, or unsubscribe https://github.com/notifications/unsubscribe-auth/APHQEQ4KBJBVPGFPKRSSIEDYEWAFNAVCNFSM4MMBVS52U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCOBRGM3TENBQGMZQ . You are receiving this because you were mentioned.Message ID: @.***>