Reading a large Excel file resulted in an error about number of records

jdunkerley commented 1 year ago

Possibly an issue with the underlying Excel reading library, but reading a large data file reported an error about 100,000,000 entries being exceeded.

Data file: https://www.compare-school-performance.service.gov.uk/download-data?download=true&regions=0&filters=KS4PROV&fileformat=xls&year=2021-2022&meta=false

radeusgd commented 1 year ago

I tried this in the REPL and I was able to successfully load the file:

> f
>>> (File ..\download\2021-2022_england_ks4provisional.xlsx)
> w = f.read
<interactive_source>:1:1: warning: Unused variable w.
    1 | w = f.read
      | ^
>>> Nothing
> w
>>> (Excel_Workbook.Value Name: /xl/workbook.xml - Content Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml (File ..\download\2021-2022_england_ks4provisional.xlsx) False Infer)
> w.sheets
Evaluation failed with: Method `sheets` of type Excel_Workbook could not be found.
java.lang.Exception: Method `sheets` of type Excel_Workbook could not be found.
        at <enso>.<eval>(Unknown Source)
        at <enso>.Debug.breakpoint(Unknown Source)
        at <enso>.Rrepl::Rrepl::main(Rrepl.enso:44)
        at <unknown>.org.graalvm.polyglot.Value<Function>.execute(Unknown Source)

> w.sheet_names
>>> ['england']
> s = w.read "england"
<interactive_source>:1:1: warning: Unused variable s.
    1 | s = w.read "england"
      | ^
>>> Nothing
> s
>>> (Table.Value org.enso.table.data.table.Table@aca594d)
> s.info.print
   | Column     | Items Count | Value Type
---+------------+-------------+---------------------
 0 | RECTYPE    | 5763        | (Integer 64 bits)
 1 | LEA        | 5763        | Mixed
 2 | ESTAB      | 5763        | Mixed
 3 | URN        | 5763        | Mixed
 4 | SCHNAME    | 5763        | (Char Nothing True)
 5 | SCHNAME_AC | 5763        | (Char Nothing True)
 6 | ADDRESS1   | 5763        | Mixed
 7 | ADDRESS2   | 5763        | (Char Nothing True)
 8 | ADDRESS3   | 5763        | (Char Nothing True)
 9 | TOWN       | 5763        | (Char Nothing True)
… and 504 hidden rows.

radeusgd commented 1 year ago

@jdunkerley what exact steps were you taking? Do you have the error message somewhere?

I guess I will try once more in the GUI to check this.

radeusgd commented 1 year ago

I've also tried this in an IDE built from the latest develop.

It seems to load fine without any issues. The read node briefly had a Panic type, but I think that was intermittent: after reopening the vis and waiting a bit, everything loaded fine. It takes a significant amount of time (quite a bit for just a 10MB file), but nothing unreasonable.

radeusgd commented 12 months ago

I was able to reproduce the issue with the other file (2018-2019_england_ks4final.xlsx) provided by @jdunkerley

I'm looking for a workaround. There seem to be some promising possibilities.

radeusgd commented 12 months ago

from Standard.Base import all
from Standard.Table import all

polyglot java import org.apache.poi.xssf.usermodel.XSSFWorkbook
polyglot java import java.io.File as Java_File
polyglot java import java.io.FileInputStream as Java_FileInputStream

main =
    f = File.new "C:\NBO\download\2018-2019_england_ks4final.xlsx"
    path = f.normalize.path
    IO.println path

    jf = Java_File.new path
    wb = XSSFWorkbook.new jf
    IO.println wb
    IO.println wb.getNumberOfSheets

    IO.println (XSSFWorkbook.new (Java_FileInputStream.new jf))

This shows me that the issue occurs when loading from an input stream:

C:\NBO\download\2018-2019_england_ks4final.xlsx
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Name: /xl/workbook.xml - Content Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml
1
Execution finished with an error: Tried to read data but the maximum length for this record type is 100,000,000.
If the file is not corrupt and not large, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
You can set a higher override value with IOUtils.setByteArrayMaxOverride()
        at <java> org.apache.poi.util.IOUtils.throwRecordTruncationException(IOUtils.java:607)
        at <java> org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:249)
        at <java> org.apache.poi.util.IOUtils.toByteArrayWithMaxLength(IOUtils.java:220)
        at <java> org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:81)
        at <java> org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
        at <java> org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
        at <java> org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:319)
        at <java> org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:59)
        at <java> org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:290)
        at <java> org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:286)
        at <enso> excel-tests.main<arg-1>(excel-tests.enso:18:17-62)
        at <enso> excel-tests.main(excel-tests.enso:18:5-63)

When loading from a Java File, it all seems to work fine, and it is more memory efficient.
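
For reference, the two possible workarounds suggested by the output above could be sketched roughly as follows. This is only a sketch, not a tested fix: the override value of 500000000 is an arbitrary number picked for illustration, and I am assuming that raising the limit is enough for the stream-based constructor to get past this particular check (it will still buffer the whole entry in memory, so the File-based variant looks preferable).

```
from Standard.Base import all

polyglot java import org.apache.poi.util.IOUtils
polyglot java import org.apache.poi.xssf.usermodel.XSSFWorkbook
polyglot java import java.io.File as Java_File
polyglot java import java.io.FileInputStream as Java_FileInputStream

main =
    f = File.new "C:\NBO\download\2018-2019_england_ks4final.xlsx"
    jf = Java_File.new f.normalize.path

    # Workaround 1: open the workbook from a Java File rather than a stream.
    # This is the variant that worked in the experiment above.
    wb = XSSFWorkbook.new jf
    IO.println wb.getNumberOfSheets

    # Workaround 2 (assumption: untested here): raise the POI byte array
    # limit before reading from a stream, as the error message suggests.
    # 500000000 is an arbitrary illustrative value.
    IOUtils.setByteArrayMaxOverride 500000000
    wb2 = XSSFWorkbook.new (Java_FileInputStream.new jf)
    IO.println wb2.getNumberOfSheets
```

Note that `IOUtils.setByteArrayMaxOverride` is a global static setting in POI, so the second variant would affect every subsequent read in the same JVM.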