enso-org / enso

Enso Analytics is a self-service data prep and analysis platform designed for data teams.
https://ensoanalytics.com
Apache License 2.0
7.38k stars 323 forks source link

Reading a large Excel file resulted in an error about number of records #8111

Closed jdunkerley closed 11 months ago

jdunkerley commented 1 year ago

Possibly and issue with the underlying Excel reading library but when reading a large data file reported an error about 100,000,000 entries exceeded,

Data file: https://www.compare-school-performance.service.gov.uk/download-data?download=true&regions=0&filters=KS4PROV&fileformat=xls&year=2021-2022&meta=false

radeusgd commented 1 year ago

I tried this in the REPL and I was able to successfully load the file:

> f
>>> (File ..\download\2021-2022_england_ks4provisional.xlsx)
> w = f.read
<interactive_source>:1:1: warning: Unused variable w.
    1 | w = f.read
      | ^
>>> Nothing
> w
>>> (Excel_Workbook.Value Name: /xl/workbook.xml - Content Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml (File ..\download\2021-2022_england_ks4provisional.xlsx) False Infer)
> w.sheets
Evaluation failed with: Method `sheets` of type Excel_Workbook could not be found.
java.lang.Exception: Method `sheets` of type Excel_Workbook could not be found.
        at <enso>.<eval>(Unknown Source)
        at <enso>.Debug.breakpoint(Unknown Source)
        at <enso>.Rrepl::Rrepl::main(Rrepl.enso:44)
        at <unknown>.org.graalvm.polyglot.Value<Function>.execute(Unknown Source)

> w.sheet_names
>>> ['england']
> s = w.read "england"
<interactive_source>:1:1: warning: Unused variable s.
    1 | s = w.read "england"
      | ^
>>> Nothing
> s
>>> (Table.Value org.enso.table.data.table.Table@aca594d)
> s.info.print
   | Column     | Items Count | Value Type
---+------------+-------------+---------------------
 0 | RECTYPE    | 5763        | (Integer 64 bits)
 1 | LEA        | 5763        | Mixed
 2 | ESTAB      | 5763        | Mixed
 3 | URN        | 5763        | Mixed
 4 | SCHNAME    | 5763        | (Char Nothing True)
 5 | SCHNAME_AC | 5763        | (Char Nothing True)
 6 | ADDRESS1   | 5763        | Mixed
 7 | ADDRESS2   | 5763        | (Char Nothing True)
 8 | ADDRESS3   | 5763        | (Char Nothing True)
 9 | TOWN       | 5763        | (Char Nothing True)
� and 504 hidden rows.
radeusgd commented 1 year ago

@jdunkerley what were the exact steps that you were taking? Do you have somewhere the error message?

I guess I will try once more in the GUI to check this.

radeusgd commented 1 year ago

I've also tried to do this on IDE built from latest develop.

image

It seems to load all fine without any issues. The read node briefly had Panic type, but I think it was some intermittent thing - after reopening the vis and waiting a bit all loaded fine. It takes some significant time (quite bit for just a 10MB file), but nothing unreasonable.

radeusgd commented 12 months ago

I was able to reproduce the issue with the other file (2018-2019_england_ks4final.xlsx) provided by @jdunkerley

I'm looking for a workaround. There seems to be some promising possibilities.

radeusgd commented 12 months ago
from Standard.Base import all
from Standard.Table import all

polyglot java import org.apache.poi.xssf.usermodel.XSSFWorkbook
polyglot java import java.io.File as Java_File
polyglot java import java.io.FileInputStream as Java_FileInputStream

main =
    f = File.new "C:\NBO\download\2018-2019_england_ks4final.xlsx"
    path = f.normalize.path
    IO.println path

    jf = Java_File.new path
    wb = XSSFWorkbook.new jf
    IO.println wb
    IO.println wb.getNumberOfSheets

    IO.println (XSSFWorkbook.new (Java_FileInputStream.new jf))

shows me that the issue is when loading from an input stream:

C:\NBO\download\2018-2019_england_ks4final.xlsx
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Name: /xl/workbook.xml - Content Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml
1
Execution finished with an error: Tried to read data but the maximum length for this record type is 100,000,000.
If the file is not corrupt and not large, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
You can set a higher override value with IOUtils.setByteArrayMaxOverride()
        at <java> org.apache.poi.util.IOUtils.throwRecordTruncationException(IOUtils.java:607)
        at <java> org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:249)
        at <java> org.apache.poi.util.IOUtils.toByteArrayWithMaxLength(IOUtils.java:220)
        at <java> org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:81)
        at <java> org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
        at <java> org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
        at <java> org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:319)
        at <java> org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:59)
        at <java> org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:290)
        at <java> org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:286)
        at <enso> excel-tests.main<arg-1>(excel-tests.enso:18:17-62)
        at <enso> excel-tests.main(excel-tests.enso:18:5-63)

when loading from a Java File it seems to work all fine - and is more memory efficient.

enso-bot[bot] commented 11 months ago

Radosław Waśko reports a new STANDUP for yesterday (2023-11-21):

Progress: Implementing the workbook workaround - accessing via File and not Input Stream to be more memory-efficient. Ensuring we close unused workbooks to free up resources. Designing how to materialize a stream into a file in a reasonable way. It should be finished by 2023-11-24.

Next Day: Next day I will be working on the same task. Continue.

enso-bot[bot] commented 11 months ago

Radosław Waśko reports a new STANDUP for yesterday (2023-11-22):

Progress: Approaching the problem of concurrently opening workbooks for read/write by keeping a pool of 'connections'. Integrating that with existing reader/writer. It should be finished by 2023-11-24.

Next Day: Next day I will be working on the same task. Ensure all tests pass. Add tests for streaming. Support for temporary file from a stream?

enso-bot[bot] commented 11 months ago

Radosław Waśko reports a new STANDUP for yesterday (2023-11-23):

Progress: Still working on Excel read/write. Reading works again, some writing tests passing, some still need some adaptation. It should be finished by 2023-11-24.

Next Day: Next day I will be working on the same task. Ensure all tests pass. Add tests for streaming. Support for temporary file from a stream?

enso-bot[bot] commented 11 months ago

Radosław Waśko reports a new 🔴 DELAY for the provided date (2023-11-24):

Summary: There is 5 days delay in implementation of the Reading a large Excel file resulted in an error about number of records (#8111) task. It will cause 3 days delay for the delivery of this weekly plan.

Delay Cause: The Apache POI library had some weird behaviours and so I had to redesign the approach to handling writing Excel files several times before I got it working the way we need. Still more time needed to implement stream support nicely.

enso-bot[bot] commented 11 months ago

Radosław Waśko reports a new STANDUP for the provided date (2023-11-24):

Progress: Got Excel writing working again after trying out several approaches that had their issues. Still some tests need to be fixed. It should be finished by 2023-11-29.

Next Day: Next day I will be working on the same task. Fix the remaining tests. Add tests for streaming. Support for temporary file from a stream?

enso-bot[bot] commented 11 months ago

Radosław Waśko reports a new STANDUP for yesterday (2023-11-27):

Progress: Got most tests working. Reported a bug found along the way. Added an additional test after discussion with James, testing the dry run thread-thru. One remaining write test to fix. It should be finished by 2023-11-29.

Next Day: Next day I will be working on the same task. Fix that one test. Add tests for streaming. Support for temporary file from a stream?

enso-bot[bot] commented 11 months ago

Radosław Waśko reports a new STANDUP for today (2023-11-28):

Progress: All write tests are working. Reworked handling of dry run in Table.write. Added a few more tests for various formats. Added ability to serve files on our httpbin server (a bit of experiments before the S3 mocks, but also useful here). Added tests for fetching and decoding Table from HTTP. Fetching xlsx works. xls detection does not. It should be finished by 2023-11-29.

Next Day: Next day I will be working on the same task. Implement detection of xls based on mime type or URI if available. See how we could implement destroying the temporary file related to the stream once workbook is disposed. See if we can try re-using the temp file from HTTP request materialization too (considering if this could become a followup ticket though).

enso-bot[bot] commented 11 months ago

Radosław Waśko reports a new STANDUP for yesterday (2023-11-29):

Progress: Implemented infernce of xls format based on optional stream metadata. Implemented prototype for Temporary_Facility. It should be finished by 2023-11-29.

Next Day: Next day I will be working on the same task. Test the temporary file facility. Get the PR ready for review.

enso-bot[bot] commented 11 months ago

Radosław Waśko reports a new 🔴 DELAY for yesterday (2023-11-29):

Summary: There is 1 days delay in implementation of the Reading a large Excel file resulted in an error about number of records (#8111) task. It will cause 1 days delay for the delivery of this weekly plan.

Delay Cause: Unexpected small feature that seemed worth doing as part of the PR - inferring xls format in stream mode (I thought it should work, but it was working only in some cases - now it should always work). Time needed to finish and test better handling of temporary files that avoids unnecessary duplication and disk space usage.

enso-bot[bot] commented 11 months ago

Radosław Waśko reports a new 🔴 DELAY for yesterday (2023-11-30):

Summary: There is 1 days delay in implementation of the Reading a large Excel file resulted in an error about number of records (#8111) task. It will cause 1 days delay for the delivery of this weekly plan.

Delay Cause: Unexpected test failures

enso-bot[bot] commented 11 months ago

Radosław Waśko reports a new STANDUP for yesterday (2023-11-30):

Progress: PR is feature ready. Added tests. Fixing tests issues. Split of the Test.group nesting into a separate task because it was more problematic than I thought initially (#8430). Investigating some weird unexpected test failure. It should be finished by 2023-12-01.

Next Day: Next day I will be working on the same task. Fix problematic test failures.

enso-bot[bot] commented 11 months ago

Radosław Waśko reports a new STANDUP for the provided date (2023-12-01):

Progress: Fixed CI issues - CI is now building - PR ready and only waiting for review. In the meantime, solved the Test.group nesting issue by allowing the nesting in a simple way. It should be finished by 2023-12-01.

Next Day: Next day I will be working on the #8352 task. Start work on cloud API followups