Closed jdunkerley closed 11 months ago
I tried this in the REPL and I was able to successfully load the file:
> f
>>> (File ..\download\2021-2022_england_ks4provisional.xlsx)
> w = f.read
<interactive_source>:1:1: warning: Unused variable w.
1 | w = f.read
| ^
>>> Nothing
> w
>>> (Excel_Workbook.Value Name: /xl/workbook.xml - Content Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml (File ..\download\2021-2022_england_ks4provisional.xlsx) False Infer)
> w.sheets
Evaluation failed with: Method `sheets` of type Excel_Workbook could not be found.
java.lang.Exception: Method `sheets` of type Excel_Workbook could not be found.
at <enso>.<eval>(Unknown Source)
at <enso>.Debug.breakpoint(Unknown Source)
at <enso>.Rrepl::Rrepl::main(Rrepl.enso:44)
at <unknown>.org.graalvm.polyglot.Value<Function>.execute(Unknown Source)
> w.sheet_names
>>> ['england']
> s = w.read "england"
<interactive_source>:1:1: warning: Unused variable s.
1 | s = w.read "england"
| ^
>>> Nothing
> s
>>> (Table.Value org.enso.table.data.table.Table@aca594d)
> s.info.print
| Column | Items Count | Value Type
---+------------+-------------+---------------------
0 | RECTYPE | 5763 | (Integer 64 bits)
1 | LEA | 5763 | Mixed
2 | ESTAB | 5763 | Mixed
3 | URN | 5763 | Mixed
4 | SCHNAME | 5763 | (Char Nothing True)
5 | SCHNAME_AC | 5763 | (Char Nothing True)
6 | ADDRESS1 | 5763 | Mixed
7 | ADDRESS2 | 5763 | (Char Nothing True)
8 | ADDRESS3 | 5763 | (Char Nothing True)
9 | TOWN | 5763 | (Char Nothing True)
� and 504 hidden rows.
@jdunkerley what were the exact steps that you were taking? Do you have somewhere the error message?
I guess I will try once more in the GUI to check this.
I've also tried to do this on IDE built from latest develop.
It seems to load all fine without any issues. The read
node briefly had Panic
type, but I think it was some intermittent thing - after reopening the vis and waiting a bit all loaded fine. It takes some significant time (quite bit for just a 10MB file), but nothing unreasonable.
I was able to reproduce the issue with the other file (2018-2019_england_ks4final.xlsx
) provided by @jdunkerley
I'm looking for a workaround. There seems to be some promising possibilities.
from Standard.Base import all
from Standard.Table import all
polyglot java import org.apache.poi.xssf.usermodel.XSSFWorkbook
polyglot java import java.io.File as Java_File
polyglot java import java.io.FileInputStream as Java_FileInputStream
main =
f = File.new "C:\NBO\download\2018-2019_england_ks4final.xlsx"
path = f.normalize.path
IO.println path
jf = Java_File.new path
wb = XSSFWorkbook.new jf
IO.println wb
IO.println wb.getNumberOfSheets
IO.println (XSSFWorkbook.new (Java_FileInputStream.new jf))
shows me that the issue is when loading from an input stream:
C:\NBO\download\2018-2019_england_ks4final.xlsx
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Name: /xl/workbook.xml - Content Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml
1
Execution finished with an error: Tried to read data but the maximum length for this record type is 100,000,000.
If the file is not corrupt and not large, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
You can set a higher override value with IOUtils.setByteArrayMaxOverride()
at <java> org.apache.poi.util.IOUtils.throwRecordTruncationException(IOUtils.java:607)
at <java> org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:249)
at <java> org.apache.poi.util.IOUtils.toByteArrayWithMaxLength(IOUtils.java:220)
at <java> org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:81)
at <java> org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
at <java> org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
at <java> org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:319)
at <java> org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:59)
at <java> org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:290)
at <java> org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:286)
at <enso> excel-tests.main<arg-1>(excel-tests.enso:18:17-62)
at <enso> excel-tests.main(excel-tests.enso:18:5-63)
when loading from a Java File
it seems to work all fine - and is more memory efficient.
Radosław Waśko reports a new STANDUP for yesterday (2023-11-21):
Progress: Implementing the workbook workaround - accessing via File and not Input Stream to be more memory-efficient. Ensuring we close unused workbooks to free up resources. Designing how to materialize a stream into a file in a reasonable way. It should be finished by 2023-11-24.
Next Day: Next day I will be working on the same task. Continue.
Radosław Waśko reports a new STANDUP for yesterday (2023-11-22):
Progress: Approaching the problem of concurrently opening workbooks for read/write by keeping a pool of 'connections'. Integrating that with existing reader/writer. It should be finished by 2023-11-24.
Next Day: Next day I will be working on the same task. Ensure all tests pass. Add tests for streaming. Support for temporary file from a stream?
Radosław Waśko reports a new STANDUP for yesterday (2023-11-23):
Progress: Still working on Excel read/write. Reading works again, some writing tests passing, some still need some adaptation. It should be finished by 2023-11-24.
Next Day: Next day I will be working on the same task. Ensure all tests pass. Add tests for streaming. Support for temporary file from a stream?
Radosław Waśko reports a new 🔴 DELAY for the provided date (2023-11-24):
Summary: There is 5 days delay in implementation of the Reading a large Excel file resulted in an error about number of records (#8111) task. It will cause 3 days delay for the delivery of this weekly plan.
Delay Cause: The Apache POI library had some weird behaviours and so I had to redesign the approach to handling writing Excel files several times before I got it working the way we need. Still more time needed to implement stream support nicely.
Radosław Waśko reports a new STANDUP for the provided date (2023-11-24):
Progress: Got Excel writing working again after trying out several approaches that had their issues. Still some tests need to be fixed. It should be finished by 2023-11-29.
Next Day: Next day I will be working on the same task. Fix the remaining tests. Add tests for streaming. Support for temporary file from a stream?
Radosław Waśko reports a new STANDUP for yesterday (2023-11-27):
Progress: Got most tests working. Reported a bug found along the way. Added an additional test after discussion with James, testing the dry run thread-thru. One remaining write test to fix. It should be finished by 2023-11-29.
Next Day: Next day I will be working on the same task. Fix that one test. Add tests for streaming. Support for temporary file from a stream?
Radosław Waśko reports a new STANDUP for today (2023-11-28):
Progress: All write tests are working. Reworked handling of dry run in Table.write. Added a few more tests for various formats. Added ability to serve files on our httpbin server (a bit of experiments before the S3 mocks, but also useful here). Added tests for fetching and decoding Table from HTTP. Fetching xlsx works. xls detection does not. It should be finished by 2023-11-29.
Next Day: Next day I will be working on the same task. Implement detection of xls based on mime type or URI if available. See how we could implement destroying the temporary file related to the stream once workbook is disposed. See if we can try re-using the temp file from HTTP request materialization too (considering if this could become a followup ticket though).
Radosław Waśko reports a new STANDUP for yesterday (2023-11-29):
Progress: Implemented infernce of xls format based on optional stream metadata. Implemented prototype for Temporary_Facility. It should be finished by 2023-11-29.
Next Day: Next day I will be working on the same task. Test the temporary file facility. Get the PR ready for review.
Radosław Waśko reports a new 🔴 DELAY for yesterday (2023-11-29):
Summary: There is 1 days delay in implementation of the Reading a large Excel file resulted in an error about number of records (#8111) task. It will cause 1 days delay for the delivery of this weekly plan.
Delay Cause: Unexpected small feature that seemed worth doing as part of the PR - inferring xls format in stream mode (I thought it should work, but it was working only in some cases - now it should always work). Time needed to finish and test better handling of temporary files that avoids unnecessary duplication and disk space usage.
Radosław Waśko reports a new 🔴 DELAY for yesterday (2023-11-30):
Summary: There is 1 days delay in implementation of the Reading a large Excel file resulted in an error about number of records (#8111) task. It will cause 1 days delay for the delivery of this weekly plan.
Delay Cause: Unexpected test failures
Radosław Waśko reports a new STANDUP for yesterday (2023-11-30):
Progress: PR is feature ready. Added tests. Fixing tests issues. Split of the Test.group nesting into a separate task because it was more problematic than I thought initially (#8430). Investigating some weird unexpected test failure. It should be finished by 2023-12-01.
Next Day: Next day I will be working on the same task. Fix problematic test failures.
Radosław Waśko reports a new STANDUP for the provided date (2023-12-01):
Progress: Fixed CI issues - CI is now building - PR ready and only waiting for review. In the meantime, solved the Test.group nesting issue by allowing the nesting in a simple way. It should be finished by 2023-12-01.
Next Day: Next day I will be working on the #8352 task. Start work on cloud API followups
Possibly and issue with the underlying Excel reading library but when reading a large data file reported an error about
100,000,000
entries exceeded,Data file: https://www.compare-school-performance.service.gov.uk/download-data?download=true®ions=0&filters=KS4PROV&fileformat=xls&year=2021-2022&meta=false