[discussion] lazy load spreadsheet feature(s) for speed and memory footprint?

MathNya / umya-spreadsheet

A pure rust library for reading and writing spreadsheet files

MIT License

296 stars 47 forks source link

[discussion] lazy load spreadsheet feature(s) for speed and memory footprint? #158

Open agentjill opened 11 months ago

agentjill commented 11 months ago

As much as I have seen I think that the entire Excel file props are loaded into memory to read and write. To optimize performance, I suggest selectively loading and manipulating data in memory rather than the entire dataset. This could boost efficiency, especially for larger files and scenarios where only specific data is modified. I think most of the workloads are based on using a limited feature set anyway.

I'd appreciate a good discussion on this proposal.

schungx commented 11 months ago

For systems that support it (I suppose systems that work with Excel files should), the entire file can simply be mem-mapped into memory and accessed like it has been loaded. That might be easier to handle than to convert into a stream-based API...

agentjill commented 11 months ago

@schungx you raise a very valid point, but i think this library is not implemented in that way. The spreadsheet is decoded into memory props which is then manipulated as needed. And on write, the props stored are written to file.

schungx commented 11 months ago

True. I am not knowledgeable enough about the Excel file format to know how it is done, except that it is XML-based and zipped (probably compressed). With that, it may not be possible to stream it without reconstructing the whole thing in memory first, because essential data structures may reside at the very end...

agentjill commented 11 months ago

@schungx Yes but take in the use case of just reading the excel range into memory as vec of strings. For that it would be waste of resource to load in the styles, protections, tables etc. of that cell in the memory as it was not requested by the user. Also, the said are located in another xml file. Therefore, just retrieving the data by decoding in time is better for memory in my opinion. Entire loading is also a use case which user may specify for repeated file editing.

schungx commented 11 months ago

Perhaps flags to control what info should be loaded. That would give more flexibility.

MathNya commented 11 months ago

It implements lazy_read(path) in case a large file is to be edited. This implementation allows the user to specify when to deserialize on a per-sheet basis.

// Read raw data (not deserialized)
let path = std::path::Path::new("./tests/test_files/aaa.xlsx");
let mut book = umya_spreadsheet::reader::xlsx::lazy_read(path).unwrap();

// Deserialize the first sheet
let mut sheet = book.get_sheet_mut(&0).unwrap();

A first stepping stone could be, for example, the following renovations

reading of Excel files is done by lazy_read(path).
The system automatically deserializes each portion when needed.
a method to deserialize all parts is also provided.

agentjill commented 11 months ago

@MathNya, i felt lazy_read(path) is lacking sufficient documentation for exact use case. I was using book.get_sheet(&0). unwrap() and getting error.

Also, once the sheet is loaded that would mean things like table, protection etc are also loaded into memory props. I was meaning for that to be loaded lazily.

I am open for discussion on this issue. Thanks for responding

schungx commented 11 months ago

Or some form of a builder API:

let excel_reader = ExcelReaderBuilder::new().with_values().with_formulas().with_pictures().build();

// The following reads the Excel file Book.xlsx, but only with values, formulas and pictures.
// Other stuff like formatting, drawings, etc. are skipped (i.e. default values).
// It would be like the sheet is copied and "Paste-Special" onto a blank new sheet with only values and formulas.
let file = excel_reader.read("Book.xlsx").unwrap();

MathNya commented 11 months ago

@agentjill You are right. lazy_read(path) is a bit quirky. I have felt the need to improve the documentation.

Also, once the sheet is loaded that would mean things like table, protection etc are also loaded into memory props. I was meaning for that to be loaded lazily.

It is wonderful. I agree with the lazy load subdivision.

mjhmjhgkmj commented 7 months ago

[You can] use the parallel programming, Luke! worksheets are almost independent

pub fn read_reader<R: io::Read + io::Seek>( reader: R, with_sheetread: bool, ) -> Result<Spreadsheet, XlsxError> { ... for (, type_value, rel_target) in &workbook_rel { // can it be parallel? ... }

MathNya commented 7 months ago

@mjhmjhgkmj Very nice idea. The sections you mentioned can be processed in parallel. (Perhaps there are other areas where parallel processing is possible.)