agentjill opened 11 months ago
For systems that support it (and I suppose systems that work with Excel files do), the entire file can simply be memory-mapped and then accessed as if it had been loaded. That might be easier to handle than converting to a stream-based API...
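A minimal sketch of that idea using the memmap2 crate (not something this library does today); the mapped bytes can be handed to any parser that accepts a byte slice:

```rust
use std::fs::File;
use std::io::Cursor;

use memmap2::Mmap;

fn main() -> std::io::Result<()> {
    let file = File::open("Book.xlsx")?;
    // SAFETY: the mapping is only valid as long as no other process
    // truncates or rewrites the file underneath us.
    let mmap = unsafe { Mmap::map(&file)? };
    // The OS pages bytes in on demand; nothing is copied up front.
    // A Cursor over the mapped slice satisfies Read + Seek, so it can
    // be fed to any parser that takes a generic reader.
    let reader = Cursor::new(&mmap[..]);
    let _ = reader; // hand off to an xlsx parser here
    Ok(())
}
```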
@schungx you raise a very valid point, but I don't think this library is implemented that way. The spreadsheet is decoded into in-memory properties, which are then manipulated as needed; on write, the stored properties are written back to the file.
True. I am not knowledgeable enough about the Excel file format to know how it is done, beyond the fact that it is XML-based and zipped (probably compressed). Given that, it may not be possible to stream it without reconstructing the whole thing in memory first, because essential data structures sit at the very end of the file: a ZIP archive's central directory, the index of all its entries, is written last...
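For reference, the zip crate makes this constraint explicit: `ZipArchive` requires a reader that is `Read + Seek`, precisely because the entry index is at the end of the archive. A small sketch:

```rust
use std::fs::File;
use std::io::BufReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // ZipArchive::new seeks to the END of the file to find the central
    // directory before any entry can be read, so a pure forward stream
    // is not enough.
    let file = File::open("Book.xlsx")?;
    let mut archive = zip::ZipArchive::new(BufReader::new(file))?;
    for i in 0..archive.len() {
        let entry = archive.by_index(i)?;
        println!("{}", entry.name()); // e.g. xl/worksheets/sheet1.xml
    }
    Ok(())
}
```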
@schungx Yes, but consider the use case of just reading an Excel range into memory as a Vec of strings. There it would be a waste of resources to load the styles, protection, tables, etc. of each cell into memory when the user never asked for them; those parts also live in separate XML files within the archive. So decoding data only when it is requested is better for memory, in my opinion. Loading everything up front is also a valid use case, which the user could opt into for repeated edits of the same file.
Perhaps flags to control what info should be loaded. That would give more flexibility.
The library implements lazy_read(path) for the case where a large file is to be edited. This lets the user decide when to deserialize, on a per-sheet basis:
```rust
// Read raw data (not deserialized)
let path = std::path::Path::new("./tests/test_files/aaa.xlsx");
let mut book = umya_spreadsheet::reader::xlsx::lazy_read(path).unwrap();

// Deserialize the first sheet
let sheet = book.get_sheet_mut(&0).unwrap();
```
A first stepping stone could be, for example, improvements along these lines.
@MathNya, I felt `lazy_read(path)` lacks sufficient documentation of its exact use case. I was using `book.get_sheet(&0).unwrap()` and getting an error.
Also, once a sheet is loaded, things like tables, protection, etc. are loaded into the in-memory properties along with it. I meant for those to be loaded lazily as well.
I am open to discussion on this issue. Thanks for responding.
Or some form of a builder API:
```rust
let excel_reader = ExcelReaderBuilder::new()
    .with_values()
    .with_formulas()
    .with_pictures()
    .build();

// The following reads Book.xlsx, but only with values, formulas, and pictures.
// Other data such as formatting, drawings, etc. is skipped (i.e. left at defaults).
// It would be as if the sheet were copied and "Paste Special"-ed onto a blank
// new sheet with only values and formulas.
let file = excel_reader.read("Book.xlsx").unwrap();
```
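Purely as a sketch of how such a builder might be wired up (none of these types exist in the library today; all names are hypothetical):

```rust
/// Hypothetical flags; each one gates whether the corresponding
/// XML parts of the archive are parsed at all.
#[derive(Default)]
pub struct ExcelReaderBuilder {
    values: bool,
    formulas: bool,
    pictures: bool,
}

impl ExcelReaderBuilder {
    pub fn new() -> Self {
        Self::default()
    }
    pub fn with_values(mut self) -> Self {
        self.values = true;
        self
    }
    pub fn with_formulas(mut self) -> Self {
        self.formulas = true;
        self
    }
    pub fn with_pictures(mut self) -> Self {
        self.pictures = true;
        self
    }
    pub fn build(self) -> ExcelReader {
        ExcelReader { opts: self }
    }
}

pub struct ExcelReader {
    opts: ExcelReaderBuilder,
}

impl ExcelReader {
    pub fn read(&self, path: &str) {
        // The point of the flags: skip the styles/drawings parsers
        // entirely when a flag is off, rather than parsing and discarding.
        let _ = (path, self.opts.values, self.opts.formulas, self.opts.pictures);
    }
}
```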
@agentjill You are right. `lazy_read(path)` is a bit quirky. I have felt the need to improve the documentation.
> Also, once a sheet is loaded, things like tables, protection, etc. are loaded into the in-memory properties along with it. I meant for those to be loaded lazily as well.
That would be wonderful. I agree with subdividing the lazy load.
Use parallel programming, Luke! Worksheets are almost independent:
```rust
pub fn read_reader<R: io::Read + io::Seek>(
    reader: R,
    with_sheetread: bool,
) -> Result<Spreadsheet, XlsxError> {
    // ...
    for (_, type_value, rel_target) in &workbook_rel {
        // can it be parallel?
        // ...
    }
    // ...
}
```
@mjhmjhgkmj Very nice idea. The sections you mentioned can be processed in parallel. (Perhaps there are other areas where parallel processing is possible.)
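As an illustration only, a minimal sketch of fanning worksheet parsing out with rayon; `SheetXml`, `Worksheet`, and `parse_sheet` are placeholders, not the library's actual types:

```rust
use rayon::prelude::*;

// Placeholder types standing in for the library's internals.
struct SheetXml(String);
struct Worksheet;

fn parse_sheet(xml: &SheetXml) -> Worksheet {
    // Each worksheet is a separate XML part in the archive, so it can
    // be parsed without touching the other sheets.
    let _ = &xml.0;
    Worksheet
}

fn parse_all(sheets: &[SheetXml]) -> Vec<Worksheet> {
    // par_iter() distributes the per-sheet work across a thread pool,
    // and collect() preserves the original sheet order.
    sheets.par_iter().map(parse_sheet).collect()
}
```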
As far as I have seen, the entire Excel file's properties are loaded into memory for both reading and writing. To optimize performance, I suggest selectively loading and manipulating data in memory rather than the whole dataset. This could boost efficiency, especially for larger files and for scenarios where only specific data is modified; most workloads only use a limited feature set anyway.
I'd appreciate a good discussion on this proposal.