MathNya / umya-spreadsheet

A pure rust library for reading and writing spreadsheet files

MIT License

How to read large files? Is there a way to read them as a stream? #237

Open Zzaniu opened 4 weeks ago

Zzaniu commented 4 weeks ago

Reading a large file makes memory usage blow up until allocation fails:

let mut book = umya_spreadsheet::reader::xlsx::lazy_read(self.file_path.as_ref())?;
let cell = book
    .get_lazy_read_sheet_cells(&sheet_index)
    .map_err(|e| anyhow!("{e}"))?;
let v = cell.get_cell_value((1, index)).get_value();
...


memory allocation of 17867735056 bytes failed

Zzaniu commented 4 weeks ago

I worked around this problem with calamine, but calamine can only read files, not write them. I really hope that umya_spreadsheet will solve this problem.

MathNya commented 3 weeks ago

@Zzaniu Thank you for contacting us. We are sorry, but we may not be able to meet your expectations. We are aware that umya-spreadsheet consumes more memory than other libraries. However, we have not found a solution at this time. I think a solution will be quite a while away.

BharathIO commented 3 weeks ago

@MathNya I am also looking for the same thing. Are there any workarounds at the moment to support streaming? I would appreciate your response.

Specifically, I need streaming while writing rows to an Excel workbook sheet.

MathNya commented 3 weeks ago

@BharathIO When umya-spreadsheet updates a cell, it deserializes all cells in the sheet. Because of this implementation, it is not currently possible to achieve the expected behavior.
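For contrast with the behavior described above, here is a minimal sketch of what row-by-row streaming output looks like in general. It is not umya-spreadsheet's API and not a complete XLSX writer: `RowStreamWriter`, `escape_xml`, and `render_rows` are hypothetical names, and the code only emits the `<sheetData>` fragment of a worksheet part (a real file also needs the remaining workbook parts and ZIP packaging). The point is that each row is written and then dropped, so memory use does not grow with the number of rows.

```rust
use std::io::{self, Write};

/// Escapes the characters that cannot appear verbatim in XML text.
fn escape_xml(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Hypothetical streaming writer: emits one `<row>` at a time and keeps
/// no previously written rows in memory.
struct RowStreamWriter<W: Write> {
    out: W,
    next_row: u32,
}

impl<W: Write> RowStreamWriter<W> {
    /// Opens the `<sheetData>` element on the underlying writer.
    fn new(mut out: W) -> io::Result<Self> {
        out.write_all(b"<sheetData>")?;
        Ok(Self { out, next_row: 1 })
    }

    /// Writes a single row of inline-string cells, then forgets it.
    fn write_row(&mut self, cells: &[&str]) -> io::Result<()> {
        write!(self.out, "<row r=\"{}\">", self.next_row)?;
        for cell in cells {
            write!(
                self.out,
                "<c t=\"inlineStr\"><is><t>{}</t></is></c>",
                escape_xml(cell)
            )?;
        }
        self.out.write_all(b"</row>")?;
        self.next_row += 1;
        Ok(())
    }

    /// Closes `<sheetData>`, flushes, and returns the underlying writer.
    fn finish(mut self) -> io::Result<W> {
        self.out.write_all(b"</sheetData>")?;
        self.out.flush()?;
        Ok(self.out)
    }
}

/// Demo helper: streams the given rows into an in-memory buffer.
fn render_rows(rows: &[Vec<&str>]) -> io::Result<String> {
    let mut writer = RowStreamWriter::new(Vec::new())?;
    for row in rows {
        writer.write_row(row)?;
    }
    let bytes = writer.finish()?;
    Ok(String::from_utf8_lossy(&bytes).into_owned())
}
```

In practice the writer would wrap a `std::io::BufWriter` over a file rather than a `Vec<u8>`. The trade-off is that rows already written cannot be revisited, which is why a library that supports editing arbitrary cells cannot simply adopt this design.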

BharathIO commented 2 weeks ago

OK. Thanks for the update, @MathNya.

When I write around 17k records into a sheet, memory consumption is high. Is there anything I can do to reduce it? Please share your thoughts.

I observed 800 to 900 MB of RAM in use while writing 17k records.
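To put those figures in per-record terms, a rough calculation follows. It assumes 1 MB = 10^6 bytes and attributes the whole 800 to 900 MB to the 17k records; both are simplifying assumptions, the number of columns per record is not stated in the thread, and `bytes_per_record` is a hypothetical helper.

```rust
/// Average bytes per record using integer division.
/// Returns `None` when `records` is zero.
fn bytes_per_record(total_bytes: u64, records: u64) -> Option<u64> {
    total_bytes.checked_div(records)
}

/// Lower and upper estimates for the figures reported in the thread.
fn reported_range() -> (Option<u64>, Option<u64>) {
    let records = 17_000;
    (
        bytes_per_record(800 * 1_000_000, records),
        bytes_per_record(900 * 1_000_000, records),
    )
}
```

Under those assumptions the result is roughly 47 to 53 KB per record.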

BharathIO commented 2 weeks ago

Is there any way to write a custom serializer/deserializer for my use case to process 17k records?

schungx commented 2 weeks ago

17k records in memory are probably going to take a large amount of RAM by themselves.

BharathIO commented 2 weeks ago

But from what I observed, other libraries such as xlsxwriter did not consume a large amount of RAM.

The umya-spreadsheet library has many more features than other libraries; the only issue is its high RAM consumption and high CPU utilization. Are there any workarounds at the moment?

schungx commented 2 weeks ago

> umya-spreadsheet library has lot more features compared to other libraries, only issue is with large amount of RAM consumption and high CPU Utilization. Any workarounds at this moment?

I believe there are avenues to reduce the memory footprint of many data types, but essentially more features = more data types to keep track of. Therefore it is not always avoidable.
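One general avenue of this kind, sketched below, is to keep rarely used feature data behind a pointer so that cells without it stay small. This is only an illustration of the idea and is not claimed to be what umya-spreadsheet or any particular PR does; `Style`, `FatCell`, and `LeanCell` are hypothetical types.

```rust
use std::mem::size_of;

/// Hypothetical per-cell formatting data, standing in for "features".
#[allow(dead_code)]
struct Style {
    font_size: f64,
    font_color: [u8; 4],
    fill_color: [u8; 4],
    border_widths: [f64; 4],
    number_format_id: u32,
}

/// Every cell reserves room for a full `Style` inline, styled or not.
#[allow(dead_code)]
struct FatCell {
    value: f64,
    style: Option<Style>,
}

/// The style lives on the heap only for cells that actually have one;
/// an unstyled cell pays for a single null pointer.
#[allow(dead_code)]
struct LeanCell {
    value: f64,
    style: Option<Box<Style>>,
}

/// Returns (size of FatCell, size of LeanCell) in bytes.
fn cell_sizes() -> (usize, usize) {
    (size_of::<FatCell>(), size_of::<LeanCell>())
}
```

The cost is one extra heap allocation and a pointer indirection for each styled cell, so the approach pays off mainly when most cells carry no such data.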

schungx commented 1 week ago

Try my PR to see how much it reduces...

https://github.com/MathNya/umya-spreadsheet/pull/242

BharathIO commented 1 week ago

> Try my PR to see how much it reduces...
>
> #242

Great, I can see a big improvement now in memory and CPU utilization. I will validate a few more use cases and post my observations here.

schungx commented 2 days ago

@BharathIO I would be interested to know the memory usage in the new version.