MathNya / umya-spreadsheet

A pure rust library for reading and writing spreadsheet files
MIT License

CPU Utilization and Memory Issues #233

Open BharathIO opened 1 month ago

BharathIO commented 1 month ago

Hi, while reading and writing an Excel file, CPU utilization peaks above 100% and RAM consumption goes beyond 1 GB. Can you please check and do the needful?

Attaching Excel template.

Steps to reproduce:

  1. Use the attached XLSX file below and read it using umya_spreadsheet::reader::xlsx::lazy_read
  2. After reading, try copying row 2 and creating 5 to 6 rows
  3. Observe memory and CPU utilization; it spikes

Vendor_Details.xlsx

MathNya commented 1 month ago

@BharathIO Thank you for contacting us. We will investigate.

BharathIO commented 1 month ago

One more point I observed.

Whenever we try to create a new row or modify an existing cell value, CPU utilization spikes to 99% or so.

Ex: new_book.insert_new_row("Sheet1", &3, &1);

BharathIO commented 1 month ago

I would appreciate it if you could share which functions I can use to read/write lazily.

Ex: let mut new_book = umya_spreadsheet::reader::xlsx::lazy_read(path1).unwrap();

The lazy_read above somehow did not eat memory or much CPU.

schungx commented 1 month ago

One more point I observed.

Whenever we try to create a new row or modify an existing cell value, CPU utilization spikes to 99% or so.

Ex: new_book.insert_new_row("Sheet1", &3, &1);

Is there a problem with the CPU going to 100%? In a compute-heavy scenario, you'd want the CPU to max out so that the result is returned quickly... Now, I don't think this repo is multi-threaded, so I assume you're talking about maxing out one CPU core?

Excessive RAM usage is another matter. I must ask: have you compiled with --release? It makes a huge difference.

Also, your XLSX is not particularly small. It may contain only one row of data, but it is 818 KB in size and actually contains 17,528 rows, which makes it larger than many large spreadsheet models. I suspect there is a lot of formatting in individual cells? Can't tell without analyzing the file in depth.

So you have an 800 KB file, which is close to 1 MB. Parsing this into a Rust structure, it is not unreasonable to expect memory usage of several times the data size, especially when every record is small and there are a LOT of them. I'd guess 1 GB is a bit too much, but I would definitely expect at least a few hundred MB of RAM consumed just by reading the file into memory.

For example, an entry like <foo /> is only a few bytes in the Excel file, but the foo structure may contain hundreds of fields, all of which are defaults for this entry. The actual type in Rust can be HUGE.
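A rough back-of-the-envelope sketch of this effect (the struct and its fields here are hypothetical, not umya-spreadsheet's actual types):

```rust
use std::mem::size_of;

// Hypothetical cell struct: a tiny XML entry like `<c/>` still occupies
// the full size of the in-memory type, even when every field is None.
#[derive(Default)]
struct CellSketch {
    value: Option<String>,
    formula: Option<String>,
    number_format: Option<String>,
    font: Option<String>,
    fill: Option<String>,
    border: Option<String>,
    hyperlink: Option<String>,
    style_index: Option<u32>,
}

fn main() {
    let per_cell = size_of::<CellSketch>();
    println!("bytes per empty cell: {}", per_cell);

    // 17,528 rows x 13 columns of mostly-empty cells adds up quickly,
    // even before counting heap allocations for non-empty fields:
    let total_mb = per_cell * 17_528 * 13 / (1024 * 1024);
    println!("approx MB just for the cell structs: {}", total_mb);
}
```

Real parser types carry far more fields than this sketch, so the multiplier over the on-disk size is correspondingly larger.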

Removing the extra blank rows shrinks the file down to 14KB.

BharathIO commented 1 month ago

Thank you for your update and detailed explanation.

I am currently facing an issue where CPU utilization spikes to 100%. According to the configuration, when CPU usage exceeds a threshold of 60%, the process is shut down, and an email alert is sent to the admin.

I am using the cargo build --release option for compilation. While the file size itself isn't a concern, the CPU and RAM usage during read/write operations is much higher than expected.

I will attach a sample code with CPU and memory statistics for your reference.

I understand that larger files can naturally consume more resources, but in this case the files are relatively small. However, even with the code below, which only removes the row at index 2, CPU consumption reaches 99%.

Could you kindly provide suggestions or steps to optimize resource usage and reduce the CPU/memory footprint during file read/write operations? I am not concerned about increasing the time it takes for these operations if that helps in reducing resource consumption.

let template_path = "../lov-template-new.xlsx";
let output_path = "../1_style_issue_output.xlsx";

let mut book = match umya_spreadsheet::reader::xlsx::read(template_path) {
    Ok(book) => book,
    Err(e) => {
        println!("Error: {:?}", e);
        return;
    }
};

if let Some(sheet) = book.get_sheet_mut(&1) {
    let row_index = 2;
    sheet.remove_row(&row_index, &1);
    println!("Removed row: {}", row_index);
}

match umya_spreadsheet::writer::xlsx::write(&book, output_path) {
    Ok(_) => println!("Success writing to file: {}", output_path),
    Err(e) => println!("Error: {:?}", e),
};
(Screenshot attached: CPU and memory statistics, 2024-10-09.)
schungx commented 1 month ago

Must you keep all those blank rows?

Or can you use column styles instead?

schungx commented 1 month ago

I would suggest reducing the size of the xlsx files. Excel files are not the easiest to parse and this repo does a very thorough job in exposing very detailed structure. You have to give it some slack by helping it with less work.

BharathIO commented 1 month ago

I appreciate the power of this repo; it has all the features I have wanted so far.

Can you please help me reduce the size of an XLSX file without losing any data or styles?

Also, I saw that the lazy_read and write_light functions are very helpful. Can you suggest similar functions that I can use for my functionality?

schungx commented 1 month ago

I would start by finding out why it is so large in the first place. You most likely have duplicated formatting styles over all 17,000+ rows.

Maybe you should just make some of the column styles and they'll be in effect for all cells in that column.

EDIT: Strange, your XLSX file's data is 6MB (!!!) uncompressed. However, I don't see any specific formatting in any of those cells.

schungx commented 1 month ago

Vendor_Details.xlsx

Try this one instead.

BharathIO commented 1 month ago

Thanks, I will try it and let you know my results.

May I know how you compressed it without losing styles, formatting, etc.? As per my understanding, the new file you shared has the following: highest column: 13, highest row: 2.

All the blank rows are gone now. Please share the steps to do this for any other files in the future.

schungx commented 1 month ago

All the blank rows are gone now. Please share the steps to do this for any other files in the future.

Simple. To delete all the blank columns up to the right end: select the first blank column, press Shift-Ctrl-Right-Arrow to select all the columns to the right, then Edit-Delete.

To delete all the blank rows down to the bottom: select the first blank row, press Shift-Ctrl-Down-Arrow to select all the rows below, then Edit-Delete.

BharathIO commented 1 month ago

Thanks for the update. I am aware of the manual steps to remove blank rows. Is there any way to remove them programmatically?

schungx commented 1 month ago

Thanks for the update. I am aware of the manual steps to remove blank rows. Is there any way to remove them programmatically?

There is an API to delete rows and columns...

BharathIO commented 1 month ago

Thanks for the update. I am aware of the manual steps to remove blank rows. Is there any way to remove them programmatically?

There is an API to delete rows and columns...

Is it sheet.remove_row() you are talking about?

I mean, I need a way to remove/delete all empty rows in a given input file programmatically with Rust.

schungx commented 1 month ago

I mean, I need a way to remove/delete all empty rows in a given input file programmatically with Rust.

I believe you can loop through the Excel sheet data structures and simply delete all cell-specific data where the cell is empty. However, that will require iterating over a huge number of cells. I would not recommend it.

You are not doing simple manipulation here. You're rewriting Excel.

Alternatively, if all your rows are guaranteed to contain either all data or all blanks, then just search downwards until you find a blank row, then remove all rows from there to the maximum in one single call.
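The scan-then-delete idea above can be sketched as plain row-scanning logic; only the final deletion would go through the library, so it is shown as a comment here (the exact call depends on your sheet handle, and the helper name is mine, not part of umya-spreadsheet):

```rust
/// Returns the 1-based index of the first row whose cells are all empty,
/// or None if every row has data. Assumes rows are either all data or
/// all blanks, as described above.
fn first_blank_row(rows: &[Vec<&str>]) -> Option<u32> {
    rows.iter()
        .position(|row| row.iter().all(|cell| cell.trim().is_empty()))
        .map(|i| i as u32 + 1)
}

fn main() {
    let rows = vec![
        vec!["Vendor", "Code"],
        vec!["Acme", "V-001"],
        vec!["", ""],
        vec!["", ""],
    ];
    if let Some(start) = first_blank_row(&rows) {
        // Delete everything from the first blank row to the last row
        // in a single range operation rather than row by row.
        let count = rows.len() as u32 - start + 1;
        println!("remove {} rows starting at row {}", count, start);
        // With umya-spreadsheet this would be roughly:
        // sheet.remove_row(&start, &count);
    }
}
```

One range deletion avoids the per-row shifting cost of calling the removal API once per blank row.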

BharathIO commented 1 month ago

Thanks for the info. I am trying a couple of approaches and have a few questions.

Can you please explain the performance differences between

use umya_spreadsheet::{reader::xlsx::lazy_read, writer::xlsx::write_light};

  1. read vs lazy_read
  2. write vs write_light
MathNya commented 1 month ago

@BharathIO read reads a file and then deserializes it. lazy_read only reads the file. lazy_read does not deserialize, so it is limited in what it can do, but is faster.

The difference between write and write_light is in the compression format. write_light is faster than write, but increases the file size of the generated files.

MathNya commented 1 month ago

I mean, I need a way to remove/delete all empty rows in a given input file programmatically with Rust.

I believe you can loop through the Excel sheet data structures and simply delete all cell-specific data where the cell is empty. However, that will require iterating over a huge number of cells. I would not recommend it.

You are not doing simple manipulation here. You're rewriting Excel.

Alternatively, if all your rows are guaranteed to contain either all data or all blanks, then just search downwards until you find a blank row, then remove all rows from there to the maximum in one single call.

We have prepared a function, cleanup(), that handles this process. However, processing speed is still somewhat of a challenge.

BharathIO commented 1 month ago

Great. I will definitely use cleanup for my use case. May I know how long it took to clean up all the empty rows (around 17K) in the file I shared?

MathNya commented 1 month ago

@BharathIO

Vendor_Details.xlsx

Try this one instead.

We performed a cleanup on this file. It was completed in an instant.

schungx commented 1 week ago

I would like everybody to try out my PR on: https://github.com/MathNya/umya-spreadsheet/pull/242

Let me know if memory requirements drop.

BharathIO commented 1 week ago

I would like everybody to try out my PR on: #242

Let me know if memory requirements drop.

Great, I can see a big improvement now in terms of memory and CPU utilization. I will validate a few more use cases and post my observations here.