PHPOffice / PHPExcel

ARCHIVED
Other
11.46k stars 4.2k forks source link

Simple optimisation for writing string cells #533

Open chrisgraham opened 9 years ago

chrisgraham commented 9 years ago

There's no point taking substrings and doing complex replaces for rare events. This is a simple optimisation:

diff --git a/sources_custom/PHPExcel/Cell/DataType.php b/sources_custom/PHPExcel/Cell/DataType.php
index 8576542..50d930e 100644
--- a/sources_custom/PHPExcel/Cell/DataType.php
+++ b/sources_custom/PHPExcel/Cell/DataType.php
@@ -93,11 +93,15 @@ class PHPExcel_Cell_DataType
             return $pValue;
         }

-        // string must never be longer than 32,767 characters, truncate if necessary
-        $pValue = PHPExcel_Shared_String::Substring($pValue, 0, 32767);
+        if (strlen($pValue) > 32767) {
+            // string must never be longer than 32,767 characters, truncate if necessary
+            $pValue = PHPExcel_Shared_String::Substring($pValue, 0, 32767);
+        }

-        // we require that newline is represented as "\n" in core, not as "\r\n" or "\r"
-        $pValue = str_replace(array("\r\n", "\r"), "\n", $pValue);
+        if (strpos($pValue, "\r") !== false) {
+            // we require that newline is represented as "\n" in core, not as "\r\n" or "\r"
+            $pValue = str_replace(array("\r\n", "\r"), "\n", $pValue);
+        }

         return $pValue;
     }

Side note:

I was trying to save memory when using PHPExcel, but regrettably I think this is not going to work for me. I've got a client on a shared server and 1000s of rows and about 100 columns, and with each cell being an object with a number of properties, even with the cache (which I don't think works for writes, only reads - according to my debug tests) it's not gonna help much. I run out of memory on the server after just a few hundred rows :(.

I think PHPExcel could use some really heavy optimisation. If a cell is just a simple scalar value with no styling, I'd recommend having no Cell object at all, just store it as a scalar. This would have enormous benefit. You might have to do some sneakiness in the API, e.g. if it is requested as a cell, turn it into a temporary standalone object that references into that scalar value, and implants itself if modified.

MarkBaker commented 9 years ago

Your first change will cause problems with non-ASCII string values (ie UTF-8) because strlen is counting bytes, not characters

Is an if test faster than a simple str_replace?

Storing a cell as a scalar might work, except that every cell has styling, even if it's just the default styling.... cache works for both reads and writes, because it's read/write agnostic.... but you're quite welcome to fork

chrisgraham commented 9 years ago

Non-ASCII strings are always gonna use more bytes though, so the optimisation should be ok I think. The optimisation might not kick in, causing an unnecessary substring call anyway, but not introduce any actual bug. Correct me if I'm wrong, I do have a headache to day so who knows ;).

str_replace in PHP will always rewrite the string I think, it's pretty inefficient. strpos will do a simpler scan. Never over-estimate the smartness of PHP, if you look at the C code you'll see it rarely does clever stuff, and typically has various levels of abstraction making things worse.

If it's just default styling on a cell, the temporary cell I mentioned above would be able to pull that out. If the styling is then made explicit, the object of that temporary cell can then modify the sheet to insert an actual cell object with that actual style. I know it's a tricky thing to do, but benefit would be enormous.

I tested setting my cell, then doing a var_dump of the object. I didn't see any references to caching, I saw the explicit value there. I couldn't see how the code could be doing it either. However, I did find the code a bit confusing, so I may have missed it. Perhaps it is changing what is in the sheet but leaving the cell object.

MarkBaker commented 9 years ago

If you var_dump a cell, you should see a parent property which references the cell cache for that worksheet.... but you only have a single active cell in the cell collection at any one time

MarkBaker commented 9 years ago

But I'm not going to make performance optimisations without decent performance tests to prove that they will make things faster.... I get enough abuse over performance as it is (some of it extremely personal) so I'm not prepared to simply accept unproven changes that could s easily make things slower as faster