Optimization. - Githubissues

sirherrbatka commented 5 years ago

Hello,

I noticed that cl-csv is slow when compared to the standard parser in python (and I have some jumbo sized CSV to parse so it hurts). I decided to attempt to profile and localize the issues that have sizeable impact on the runtime performance.

I discovered that problems stem from the read-dispatch-table-entry defclass. Since parsing involves calling generic accessors multiple times for every single character i decided to give it a shot and replace defclass with defstruct, and consequently all calls to generic accessors to slots of this class with struct accessors. I added few declaim inlines for internal functions and replaced few repeated calls to other accessors. Those changes yielded ~×3 runtime improvement when benchmarked on my machine. I don't see any flexibility drawbacks when doing this. Furthermore, I believe that even more improvement is possible when applying the same treatment to the read-dispatch-table defclass.

Are those changes acceptable design-wise? If yes, I will open pull request with those.

bobbysmith007 commented 5 years ago

Thanks so much for your interest and care in this topic, I hope we can come together to make cl-csv a better experience.

I have minor ideological problems with the change, because I much prefer classes to structs for a number of relatively unimportant reasons. I don't know if this would cause problems for any other libraries or not (as I don't maintain those). I have used the streaming interface of cl-csv (the one row at a time version of things), to process multiple gigabyte csv's in a reasonably timely manner (EG: the slow part was putting the data into the database and not the parsing of the CSV).

I believe PGLoader is a fairly extensive library built on cl-csv and I wouldn't want to force them to rewrite their library if it could be helped.

The goals of cl-csv at the time of writing were as follows:

A streaming interface to files so they could be processed a line at a time
error recovery so that a bad line or datum can be skipped and we can hopefully continue processing
error messages that detail exactly where and when and what data caused the issue.
Forgiving when reading, exact when writing
Must handle my 3 gig csvs without crashing

These are more important to me personally than it being the fastest library, but provided these don't change, faster is better.

Which CL implementation are you using? There should be some optimizations that can be made without removing classes for structs. I believe when testing (3-5 years ago) that I found that in SBCL while the struct calls were something like twice as fast as unspecialized method calls (methods where there is only one implementation without any dispatch rules), all method call times were dwarfed by the IO calls between disks and databases.

As implementations change over time and different code paths get optimized the optimizations that can be made in a library change, so I don't have anything definitive to say about the current optimzation status of cl-csv - only that it has been worked on for a decade and that we haven't had to remove the classes in favor of structs yet.

If you have already made the changes I would definitely be interested in looking at them as that would give me a better idea of how this might effect other applications. I would also be interested in which speed tests you are running, on what files sizes and how this translates into user runtime. I would not have expected a 3x speed improvement simply from changing from classes to structs. (Previously there was a lot of work optimizing the IO because SBCL provided relatively slow IO constructs - but these have been constantly improving)

Finally, I am not likely to have much chance to work on my Common Lisp libraries moving into the future. I have recently changed jobs, and am no longer in the position where I have time or need to maintain these libraries. I will do my best to keep things moving forward, but you may be better of branching cl-csv into cl-csv2 or something and moving forward from there without being concerned about backward compatibility and dealing with a very part-time maintainer.

Thanks again!

sirherrbatka commented 5 years ago

Hello, I am working with sbcl 1.4.16.

I indeed implemented changes already. The following code: (time (with-open-file (stream "~/test-data.csv") (iterate (for row = (cl-csv:read-csv-row stream)) (while (peek-char t stream nil nil)))))

will take 23.291988 seconds of total run time (23.268166 user, 0.023822 system). After changes i described in the issue the same code executes in 6.328420 seconds of total run time (6.298635 user, 0.029785 system) (still slow when compared to reader included in the python, but obviously closer). Hence reported fraction of 3 improvements.

I am working on a data frame library for CL, and I would like to use CL-CSV for parsing tables stored as CSV files. I will open the pull request for your inspection and think what to do next.

Test file is about 70 megabytes in size.

bobbysmith007 commented 5 years ago

I am currently running 1.4.2 and I have to think something changed in terms of SBCL, this library and the optimizations attempted previously. The last record of running my "read large file" speed test showed me reading the file in 1.5 sec of real time. It now takes 11.5 sec to read the same file on the same computer (with I believe the same version of the software, but different SBCLs).

I'll take a look at your version when I have a moment, but it may be a bit.

sirherrbatka commented 5 years ago

OK. I don't have any idea what could possible change in the SBCL.

AccelerationNet / cl-csv

Optimization. #32