Closed analyticd closed 6 years ago

Not sure if this is a bug or just a misunderstanding on my part, but using an earlier version from Quicklisp (perhaps 2016-ish? not sure), this used to work as I expected, and it now doesn't with either the Quicklisp 2018-01-31 dist or the latest (as of today) from GitHub. Here is a test case to demonstrate. I was expecting output of:

=> (4 5 6)

since :skip-first-p was set to t.
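A minimal sketch of such a test case (hypothetical data and conversion, not the exact snippet from the report):

;; With :skip-first-p t, the expectation is that :map-fn is applied only to
;; the second row, yielding (4 5 6); the behavior being reported is that
;; :map-fn also sees the first (skipped) row.
(cl-csv:read-csv (format nil "1,2,3~%4,5,6~%")
                 :skip-first-p t
                 :map-fn (lambda (row) (mapcar #'parse-integer row)))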
Hrmm... I think that map-fn got pushed forward in the processing, to before skip-row is applied: https://github.com/AccelerationNet/cl-csv/blob/master/parser.lisp#L189
This was not necessarily intentional; the whole process was rewritten. I think my logic was that the map function should be able to decide to skip a row on certain data conditions. I was thinking of using map to convert the data in process (e.g. change data types) and sometimes skip the whole row on some condition (e.g. a data-type conversion failed). I can see an argument either way and am not sure what makes the most sense there. The row-fn and the data collection take place after skip-row.
What do you think? I agree that the current situation is less intuitive, but I was hoping it might provide slightly more flexibility; maybe not, though.
@bobbysmith007, I don't mind the evolution of the library. I actually didn't know about :row-fn, so I am fine with using that instead, and I see it gives the behavior I want. Thanks for your response.
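For reference, a minimal sketch (hypothetical data) of getting both behaviors discussed above via :row-fn: converting fields in-process and skipping any row whose conversion fails. Note that when :row-fn is supplied, read-csv streams rows to it rather than collecting them, so the sketch collects results itself:

;; :row-fn runs after :skip-first-p, so the header row never reaches it;
;; rows that fail integer conversion are silently dropped.
(let (rows)
  (cl-csv:read-csv (format nil "h1,h2,h3~%4,5,6~%x,y,z~%")
                   :skip-first-p t
                   :row-fn (lambda (row)
                             (handler-case
                                 (push (mapcar #'parse-integer row) rows)
                               (parse-error () nil))))
  (nreverse rows))
;; => ((4 5 6))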
I'll just mention that I did end up going back to a version of cl-csv from Quicklisp dist 2017-05-16, because I found it had dramatically faster perceived (non-measured) execution, seemingly at least two to three times faster. So maybe additions since last May have really slowed down read-csv with a :map-fn function (back then :map-fn respected :skip-first-p, so I just stuck with that instead of the :row-fn I used in the newest cl-csv test)? Thought you might want to know; I have a performance-sensitive production environment.
Interesting. I actually thought my new parser would be more general but slower; however, when I tested it, it was faster than my older "optimized" way. Originally I had it flipping implementations based on which parser would be faster, but the new one seemed to always be faster (on recent SBCL), so I just retired the old parser.
I will run my speed tests again in case I had some debug setting wrong that changed the performance.
The new parser is much easier to work on and much more flexible, so it would be nice if we could get that behavior without a performance penalty. (The new parser was written to accommodate multi-character data separators and more flexible line-ending handling.)
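For instance, a small sketch of the multi-character separator support mentioned above (made-up data; assumes the new parser's :separator keyword accepts a multi-character string):

;; Parse a file-like string that uses "||" as its field separator.
(cl-csv:read-csv (format nil "a||b||c~%1||2||3~%") :separator "||")
;; => (("a" "b" "c") ("1" "2" "3"))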
When I was optimizing this 5 years ago, it seemed that read-line / reading into a buffer was much faster than reading by character. When I recently tested the new parser, these seemed to be about the same speed; perhaps this is the implementation difference I am seeing from you.
Looking further and running some speed tests on a new version in recent SBCL, the new version seems on par with or faster than the old one (which was harder to understand, read, and debug).
SPEED-TESTS> (collect-big-file-csv-rows-old)
10:55:58 BEGIN read-csv-old large file test
10:56:04 END read-csv-old large file test
Evaluation took:
6.090 seconds of real time
6.092000 seconds of total run time (6.060000 user, 0.032000 system)
[ Run times consist of 0.076 seconds GC time, and 6.016 seconds non-GC time. ]
100.03% CPU
20,660,760,770 processor cycles
101,775,344 bytes consed
NIL
SPEED-TESTS> (collect-big-file-csv-rows)
10:56:07 BEGIN read-csv large file test
10:56:13 END read-csv large file test
Evaluation took:
6.130 seconds of real time
6.132000 seconds of total run time (6.024000 user, 0.108000 system)
[ Run times consist of 0.216 seconds GC time, and 5.916 seconds non-GC time. ]
100.03% CPU
20,796,471,968 processor cycles
89,034,240 bytes consed
SPEED-TESTS> (read-by-line-and-buffer-and-char)
;;; Just reading a big file
10:59:20 BEGIN read large file by lines
10:59:20 END read large file by lines
Evaluation took:
0.235 seconds of real time
0.232000 seconds of total run time (0.228000 user, 0.004000 system)
[ Run times consist of 0.052 seconds GC time, and 0.180 seconds non-GC time. ]
98.72% CPU
797,098,094 processor cycles
50,999,696 bytes consed
10:59:20 BEGIN read large file by buffer
10:59:21 END read large file by buffer
Evaluation took:
0.075 seconds of real time
0.076000 seconds of total run time (0.072000 user, 0.004000 system)
101.33% CPU
255,064,516 processor cycles
0 bytes consed
10:59:21 BEGIN read large file by read-into-buffer-until
10:59:21 END read large file by read-into-buffer-until
Evaluation took:
0.198 seconds of real time
0.200000 seconds of total run time (0.196000 user, 0.004000 system)
101.01% CPU
669,845,392 processor cycles
0 bytes consed
10:59:21 BEGIN read large file by char
10:59:21 END read large file by char
Evaluation took:
0.186 seconds of real time
0.184000 seconds of total run time (0.184000 user, 0.000000 system)
98.92% CPU
631,960,580 processor cycles
0 bytes consed
lines:120001 , buffers:133502, buffered-lines:120001, chars:10680089
I would be curious to know which implementation you are using and what the speed tests report for you. I pushed a patch to make the speed tests more readily runnable by others:
(ql:quickload :cl-csv/speed-test)
and then (cl-csv-test.speed-tests::run-speed-tests)
My supposition, however, is that one of the underlying mechanisms (e.g. reading by char vs. by buffer) is what is causing this slowdown.
If it turns out that read-char is the suspect on other implementations, I have added a branch that uses read-sequence instead, which may exhibit better performance:
https://github.com/AccelerationNet/cl-csv/tree/buffered-reads-for-parser
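For comparison, a minimal self-contained sketch (not cl-csv's actual code) of the two reading strategies in question:

;; Strategy 1: read one character at a time.
(defun count-chars-by-char (path)
  (with-open-file (s path)
    (loop for c = (read-char s nil nil)
          while c count t)))

;; Strategy 2: read through a reusable buffer with read-sequence.
(defun count-chars-by-buffer (path &optional (buffer-size 8192))
  (with-open-file (s path)
    (let ((buffer (make-string buffer-size)))
      (loop for fill = (read-sequence buffer s)
            while (plusp fill)
            sum fill))))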
Thanks so much for taking a look at this, @bobbysmith007. I hope to get a chance to run the speed test locally very soon, and then I will report back. I appreciate the work you are doing.
I am using (need to use) CCL for this project. Changing Quicklisp dists changes the versions of all the libraries used in the project, so it could just be that, between 2016-02-08 and 2018-01-31, one of the other libraries I am using (cl-ppcre, parse-number, or uiop) changed to a much slower version.
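For reference, the downgrade itself is just a standard Quicklisp dist install (a sketch; the URL follows Quicklisp's usual dist naming, here for the 2017-05-16 dist I mentioned earlier):

;; Install a specific older dist; every library then comes from that release.
(ql-dist:install-dist
 "http://beta.quicklisp.org/dist/quicklisp/2017-05-16/distinfo.txt"
 :replace t :prompt nil)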
In the context of Quicklisp dist 2018-01-31, when I pull the latest from the cl-csv git repo, then (load "/path/to/cl-csv.asd"), then (ql:quickload :cl-csv), it barfs with some cl-unicode package errors. I'm not sure how to proceed with trying the speed test via (ql:quickload :cl-csv/speed-test), which of course also doesn't load if cl-csv doesn't. Anyway, maybe it is enough for you to know that I am using CCL.
Hrmm... I wasn't aware I was loading cl-unicode somewhere down the way, so I definitely don't have assistance on that one. I found that it's loaded through cl-interpol, which is probably not strictly required (it's just there for convenience).
I may just try to install CCL and see if I can find what's happening, or I may drop cl-interpol as a dependency, as it's not really necessary, only a nice convenience.
Thanks for the effort. I will report back with my eventual findings, such as they may be.
Also, which version of CCL?
1.11.5 64-bit
I was able to run the cl-csv speed tests on CCL 1.11.5 64-bit, as downloaded from their web page and executing the binary version. I did not experience the cl-unicode issues you did, so maybe you have a compilation cache that needs to be cleared on that front.
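If a stale cache is the culprit, one possible fix (a suggestion, not something verified in this thread) is to have ASDF rebuild everything from source:

;; :force :all recompiles cl-csv and all of its dependencies, bypassing any
;; stale fasls in the compilation cache.
(asdf:load-system :cl-csv :force :all)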
My tests showed CCL distinctly slower than SBCL at this task (34s vs. 6s) for reading the whole large CSV. That said, the old implementation was not appreciably different in speed (34.2s vs. 33.8s).
I am sorry your workload experienced a slowdown. I am on Ubuntu Linux 16.04, 64-bit, if that makes a difference. I am not sure how to approach fixing it if I can't find an appreciable difference myself.
17:42:43 BEGIN read large file by lines
17:42:44 END read large file by lines
(PROGN (LET (LINE)
(CL-CSV::WITH-CSV-INPUT-STREAM
(S +TEST-BIG-FILE+)
(HANDLER-CASE (LOOP WHILE (SETF LINE (READ-LINE S))
DO (INCF CNT))
(END-OF-FILE (C) (DECLARE (IGNORE C)))))))
took 497,846 microseconds (0.497846 seconds) to run.
23,767 microseconds (0.023767 seconds, 4.77%) of which was spent in GC.
During that period, and with 2 available CPU cores,
495,798 microseconds (0.495798 seconds) were spent in user mode
3,991 microseconds (0.003991 seconds) were spent in system mode
44,164,720 bytes of memory allocated.
17:42:44 BEGIN read large file by buffer
17:42:44 END read large file by buffer
(PROGN (CL-CSV::WITH-CSV-INPUT-STREAM
(S +TEST-BIG-FILE+)
(ITER (WITH BUFFER = (MAKE-ARRAY 80 :ELEMENT-TYPE 'CHARACTER))
(WITH FILL)
(HANDLER-CASE (SETF FILL (READ-SEQUENCE BUFFER S))
(END-OF-FILE NIL (FINISH)))
(INCF CNT2)
(WHILE (= 80 FILL)))))
took 239,816 microseconds (0.239816 seconds) to run.
During that period, and with 2 available CPU cores,
241,084 microseconds (0.241084 seconds) were spent in user mode
19 microseconds (0.000019 seconds) were spent in system mode
4,144 bytes of memory allocated.
17:42:44 BEGIN read large file by read-into-buffer-until
17:42:44 END read large file by read-into-buffer-until
(PROGN (CL-CSV::WITH-CSV-INPUT-STREAM
(S +TEST-BIG-FILE+)
(LET ((BUFFER (MAKE-STRING CL-CSV::*BUFFER-SIZE*)))
(HANDLER-CASE (LOOP WHILE (PLUSP (CL-CSV::READ-INTO-BUFFER-UNTIL
BUFFER
S))
DO (INCF CNT3))
(END-OF-FILE (C) (DECLARE (IGNORE C)))))))
took 384,333 microseconds (0.384333 seconds) to run.
During that period, and with 2 available CPU cores,
370,654 microseconds (0.370654 seconds) were spent in user mode
7,943 microseconds (0.007943 seconds) were spent in system mode
5,744 bytes of memory allocated.
17:42:44 BEGIN read large file by char
17:42:45 END read large file by char
(PROGN (CL-CSV::WITH-CSV-INPUT-STREAM
(S +TEST-BIG-FILE+)
(LET ((C))
(HANDLER-CASE (LOOP WHILE (SETF C (READ-CHAR S))
DO (INCF CNT4))
(END-OF-FILE (C) (DECLARE (IGNORE C)))))))
took 287,511 microseconds (0.287511 seconds) to run.
During that period, and with 2 available CPU cores,
280,574 microseconds (0.280574 seconds) were spent in user mode
7,946 microseconds (0.007946 seconds) were spent in system mode
3,680 bytes of memory allocated.
lines:120001 , buffers:133502, buffered-lines:120001, chars:10680089
CL-CSV-TEST.SPEED-TESTS::READ-BY-LINE-AND-BUFFER-AND-CHAR - WARNINGS (1.43s) : 0 assertions passed
| WARNINGS (1)
| WARNING: In CL-CSV::BODY inside an anonymous lambda form inside READ-BY-LINE-AND-BUFFER-AND-CHAR-TEST-THUNK: Unused lexical variable C
| #<STYLE-WARNING #x302001DF844D>
|
|
Starting: CL-CSV-TEST.SPEED-TESTS::COLLECT-BIG-FILE-CSV-ROWS
17:42:45 BEGIN read-csv large file test
17:43:19 END read-csv large file test
(PROGN (READ-CSV +TEST-BIG-FILE+))
took 34,273,467 microseconds (34.273468 seconds) to run.
485,037 microseconds ( 0.485037 seconds, 1.42%) of which was spent in GC.
During that period, and with 2 available CPU cores,
33,935,609 microseconds (33.935608 seconds) were spent in user mode
111,776 microseconds ( 0.111776 seconds) were spent in system mode
75,590,032 bytes of memory allocated.
7,156 minor page faults, 0 major page faults, 0 swaps.
CL-CSV-TEST.SPEED-TESTS::COLLECT-BIG-FILE-CSV-ROWS - EMPTY (34.28s) : 0 assertions passed
Starting: CL-CSV-TEST.SPEED-TESTS::COLLECT-BIG-FILE-CSV-ROWS-OLD
17:43:19 BEGIN read-csv-old large file test
17:43:53 END read-csv-old large file test
(PROGN (CL-CSV::READ-CSV-OLD +TEST-BIG-FILE+))
took 33,803,458 microseconds (33.803455 seconds) to run.
506,981 microseconds ( 0.506981 seconds, 1.50%) of which was spent in GC.
During that period, and with 2 available CPU cores,
33,465,215 microseconds (33.465214 seconds) were spent in user mode
91,655 microseconds ( 0.091655 seconds) were spent in system mode
80,648,736 bytes of memory allocated.
9,373 minor page faults, 0 major page faults, 0 swaps.
I hope you can get everything into a stable working condition at an appropriate speed.
@bobbysmith007 Thank you so much for testing it. As I mentioned before, I downgraded to an earlier Quicklisp dist, and that is working in production now without issue. It could be that one of the other libraries I listed is significantly slower in the 2018-01-31 dist and not cl-csv; I don't know. It could also be that using :row-fn is slower than using the old :map-fn from 2016. Either way, I delivered my update into production for the client and all is well. Thanks again; I appreciate the availability of cl-csv and your work on it.