AccelerationNet / cl-csv

A Common Lisp library providing easy CSV reading and writing

cl-csv:read-csv not skipping header when :skip-first-p t and :map-fn lambda provided #25

Closed analyticd closed 6 years ago

analyticd commented 6 years ago

Not sure if this is a bug or just a misunderstanding on my part, but with an earlier version from quicklisp (perhaps 2016-ish, I'm not sure), this used to work as I expected; it no longer does with either the quicklisp 2018-01-31 dist or the latest (as of today) from github. Here is a test case to demonstrate:

(cl-csv:read-csv "col1,col2,col3
4,5,6" :skip-first-p t :map-fn #'(lambda (row) (princ row)))
=> (col1 col2 col3)(4 5 6)
(("4" "5" "6"))

I was expecting output of:

=> (4 5 6)

since :skip-first-p was set to t.

bobbysmith007 commented 6 years ago

Hrmm... I think that perhaps map-fn got pushed forward in the processing, so that it now runs before the skip-row logic is applied: https://github.com/AccelerationNet/cl-csv/blob/master/parser.lisp#L189

This was not necessarily intentional, as the whole process was rewritten. I think my logic was that the map function should be able to skip a row on certain data conditions: I was thinking of using map to convert the data in process (e.g., change data types) and sometimes skip the whole row on some condition (e.g., a data type conversion failed). I can see an argument either way and am not sure what makes the most sense there. The row-fn and the data collection take place after the skip-row.
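
For illustration, a minimal sketch of that convert-and-maybe-skip idea, assuming :map-fn receives each row as a list of strings (as the test case above shows). Whether returning NIL would actually drop the row is exactly the open design question here, not confirmed cl-csv behavior:

(cl-csv:read-csv "1,2,3
x,5,6"
 :map-fn (lambda (row)
           ;; Convert every field to an integer; on a failed
           ;; conversion, return NIL in the hope that the row
           ;; could then be skipped.
           (handler-case (mapcar #'parse-integer row)
             (parse-error () nil))))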

What do you think? I agree that the current situation is less intuitive, but I was hoping it might provide slightly more flexibility; maybe not, though.

analyticd commented 6 years ago

@bobbysmith007, I don't mind the evolution of the library. I actually didn't know about :row-fn, so I am fine with using that instead, and I see it gives the behavior I want. Thanks for your response.
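
For later readers, the workaround is the original test case with :row-fn substituted for :map-fn; per the discussion above, :skip-first-p is honored on that path. (I believe with :row-fn the rows are handed to the function rather than collected, so the return value differs from the :map-fn call.)

(cl-csv:read-csv "col1,col2,col3
4,5,6"
 :skip-first-p t
 :row-fn (lambda (row) (princ row)))
;; prints only (4 5 6); the header row is skipped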

analyticd commented 6 years ago

I'll just mention that I did end up going back to the version of cl-csv from quicklisp dist 2017-05-16, because I found it had dramatically faster perceived (non-measured) execution, seemingly two to three times faster at least. So maybe additions since last May have really slowed down read-csv with a :map-fn function? (Back then :map-fn respected :skip-first-p, so I just stuck with that instead of the :row-fn I used in the newest cl-csv test.) Thought you might want to know; I have a performance-sensitive production environment.

bobbysmith007 commented 6 years ago

Interesting. I actually expected my new parser to be more general but slower; however, when I tested it, it was faster than my older "optimized" approach. Originally I had it switching implementations based on which parser would be faster, but the new one seemed to always be faster (on recent SBCL), so I just retired the old parser.

I will run my speed tests again in case I had some debug setting wrong that changed the performance.

The new parser is much easier to work on and much more flexible, so it would be nice if we could get that behavior without a performance penalty. (The new parser was written to accommodate multi-character data separators and more flexible line-ending handling.)
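
As an aside, a hypothetical example of the multi-character separator support mentioned above, assuming the :separator keyword accepts a multi-character string as the rewrite was meant to allow:

(cl-csv:read-csv "a||b||c
1||2||3"
 :separator "||")
;; => (("a" "b" "c") ("1" "2" "3"))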


When I was optimizing this 5 years ago, read-line / reading into a buffer seemed much faster than reading by character. When I recently tested the new parser, these seemed to be about the same speed; perhaps this is the implementation difference I am seeing from you.
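
For anyone wanting to reproduce that comparison outside the test suite, a minimal, implementation-agnostic sketch in plain Common Lisp (the path argument is a placeholder):

(defun compare-read-strategies (path)
  ;; Time reading the whole file a line at a time.
  (with-open-file (s path)
    (time (loop for line = (read-line s nil)
                while line
                count line)))
  ;; Time reading the same file a character at a time.
  (with-open-file (s path)
    (time (loop for ch = (read-char s nil)
                while ch
                count ch))))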

bobbysmith007 commented 6 years ago

Looking further and running some speed tests on the new version in recent SBCL, the new version seems on par with or faster than the old one (which was harder to understand, read, and debug).

SPEED-TESTS> (collect-big-file-csv-rows-old)

10:55:58 BEGIN read-csv-old large file test
10:56:04   END read-csv-old large file test
Evaluation took:
  6.090 seconds of real time
  6.092000 seconds of total run time (6.060000 user, 0.032000 system)
  [ Run times consist of 0.076 seconds GC time, and 6.016 seconds non-GC time. ]
  100.03% CPU
  20,660,760,770 processor cycles
  101,775,344 bytes consed

NIL
SPEED-TESTS> (collect-big-file-csv-rows)
10:56:07 BEGIN read-csv large file test
10:56:13   END read-csv large file test
Evaluation took:
  6.130 seconds of real time
  6.132000 seconds of total run time (6.024000 user, 0.108000 system)
  [ Run times consist of 0.216 seconds GC time, and 5.916 seconds non-GC time. ]
  100.03% CPU
  20,796,471,968 processor cycles
  89,034,240 bytes consed

SPEED-TESTS> (read-by-line-and-buffer-and-char)
;;; Just reading a big file

10:59:20 BEGIN read large file by lines
10:59:20   END read large file by lines
Evaluation took:
  0.235 seconds of real time
  0.232000 seconds of total run time (0.228000 user, 0.004000 system)
  [ Run times consist of 0.052 seconds GC time, and 0.180 seconds non-GC time. ]
  98.72% CPU
  797,098,094 processor cycles
  50,999,696 bytes consed

10:59:20 BEGIN read large file by buffer
10:59:21   END read large file by buffer
Evaluation took:
  0.075 seconds of real time
  0.076000 seconds of total run time (0.072000 user, 0.004000 system)
  101.33% CPU
  255,064,516 processor cycles
  0 bytes consed

10:59:21 BEGIN read large file by read-into-buffer-until
10:59:21   END read large file by read-into-buffer-until
Evaluation took:
  0.198 seconds of real time
  0.200000 seconds of total run time (0.196000 user, 0.004000 system)
  101.01% CPU
  669,845,392 processor cycles
  0 bytes consed

10:59:21 BEGIN read large file by char
10:59:21   END read large file by char
Evaluation took:
  0.186 seconds of real time
  0.184000 seconds of total run time (0.184000 user, 0.000000 system)
  98.92% CPU
  631,960,580 processor cycles
  0 bytes consed

 lines:120001 , buffers:133502, buffered-lines:120001, chars:10680089

I would be curious to know what your implementation is and what the speed tests report for you. I pushed a patch to make the speed tests more readily runnable by others:

(ql:quickload :cl-csv/speed-test)
(cl-csv-test.speed-tests::run-speed-tests)

My supposition, however, is that one of the underlying mechanisms (e.g., reading by char vs. by buffer) is what is causing this slowdown.

bobbysmith007 commented 6 years ago

If it turns out that read-char is the suspect on other implementations, I have added a branch that uses read-sequence instead, which may exhibit better performance.

https://github.com/AccelerationNet/cl-csv/tree/buffered-reads-for-parser
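
A rough sketch of the read-sequence pattern such a branch would be built around (the file name, buffer size, and processing step are illustrative, not the branch's actual code):

(with-open-file (s "big.csv" :element-type 'character)
  (let ((buffer (make-string 4096)))
    (loop for fill = (read-sequence buffer s)
          while (plusp fill)
          ;; Parse buffer[0..fill) here, paying one
          ;; read-sequence call per block instead of one
          ;; read-char call per character.
          sum fill)))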

analyticd commented 6 years ago

Thanks so much for taking a look at this @bobbysmith007, I hope to get a chance to look at the speed test here locally very soon, then I will report back. Appreciate the work you are doing.

analyticd commented 6 years ago

I am using (need to use) CCL for this project. Changing quicklisp dists changes the versions of all libraries used in the project, so it could just be that, between 2016-02-08 and 2018-01-31, one of the other libraries I am using in my project changed to a much slower version: cl-ppcre, parse-number, or uiop.

In the context of quicklisp dist 2018-01-31, when I pull the latest from the cl-csv git repo, then (load "/path/to/cl-csv.asd"), then (ql:quickload :cl-csv), it barfs on some cl-unicode package errors. I'm not sure how to proceed to try the speed test via (ql:quickload :cl-csv/speed-test), which of course also doesn't load if cl-csv doesn't. Anyway, maybe it is enough for you to know that I am using CCL.

bobbysmith007 commented 6 years ago

Hrmm... I wasn't aware I was loading cl-unicode somewhere down the line, so I definitely don't have assistance to offer on that one. I found that it's loaded through cl-interpol, which is probably not strictly required (just a convenience).

I may just try to install CCL and see if I can find what's happening. Or I may drop cl-interpol as a dependency, as it's not really necessary, only a nice convenience.

Thanks for the effort. I will report back with my eventual findings, such as they may be.

bobbysmith007 commented 6 years ago

Also, which version of CCL?

analyticd commented 6 years ago

1.11.5 64-bit

bobbysmith007 commented 6 years ago

I was able to run the cl-csv speed tests on CCL 1.11.5 64-bit, downloaded from their webpage and run as the binary distribution. I did not experience the cl-unicode issues you did, so maybe you have a compilation cache that needs to be cleared on that front.

My tests showed CCL distinctly slower at this task than SBCL (34s vs 6s) for reading the whole large CSV. That said, the old implementation was not appreciably different in speed (34.2s vs 33.8s).

I am sorry your workload experienced a slowdown. I am on Ubuntu Linux 16.04, 64-bit, if that makes a difference. I'm not sure how to approach fixing it if I can't find an appreciable difference.


17:42:43 BEGIN read large file by lines
17:42:44   END read large file by lines
(PROGN (LET (LINE)
         (CL-CSV::WITH-CSV-INPUT-STREAM
           (S +TEST-BIG-FILE+)
           (HANDLER-CASE (LOOP WHILE (SETF LINE (READ-LINE S))
                               DO (INCF CNT))
             (END-OF-FILE (C) (DECLARE (IGNORE C)))))))
took 497,846 microseconds (0.497846 seconds) to run.
      23,767 microseconds (0.023767 seconds, 4.77%) of which was spent in GC.
During that period, and with 2 available CPU cores,
     495,798 microseconds (0.495798 seconds) were spent in user mode
       3,991 microseconds (0.003991 seconds) were spent in system mode
 44,164,720 bytes of memory allocated.

17:42:44 BEGIN read large file by buffer
17:42:44   END read large file by buffer
(PROGN (CL-CSV::WITH-CSV-INPUT-STREAM
         (S +TEST-BIG-FILE+)
         (ITER (WITH BUFFER = (MAKE-ARRAY 80 :ELEMENT-TYPE 'CHARACTER))
               (WITH FILL)
               (HANDLER-CASE (SETF FILL (READ-SEQUENCE BUFFER S))
                             (END-OF-FILE NIL (FINISH)))
               (INCF CNT2)
               (WHILE (= 80 FILL)))))
took 239,816 microseconds (0.239816 seconds) to run.
During that period, and with 2 available CPU cores,
     241,084 microseconds (0.241084 seconds) were spent in user mode
          19 microseconds (0.000019 seconds) were spent in system mode
 4,144 bytes of memory allocated.

17:42:44 BEGIN read large file by read-into-buffer-until
17:42:44   END read large file by read-into-buffer-until
(PROGN (CL-CSV::WITH-CSV-INPUT-STREAM
         (S +TEST-BIG-FILE+)
         (LET ((BUFFER (MAKE-STRING CL-CSV::*BUFFER-SIZE*)))
           (HANDLER-CASE (LOOP WHILE (PLUSP (CL-CSV::READ-INTO-BUFFER-UNTIL
                                             BUFFER
                                             S))
                               DO (INCF CNT3))
             (END-OF-FILE (C) (DECLARE (IGNORE C)))))))
took 384,333 microseconds (0.384333 seconds) to run.
During that period, and with 2 available CPU cores,
     370,654 microseconds (0.370654 seconds) were spent in user mode
       7,943 microseconds (0.007943 seconds) were spent in system mode
 5,744 bytes of memory allocated.

17:42:44 BEGIN read large file by char
17:42:45   END read large file by char
(PROGN (CL-CSV::WITH-CSV-INPUT-STREAM
         (S +TEST-BIG-FILE+)
         (LET ((C))
           (HANDLER-CASE (LOOP WHILE (SETF C (READ-CHAR S))
                               DO (INCF CNT4))
             (END-OF-FILE (C) (DECLARE (IGNORE C)))))))
took 287,511 microseconds (0.287511 seconds) to run.
During that period, and with 2 available CPU cores,
     280,574 microseconds (0.280574 seconds) were spent in user mode
       7,946 microseconds (0.007946 seconds) were spent in system mode
 3,680 bytes of memory allocated.

 lines:120001 , buffers:133502, buffered-lines:120001, chars:10680089
CL-CSV-TEST.SPEED-TESTS::READ-BY-LINE-AND-BUFFER-AND-CHAR - WARNINGS (1.43s) : 0 assertions passed
  | WARNINGS (1)
  | WARNING: In CL-CSV::BODY inside an anonymous lambda form inside READ-BY-LINE-AND-BUFFER-AND-CHAR-TEST-THUNK: Unused lexical variable C
  | #<STYLE-WARNING #x302001DF844D>
  |
  |
Starting: CL-CSV-TEST.SPEED-TESTS::COLLECT-BIG-FILE-CSV-ROWS
17:42:45 BEGIN read-csv large file test
17:43:19   END read-csv large file test
(PROGN (READ-CSV +TEST-BIG-FILE+))
took 34,273,467 microseconds (34.273468 seconds) to run.
        485,037 microseconds ( 0.485037 seconds, 1.42%) of which was spent in GC.
During that period, and with 2 available CPU cores,
     33,935,609 microseconds (33.935608 seconds) were spent in user mode
        111,776 microseconds ( 0.111776 seconds) were spent in system mode
 75,590,032 bytes of memory allocated.
 7,156 minor page faults, 0 major page faults, 0 swaps.

CL-CSV-TEST.SPEED-TESTS::COLLECT-BIG-FILE-CSV-ROWS - EMPTY (34.28s) : 0 assertions passed

Starting: CL-CSV-TEST.SPEED-TESTS::COLLECT-BIG-FILE-CSV-ROWS-OLD
17:43:19 BEGIN read-csv-old large file test
17:43:53   END read-csv-old large file test
(PROGN (CL-CSV::READ-CSV-OLD +TEST-BIG-FILE+))
took 33,803,458 microseconds (33.803455 seconds) to run.
        506,981 microseconds ( 0.506981 seconds, 1.50%) of which was spent in GC.
During that period, and with 2 available CPU cores,
     33,465,215 microseconds (33.465214 seconds) were spent in user mode
         91,655 microseconds ( 0.091655 seconds) were spent in system mode
 80,648,736 bytes of memory allocated.
 9,373 minor page faults, 0 major page faults, 0 swaps.

Hope you can get everything into a stable working condition at an appropriate speed.

analyticd commented 6 years ago

@bobbysmith007 Thank you so much for testing it. As I mentioned before, I downgraded to an earlier quicklisp dist; that was fine in production and is working there now without issue. It could be that one of those other libraries I listed is significantly slower in the 2018-01-31 quicklisp dist, and not cl-csv; I don't know. It could also be that using :row-fn is slower than the old :map-fn from 2016. Either way, I delivered my update into production for the client and all is well. Thanks again. I appreciate the availability of cl-csv and your work on it.