Fairly slow on larger CSV files

pnathan commented 13 years ago

Hi,

I've been working with largish CSV files (> 10K lines), and speed is getting to be a critical priority.

I grabbed the latest quicklisp distro of cl-csv, and despite all my type assertions in SBCL, I can not speed it up as fast as I really want it to be.

My approach here is to stream the file through and only pick out the datapoints I need.

Using this mechanism, I complete the batch job on my Big File (has 145K lines) in 8.4 seconds in Common Lisp, 3.7 seconds in Haskell, and 0.7 in CPython

I looked around the cl-csv code a bit and noticed it didn't seem very tuned. Is this my eyes or has it not had a signfiicant optimization pass on it?

I wouldn't mind collaborating to tune CL-CSV to optimize it.

I've included my code here so you can see what my approach is.

(defun process-csv (csv-name process-function )
  (proclaim '(optimize (speed 3) (safety 0) (debug 0)))
  (let ((collector))
    (declare (type list collector))
    (labels ((row-processor (row)
               (let ((result (the list (funcall process-function row))))
                 (declare (type list result))
                 (if result
                     (push result collector)))))

      (with-open-file (fin csv-name)
        (cl-csv:read-csv fin
                         :row-fn #'row-processor))
      collector)))

(defun return-blanks (filename)
  (proclaim '(optimize (speed 3) (safety 0) (debug 0)))
  (labels ((row-checker (row)
             (declare (type list row))
             (if (string-equal (the string (sixth row)) "")
                 (list (first row)
                       (sixth row)))))
    (process-csv
     filename
     #'row-checker)))

bobbysmith007 commented 13 years ago

I have used this project to read/write fairly large CSVs but speed has never been a priority for me as it has always been fast enough, or slowed down enough by the other processes involved, that it seemed unneeded. I am certainly willing to optimize (or include patches that do), but probably not at the expense of general ease of use.

Do those proclamations actually increase the speed of those? I have used such proclamations before and have generally found them to be only a minor help (ie (for me) not worth worrying about). I am definitely interested in any input you can provide (eg through profiling) about where the slowdown is occurring.

Thanks, Russ

pnathan commented 13 years ago

On my Lisp system ( SBCL 51 on amd64 Linux (SBCL was built from source)), the proclamations, declarations, and THEs have shaved some minor time off my code, and have trimmed the DISASSEMBLE outputs down a little bit.

Here is my profile from sb-profile:report (it only goes down to the function level)-

measuring PROFILE overhead..done
  seconds  |     gc     |     consed    |  calls  |  sec/call  |  name  
-------------------------------------------------------------
     8.862 |      0.060 | 1,054,208,944 | 145,770 |   0.000061 | CL-CSV:READ-CSV-ROW
     0.072 |      0.000 |             0 |       1 |   0.071692 | CL-CSV:READ-CSV
     0.000 |      0.000 |             0 |       1 |   0.000000 | RETURN-BLANKS
     0.000 |      0.000 |             0 |       1 |   0.000000 | PROCESS-CSV
-------------------------------------------------------------
     8.933 |      0.060 | 1,054,208,944 | 145,773 |            | Total

estimated total profiling overhead: 0.12 seconds
overhead estimation parameters:
  0.0s/call, 8.e-7s total profiling, 4.e-7s internal profiling

And now, from the statistical SBCL profiler, one ton of information.

One of the big checks is sequencep. I'm wondering if type decl wouldn't remove that.

Number of samples:   942
Sample interval:     0.01 seconds
Total sampling time: 9.42 seconds
Number of cycles:    0
Sampled threads:
 #<SB-THREAD:THREAD "initial thread" RUNNING {10029311B1}>

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1    106  11.3    106  11.3    106  11.3        -  SB-KERNEL:SEQUENCEP
   2     99  10.5    338  35.9    205  21.8        -  CL-CSV::%ESCAPE-SEQ?
   3     88   9.3    114  12.1    293  31.1        -  ELT
   4     80   8.5    132  14.0    373  39.6        -  LENGTH
   5     61   6.5     61   6.5    434  46.1        -  SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
   6     57   6.1     57   6.1    491  52.1        -  SB-VM::GENERIC-+
   7     52   5.5    121  12.8    543  57.6        -  CL-CSV::%CHAR-AT
   8     50   5.3     50   5.3    593  63.0        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS
   9     44   4.7    899  95.4    637  67.6        -  CL-CSV:READ-CSV-ROW
  10     40   4.2     42   4.5    677  71.9        -  SB-IMPL::FD-STREAM-READ-N-CHARACTERS/UTF-8
  11     34   3.6     34   3.6    711  75.5        -  "foreign function pthread_sigmask"
  12     30   3.2    168  17.8    741  78.7        -  VECTOR-PUSH-EXTEND
  13     22   2.3     51   5.4    763  81.0        -  SB-KERNEL:STRING-SUBSEQ*
  14     21   2.2     21   2.2    784  83.2        -  SB-KERNEL:UB32-BASH-COPY
  15     18   1.9     18   1.9    802  85.1        -  SB-IMPL::SLOW-HAIRY-DATA-VECTOR-SET
  16     17   1.8     84   8.9    819  86.9        -  (LABELS CL-CSV::FINISH-ITEM)
  17     13   1.4     13   1.4    832  88.3        -  SB-IMPL::OPTIMIZED-DATA-VECTOR-SET
  18     12   1.3     23   2.4    844  89.6        -  REPLACE
  19     11   1.2     11   1.2    855  90.8        -  "foreign function getrusage"
  20     11   1.2     11   1.2    866  91.9        -  SB-KERNEL:HAIRY-DATA-VECTOR-SET
  21     10   1.1     60   6.4    876  93.0        -  SB-IMPL::ANSI-STREAM-READ-LINE-FROM-FRC-BUFFER
  22      8   0.8      8   0.8    884  93.8        -  STRING
  23      8   0.8      8   0.8    892  94.7        -  TRUNCATE
  24      5   0.5     14   1.5    897  95.2        -  MAKE-ARRAY
  25      4   0.4    939  99.7    901  95.6        -  (LAMBDA (SB-INT:&MORE SB-PROFILE::ARG-CONTEXT SB-PROFILE::ARG-COUNT))
  26      4   0.4     52   5.5    905  96.1        -  ADJUST-ARRAY
  27      4   0.4     10   1.1    909  96.5        -  SB-KERNEL:%CEILING
  28      4   0.4      4   0.4    913  96.9        -  SB-KERNEL:%WITH-ARRAY-DATA
  29      3   0.3      3   0.3    916  97.2        -  SB-KERNEL:CSUBTYPEP
  30      3   0.3      3   0.3    919  97.6        -  SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
  31      3   0.3      3   0.3    922  97.9        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF
  32      2   0.2      2   0.2    924  98.1        -  (FLET #:CLEANUP-FUN-[REFILL-INPUT-BUFFER]731)
  33      2   0.2      2   0.2    926  98.3        -  (LABELS SB-PROFILE::%INCF)
  34      2   0.2      2   0.2    928  98.5        -  STRING-EQUAL
  35      1   0.1    938  99.6    929  98.6        -  CL-CSV:READ-CSV
  36      1   0.1     18   1.9    930  98.7        -  SB-IMPL::DATA-VECTOR-FROM-INITS
  37      1   0.1      4   0.4    931  98.8        -  SB-KERNEL:VALUES-SPECIFIER-TYPE
  38      1   0.1      3   0.3    932  98.9        -  (LABELS ROW-CHECKER)
  39      1   0.1      1   0.1    933  99.0        -  NTHCDR
  40      1   0.1      1   0.1    934  99.2        -  SB-IMPL::%VECTOR-WIDETAG-AND-N-BITS
  41      1   0.1      1   0.1    935  99.3        -  "foreign function closure_tramp"
  42      1   0.1      1   0.1    936  99.4        -  (LABELS SB-IMPL::SXHASH-RECURSE)
  43      1   0.1      1   0.1    937  99.5        -  GET-INTERNAL-RUN-TIME
  44      1   0.1      1   0.1    938  99.6        -  SB-KERNEL:%SET-FILL-POINTER
  45      1   0.1      1   0.1    939  99.7        -  SB-IMPL::GET-NEW-FILL-POINTER
  46      1   0.1      1   0.1    940  99.8        -  SB-KERNEL:SET-ARRAY-HEADER
  47      1   0.1      1   0.1    941  99.9        -  SB-IMPL::%WALK-DISPLACED-ARRAY-BACKPOINTERS
  48      1   0.1      1   0.1    942 100.0        -  SB-INT:EQUAL-BUT-NO-CAR-RECURSION
  49      0   0.0    939  99.7    942 100.0        -  PROCESS-CSV
  50      0   0.0    939  99.7    942 100.0        -  (EVAL (SB-SPROF:WITH-PROFILING (:MAX-SAMPLES 10000 :REPORT :FLAT :LOOP NIL)))
  51      0   0.0    939  99.7    942 100.0        -  SB-INT:SIMPLE-EVAL-IN-LEXENV
  52      0   0.0    939  99.7    942 100.0        -  EVAL
  53      0   0.0    939  99.7    942 100.0        -  INTERACTIVE-EVAL
  54      0   0.0    939  99.7    942 100.0        -  SB-IMPL::REPL-FUN
  55      0   0.0    939  99.7    942 100.0        -  "Unknown component: #x10000A2750"
  56      0   0.0    939  99.7    942 100.0        -  SB-IMPL::%WITH-REBOUND-IO-SYNTAX
  57      0   0.0    939  99.7    942 100.0        -  SB-IMPL::TOPLEVEL-REPL
  58      0   0.0    939  99.7    942 100.0        -  SB-IMPL::TOPLEVEL-INIT
  59      0   0.0    939  99.7    942 100.0        -  (FLET #:WITHOUT-INTERRUPTS-BODY-[RESTART-LISP]30)
  60      0   0.0    939  99.7    942 100.0        -  (LABELS SB-IMPL::RESTART-LISP)
  61      0   0.0     70   7.4    942 100.0        -  (LABELS CL-CSV::READ-LINE-IN)
  62      0   0.0     42   4.5    942 100.0        -  SB-INT:FAST-READ-CHAR-REFILL
  63      0   0.0     34   3.6    942 100.0        -  "foreign function interrupt_handle_pending"
  64      0   0.0      9   1.0    942 100.0        -  (FLET #:CLEANUP-FUN-[EXIT-TAG-[PROFILE-ENCAPSULATION-LAMBDAS]171]178)
  65      0   0.0      8   0.8    942 100.0        -  SUBTYPEP
  66      0   0.0      5   0.5    942 100.0        -  SB-KERNEL:SPECIFIER-TYPE
  67      0   0.0      3   0.3    942 100.0        -  (LABELS ROW-PROCESSOR)
  68      0   0.0      2   0.2    942 100.0        -  SB-IMPL::REFILL-INPUT-BUFFER
  69      0   0.0      1   0.1    942 100.0        -  REDUCE
  70      0   0.0      1   0.1    942 100.0        -  SB-PROFILE::INCF-COUNTER
------------------------------------------------------------------------
          0   0.0                                     elsewhere

bobbysmith007 commented 13 years ago

Cool, thanks for the profiling!

I am seeing:

5-6 sec to read ~10M file with 240,000 rows in it counting each row
~8 sec to collect all 240,000 rows

What filesize (in bytes) are the files you are trying to speed up? I just want to make sure that I am testing similar situations

bobbysmith007 commented 13 years ago

Also note that your proclamations are currently affecting your entire lisp system (they are global not local).

http://www.lispworks.com/documentation/HyperSpec/Body/f_procla.htm

pnathan commented 13 years ago

I added some basic type checks to %char-at, %escape-seq?, and read-csv-row, shaving the time down to about 7.7 seconds. The number of bytes consed is 1,051,897,728 bytes consed, which I suspect to be the key issue here.

I ran another check, checking to see about time coming from READ-LINE, which is not the fastest cat in the house.

  seconds  |     gc     |     consed    |  calls  |  sec/call  |  name  
-------------------------------------------------------------
     7.313 |      0.180 |   745,290,272 | 145,770 |   0.000050 | CL-CSV:READ-CSV-ROW
     0.782 |      0.090 |   306,590,960 | 145,770 |   0.000005 | READ-LINE
     0.000 |      0.000 |             0 |       1 |   0.000000 | CL-CSV:READ-CSV
     0.000 |      0.000 |             0 |       1 |   0.000000 | PROCESS-CSV
     0.000 |      0.000 |             0 |       1 |   0.000000 | RETURN-BLANKS
-------------------------------------------------------------
     8.095 |      0.270 | 1,051,881,232 | 291,543 |            | Total

bobbysmith007 commented 13 years ago

I am currently running:

(in-package :cl-csv-test)
(declaim (optimize (speed 3) (safety 0) (debug 0)))

(defun process-csv (csv-name process-function )
  (let ((collector))
    (declare (type list collector))
    (labels ((row-processor (row)
               (let ((result
                       (the list (funcall (the function process-function) row))))
                 (declare (type list result))
                 (if result
                     (push result collector)))))
      (cl-csv:read-csv csv-name :row-fn #'row-processor)
      collector)))

(defun return-blanks (filename)
  (labels ((row-checker (row)
             (declare (type list row))
             (if (string-equal (the string (sixth row)) "")
                 (list (first row)
                       (sixth row)))))
    (process-csv filename #'row-checker)))

(defun test-pnathan-code ()
  (time-and-log-around (test-log "test-pnathan read")
    (return-blanks +test-big-file+)))

This is the same code with none of the declarations (in a separate file so not affected by the declaims)

(defun process-csv2 (csv-name process-function )
  (let ( collector )
    (labels ((row-processor (row)
               (let ((result (funcall process-function row)))
                 (when result
                   (push result collector)))))
      (cl-csv:read-csv csv-name :row-fn #'row-processor)
      collector)))

(defun return-blanks2 (filename)
  (labels ((row-checker (row &aux (6th (sixth row)) )
             (if (string-equal 6th "")
                 (list (first row) (sixth row)))))
    (process-csv2 filename #'row-checker)))

(defun test-pnathan-code2 ()
  (time-and-log-around (test-log "test-pnathan read")
    (return-blanks2 +test-big-file+)))

Running these two back to back I get that the one without declarations is the same speed or a touch faster:

16:07:38 BEGIN test-pnathan read
16:07:42   END test-pnathan read
Evaluation took:
  3.573 seconds of real time
   3.490000 seconds of total run time (3.280000 user, 0.210000 system)
   [ Run times consist of 0.260 seconds GC time, and 3.230 seconds non-GC time. ]
   97.68% CPU
   8,908,976,325 processor cycles
   229,370,544 bytes consed

16:07:42 BEGIN test-pnathan2 read
16:07:45   END test-pnathan2 read
Evaluation took:
  3.410 seconds of real time
   3.320000 seconds of total run time (3.210000 user, 0.110000 system)
   [ Run times consist of 0.200 seconds GC time, and 3.120 seconds non-GC time. ]
   97.36% CPU
   8,502,469,860 processor cycles
   229,333,696 bytes consed

It seems like the proclamations are not helping your code run faster, but across the whole system perhaps they are. However running the whole image at speed 3 / safety 0 can have bad consequences for code not prepared for this.

pnathan commented 13 years ago

To characterize the worst-case csv:

54,532,563 bytes 145,769 lines 38 columns

The internal data is company confidential, so I am unable to give you a copy file.

Edit: Sorry about the close/reopen, hit the wrong button.

This is my first time seriously working on optimizing Lisp code - I didn't know Proclaim was global. It... could be problematic for some code. :-/

bobbysmith007 commented 13 years ago

Just wanted to let you know that I added a minor optimization, but I don't think this will increase your speed too much (in my brief testing it was ~10-20%) faster.

bobbysmith007 commented 13 years ago

I just now see that you tried making the same functions a bit better. I think most of the improvement I squeezed out in the last patch was in regards to length checking (calling length once instead of mulitple times). In general I have not found type declarations to help matters nearly as much as removing duplicate effort. It has seemed to me that the SBCL compiler does a good job of correctly declaring types.

One thing that I know of that will increase speed (if your data allows) is by ensuring that we are getting simple-strings as opposed to strings. Currently asking on #lisp how to do this

bobbysmith007 commented 13 years ago

Actually looking into this further read-line seems to be returning simple-strings by default on my box, so probably nothing else to get there,

bobbysmith007 commented 13 years ago

Went back to declarations and (more) length call reduction in the tight / per character loop by taking your above proposition of type declarations in escape? and that seemed to shave off another ~10%.

I'm done for the day. I hope this helps make your life a bit easier. Please keep me informed about any further optimizations that you find.

Cheers, Russ

pnathan commented 13 years ago

Russ,

Thanks for your hard work. Looks like the current edition from git runs in ~60% of the time I was originally getting.

  5.345 seconds of real time
  5.340000 seconds of total run time (5.290000 user, 0.050000 system)
  [ Run times consist of 0.150 seconds GC time, and 5.190 seconds non-GC time. ]
  99.91% CPU
  15,994,870,092 processor cycles
  1,051,878,176 bytes consed

AccelerationNet / cl-csv

Fairly slow on larger CSV files #1