AccelerationNet / cl-csv

A common lisp library providing easy csv reading and writing
Other
116 stars 22 forks source link

'continue' restart doesn't work in cl-csv:read-csv (without :sample) #35

Open mjanzen opened 5 years ago

mjanzen commented 5 years ago

I'm running cl-csv-20180831-git (the most recent version I see in Quicklisp) on SBCL 1.5.4. I've found the library to be really useful, and easy to work with; but I've noticed one discrepancy between its documentation and the code.

The cl-csv README.md file says: in-csv iterate clause and read-csv support continue and filter restarts for errors occurring during read-csv-row

So, I added a handler-bind to my calling code which would catch csv-parse-error and invoke the continue restart. When I tried to read a bad CSV file I found that, as expected, the restart worked correctly for in-csv; while for read-csv-row and do-csv the Lisp debugger reported that no CONTINUE restart was active.

However, the debugger also complained that the restart wasn't active in read-csv, although the documentation says this should have worked. Interestingly, I found that if I call (read-csv stream :sample 999999), then the restart works again.

A quick look at csv.lisp shows that the in-csv macro invokes an internal function called restartable-read-row; and in-csv is used by read-csv-sample. In read-csv, if :sample is specified then read-csv-sample is called; but in the default case, read-csv-with-reader is called instead, and it appears that this does not support the restart.

This explains the behaviour I observed. But is this a bug in the read-csv code, or is it a documentation error - should callers of read-csv not be able to use this restart?

My opinion is that the restart is generally useful and that, as well as being fixed for read-csv, it could arguably be extended to read-csv-row and do-csv too. Am I missing something - is there a reason why it should be restricted to in-csv and read-csv? (Performance considerations, maybe?)

I'd love to be able to submit a pull request with this; but being a newcomer both to cl-csv and to Lisp itself, I'm not sure that would end well. Please have a look and see whether/how the restart can be made available; thanks very much.

bobbysmith007 commented 5 years ago

Thanks for the well written ticket! I don't have much lisp time in my life anymore, and its hard to say when I will have time to update this.

I am not sure the answer to your question as to why its like that. It probably lies in the fact that the parser/readers were rewritten a time or two, and that perhaps one of those dropped the continues restarts when we added the *enable-signals* variable.

I'ld be happy to look at any patches that pass the test suite that you submit in the mean time. I do think that having the restarts makes sense (depending on the speed implications of having them). The could/should be triggered by the existing variable *enable-signals* which already switches the restart filter on and off. This would need some good tests to figure out if it breaks the inner state of the csv-reader.

My best guess looking now:

parser.lisp#420

(defun read-csv-row-with-reader (stream-or-string
                                &key csv-reader
                                (map-fn nil map-fn-p)
                                (data-map-fn nil data-map-fn-p)
                                &allow-other-keys)
  "Read a row of csv from the input"
  (flet ((return-row (it)
           (return-from read-csv-row-with-reader it)))
    (unless csv-reader
      (setf csv-reader (make-default-csv-reader)))
    (when map-fn-p (setf (map-fn csv-reader) map-fn))
    (when data-map-fn-p (setf (data-map-fn csv-reader) data-map-fn))
    (setf (row-fn csv-reader) #'return-row)
    (with-csv-input-stream (in-stream stream-or-string)
      (if *enable-signals*
          (with-simple-restart (continue)
            (read-with-dispatch-table csv-reader in-stream))
          (read-with-dispatch-table csv-reader in-stream)))))

I'ld like to get to this, but with it not being paying work, its hard to devote time to it right now.

mjanzen commented 5 years ago

Understood, Russ; I know what you mean, and appreciate the quick proposal for a fix.

Before I change anything, I'd like to make sure I understand the relationship between cl-csv's signals and restarts correctly. I can see that the *enable-signals* flag must be non-nil in order for the csv-data-read and csv-row-read conditions to be signalled, and thus for the caller to be able to examine and replace the CSV values for a given row by handling one or both conditions and calling the filter restart. This makes good sense, because in normal usage most callers are unlikely to want to replace successfully-parsed incoming data, and they shouldn't have to pay for the overhead.

However, in the current restartable-read-row, neither the continue nor the filter restart is affected by *enable-signals*-- and I would argue that this is the correct behaviour when the parser has detected invalid CSV data and signalled a csv-parse-error. At this point performance goes out the window: it's most important to report the error and to allow the caller to decide whether, and how, to proceed (with the default behaviour being to terminate the call to read-csv with an error, as at present). This should remain independent of the value of *enable-signals*.

Ideally there would always be two ways to recover, with the continue restart keeping the fields parsed up until the error was detected and discarding the rest of the row, and filter replacing the bad row with its argument, causing the row to be discarded if the argument is nil.

(I suppose one could argue that a nil value for *enable-signals* indicates that even csv-parse-error should not be signalled, implying that bad CSV rows are quietly ignored. But this is a significant change to the current behaviour, it's arguably dangerous, and it'll quietly break existing code; so it seems a rather bad idea.)

I'd need to study the code a bit more to see how the restarts might be made to work consistently across all the various external functions/macros. It's a bit more involved than simply wrapping the call in with-simple-restart; and as this is also not my day job, I make no promises - though it's certainly helping me to learn about Common Lisp's error handling!

bobbysmith007 commented 5 years ago

I think that you are correct in the continue should be always be in effect. However, if you are sure that your CSVs do not contain errors - or are willing to accept failure if not. In a close loop, adding a continue restart makes the loop take 4 times as long (in my SBCL 1.4.2 on Ubuntu18)

Thus if you were trying to write "fast as possible" code, you would probably want to turn off the continue bindings. Whether or not that makes sense as a trade off is probably work load related.

That said, it seems (based on the other open ticket) that the optimizations previously made dont necessarily hold anymore and perhaps its better to code for usefulness rather than performance

(progn (time (iter (for i from 0 to 100000)
                            (* i i)))
                (time (iter (for i from 0 to 100000)
                            (with-simple-restart (continue "Continue") 
                              (* i i)))))
NIL
Evaluation took:
  0.001 seconds of real time
  0.004000 seconds of total run time (0.004000 user, 0.000000 system)
  400.00% CPU
  3,876,560 processor cycles
  0 bytes consed

Evaluation took:
  0.004 seconds of real time
  0.004000 seconds of total run time (0.004000 user, 0.000000 system)
  100.00% CPU
  13,983,384 processor cycles
  3,178,496 bytes consed
bobbysmith007 commented 5 years ago

Hrm... Another interesting aspect is that: CL-CSV-TEST::CSV-CONTINUE-SIGNALS passes without any changes and using only "read-csv", but I think that is an aspect of the continue restart in the test case as the assertions are not passing..

I also see that this is difficult because when we switched from a state machine. The continue, needs to reset the parse table / reader and advance the stream to the next start of row