davidsantiago / clojure-csv

A library for reading and writing CSV files from Clojure
187 stars 35 forks

Lazy reading from file & test fixes. #6

Closed i0cus closed 13 years ago

i0cus commented 13 years ago

Hi David,

I have added some code that lets me parse CSV without reading the entire file into memory at once (a multimethod dispatching on java.io.Reader), and wrote some tests for that.
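The core idea can be sketched as follows. This is a minimal illustration, not the actual code from this PR: `reader-char-seq` is a hypothetical helper name, and the only assumption is that the parsing logic can consume a lazy sequence of characters.

```clojure
;; Hypothetical sketch: turn a java.io.Reader into a lazy seq of chars,
;; so parsing logic that walks a char seq can consume a file incrementally.
(defn reader-char-seq
  "Lazily reads characters from rdr one at a time, ending at EOF."
  [^java.io.Reader rdr]
  (lazy-seq
    (let [c (.read rdr)]
      (when-not (== c -1)
        (cons (char c) (reader-char-seq rdr))))))
```

A multimethod dispatching on the input type could then route Readers through this path while Strings keep the existing one.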

Also fixed your two tests regarding exception throwing during strict parsing.

I'm not saying that you should merge this commit as is, but having a way to work with files lazily (read and maybe even write) would be awesome, and this is one way to do it (without changing any of your parsing logic and mechanisms, just a small add-on). Also -- please take a look at the fixed tests.

Regards, Slawek.

davidsantiago commented 13 years ago

This looks like good code. Thanks for the test fixes, I forgot those tests were even in there commented out.

I'm very ambivalent about the lazy file reader, though, for exactly the reason that test shows. If you parse a file and get back a lazy value based on a lazy file reader, then when you go to realize it, it will fail because the file will no longer be open. There is apparently some work on fixing this problem in future Clojure releases, but for now I worry that this might just be more confusing. Can you convince me that this feature is useful enough overall to make the danger worth it? Do you have a use case in mind?
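Concretely, the failure mode being described looks like the following. This is a generic sketch using `clojure.core/line-seq`, which is also backed by the open Reader, rather than anything from this PR:

```clojure
(require '[clojure.java.io :as io])

;; The lazy result escapes with-open, so by the time anyone realizes it,
;; the underlying Reader has already been closed.
(def rows
  (with-open [rdr (io/reader "data.csv")]
    (line-seq rdr)))  ; unrealized lazy seq escapes the with-open scope

;; (first rows) now throws java.io.IOException: Stream closed
```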

- David 
i0cus commented 13 years ago

No worries, I've been tinkering with it since I put the issue out -- didn't have time to follow through because of my day job. ;-)

As I said before -- I am actually proposing this for your consideration as "one way of doing things".

Well, for me the behavior that test exercises is actually a feature: you don't leak the file descriptor while processing, because it is closed by "with-open", so processing must be finalized within it. I don't see an issue with that as long as it's documented to work as advertised.
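In other words, the documented contract would be: force everything you need before `with-open` closes the file. A minimal sketch, again using `clojure.core/line-seq` as a stand-in for any Reader-backed lazy seq:

```clojure
(require '[clojure.java.io :as io])

;; Correct usage under that contract: realize the result inside with-open,
;; e.g. with doall, so nothing lazy escapes the open Reader's scope.
(with-open [rdr (io/reader "data.csv")]
  (doall (line-seq rdr)))
```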

One use case I have in mind is "files that do not fit in memory" -- currently you have to use another library for that, since clojure-csv only works with in-memory Strings as both input and output.

This also gives you the ability to parse files that you do not want to read all at once, because it uses the Reader's .read method to get input character by character, and your processing function does not keep the head of the resulting seq. If this could be paired with lazy writing, you would have a nice solution for processing CSV in and out with a low memory footprint on the IO side. Caveat: this is not getting JITed well and is kind of slow; it needs to be benchmarked. I have no ideas for speed improvements, though.
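A streaming in-to-out pipeline along those lines could look like this sketch. It assumes a `parse-csv` that accepts a Reader (the capability this PR adds) and the library's `write-csv`, which renders rows back to a CSV string; the filenames are placeholders.

```clojure
(require '[clojure.java.io :as io]
         '[clojure-csv.core :refer [parse-csv write-csv]])

;; Low-memory pipeline sketch: rows flow through one at a time. doseq does
;; not retain the head of the lazy seq, so memory use stays bounded by the
;; size of a single row rather than the size of the file.
(with-open [in  (io/reader "in.csv")
            out (io/writer "out.csv")]
  (doseq [row (parse-csv in)]
    (.write out ^String (write-csv [row]))))
```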

Bonus: my proposal does not change anything for people who don't need this -- slurping files and feeding the resulting string into process-csv remains the preferred method.
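The unchanged path referred to here is simply the existing in-memory usage, sketched below (the filename is a placeholder):

```clojure
(require '[clojure-csv.core :refer [parse-csv]])

;; Existing usage stays the same: read the whole file into a String first,
;; then parse it in memory.
(parse-csv (slurp "data.csv"))
```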

davidsantiago commented 13 years ago

Fair enough, I'm convinced. I suppose it's possible there are really big CSVs out there, and we should definitely not require a gigabyte string or whatever to work through them. I'll see about merging in your changes tomorrow.