Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0
565 stars 55 forks source link

add native support for loading CSV files #100

Closed sarnowski closed 9 years ago

sarnowski commented 9 years ago

This commit adds support for "native" CSV files backed by 'clojure.data.csv' in order to have RFC 4180 compliant reading.

This especially helps with CSV files that mix or have quoting for the cells which cannot easily be solved by using the TSV supported delimiter.

https://github.com/clojure/data.csv/

mbossenbroek commented 9 years ago

Nicely done - thanks for the addition!

The only change I can think of is possibly a new name, but I can't think of one off the top of my head. Now we'll have load-tsv, load-csv, and load-lazy; all of which can load tsv's, csv's, or files with other delimiters. So it would look weird to have (load-csv "foo.tsv" \t \") or (load-tsv "foo.csv" "\t") and it's not clear what the difference would be.

Part of me wants to combine them such that there's a single fn that changes behavior based on the number of args passed. 1 & 2 arities do the deafult; 3 accepts the quoting char; maybe a 4th options arg to invoke the lazy version? But it appears that your version takes chars, whereas mine takes a regex. I guess we could take either & convert to the other depending on the impl. I'm not in love with this approach - more of a strawman...

I'm also wondering if we could simply use clojure.data.csv to do the work of load-tsv. Does it support regex splits? Can you disable the quoting (to support the existing functionality)?

Other ideas for names: load-split - but that could be confused with a map-reduce split. load-delimited - clear, but too long.

That said, I'm fine with merging/releasing this as-is and cleaning it up in pigpen v3, due out in a few months. Let me know if you have any ideas for combining these, otherwise I'll go ahead with the merge.

Thanks, Matt

sarnowski commented 9 years ago

I really would not try to press everything into one magic function ;-)

For your questions: Neither regex splits nor disabling quoting is supported by clojure.data.csv.

I also added a test to verify that the quote escaping (two quotes according to RFC) works as defined and added a hint to the load-csv documentation that unlike stated in the RFC, multiline cells are not supported.

mbossenbroek commented 9 years ago

Agreed - I'll think about it some more, but for now it looks good. Thanks again!

mbossenbroek commented 9 years ago

Released in 0.2.12: http://netflix.github.io/PigPen/pigpen.core.html#var-load-csv