Option for skipping comment rows

d3 / d3-dsv

A parser and formatter for delimiter-separated values, such as CSV and TSV.

https://d3js.org/d3-dsv

ISC License


Option for skipping comment rows #68

Closed: tuner closed this issue 4 years ago

tuner commented 4 years ago

Hi!

Currently, d3-dsv adheres strictly to RFC 4180. However, I need some nonstandard but practical features, such as skipping comment lines or interpreting NAs as null (for example, in files created with R). Would you accept pull requests for such features?

This draft pull request adds an options object to dsvFormat with a single supported option, comment. Example:

// Skip all rows that start with a hash character
dsv.dsvFormat("|", { comment: "#" }).parseRows(...)

Fil commented 4 years ago

There are so many ways a CSV can be "improper" that I'm not sure we can plan for all of them in this module.

When I'm faced with this type of file, I usually filter out the comments in the row function. Here's an example that deals with two types of comments, at the bottom or at the top, including "continuation comments": https://observablehq.com/@fil/parse-csv-with-comments
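For readers who don't open the notebook, here is a minimal sketch of the row-function idea. It is not the notebook's code: `skipComments` is a hypothetical helper, and it assumes a comment is any row whose first field starts with the given prefix. It relies on the documented d3-dsv behavior that a row function returning null or undefined causes the row to be skipped.

```javascript
// Hypothetical helper (not part of d3-dsv): builds a row function for
// parseRows that drops rows whose first field starts with `prefix`.
// `convert` is an optional per-row conversion applied to the rows that are kept.
function skipComments(prefix, convert = (row) => row) {
  return (row, index) => {
    const first = row.length > 0 ? row[0] : "";
    // Returning null tells d3-dsv to skip this row.
    if (first.startsWith(prefix)) return null;
    return convert(row, index);
  };
}

// Stand-alone check of the row function, without the parser:
const rowFn = skipComments("#");
console.log(rowFn(["# generated by R"], 0)); // null
console.log(rowFn(["a", "b"], 1));           // [ 'a', 'b' ]
```

With d3-dsv this would be passed as the second argument, for example `dsvFormat("|").parseRows(text, skipComments("#"))`, where `text` stands for the file contents. Two caveats: the parser still tokenizes comment lines, so a comment containing an unbalanced double quote can swallow the lines after it, and `parse` (as opposed to `parseRows`) treats the first line as the header, so comments at the top of the file have to be removed before parsing.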

Another fun CSV manipulation technique can be found in Mike’s arctic sea ice volume notebook, where multiple spaces are replaced by a comma before applying dsv.parse.
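A rough sketch of that kind of preprocessing follows. It is not the notebook's actual code: `spacesToCsv` is a hypothetical helper, the sample data is made up, and it assumes columns are separated by runs of two or more spaces and that no field contains a comma.

```javascript
// Hypothetical helper: turns space-aligned text into CSV by trimming each
// line and replacing every run of two or more spaces with a single comma.
function spacesToCsv(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim().replace(/ {2,}/g, ","))
    .join("\n");
}

// Made-up sample input with aligned columns:
const aligned = "  year   volume\n  1979   25.4";
console.log(spacesToCsv(aligned));
// year,volume
// 1979,25.4
```

The result can then be handed to the regular CSV parser, for example `csvParse(spacesToCsv(text))`.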

tuner commented 4 years ago

Okay, thanks for the comment and examples! Exploiting the row function is a nice technique.
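The NA case could presumably be handled the same way. A minimal sketch, assuming the literal string "NA" is the only missing-value marker and that it should become null in every column; `naToNull` is a hypothetical helper, not part of d3-dsv:

```javascript
// Hypothetical helper: row function for parse() that maps the string "NA"
// (R's missing-value marker) to null in every column of a parsed row object.
function naToNull(row) {
  const out = {};
  for (const [key, value] of Object.entries(row)) {
    out[key] = value === "NA" ? null : value;
  }
  return out;
}

// Stand-alone check with a made-up row:
console.log(naToNull({ id: "1", height: "NA" })); // { id: '1', height: null }
```

It would be passed as the row function, for example `csvParse(text, naToNull)`.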

Anyway, my intention is to give the users of my application options for handling some common issues with CSV files, not an exhaustive solution. Other CSV parsers offer such options, but d3-dsv has superior performance. Perhaps I'll just maintain my own fork. Thanks anyway!