d3 / d3-dsv

A parser and formatter for delimiter-separated values, such as CSV and TSV.
https://d3js.org/d3-dsv
ISC License

Repeated column names erase each other for xParse #72

Closed mcnuttandrew closed 4 years ago

mcnuttandrew commented 4 years ago

There is a small ambiguity in the way that tsvParse and csvParse handle files whose columns have non-unique names. For instance, if you have a TSV like

Example A   Example B   Example A
1   5   0
2   5   0
3   5   0
4   5   0

And you run that through tsvParse, then you get

[
  { 'Example A': '0', 'Example B': '5' },
  { 'Example A': '0', 'Example B': '5' },
  { 'Example A': '0', 'Example B': '5' },
  { 'Example A': '0', 'Example B': '5' },
  columns: [ 'Example A', 'Example B', 'Example A' ]
]
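The overwrite can be reproduced without the library. The sketch below is a simplified stand-in for header-keyed row parsing, not d3-dsv's actual source; `naiveTsvParse` is a hypothetical name, and it assumes tab-separated input with no quoted fields.

```javascript
// Simplified stand-in for header-keyed parsing (NOT d3-dsv's real code).
// Assumes tab-separated text, "\n" line endings, and no quoted fields.
function naiveTsvParse(text) {
  const [header, ...lines] = text.trim().split("\n");
  const columns = header.split("\t");
  const rows = lines.map((line) => {
    const values = line.split("\t");
    const row = {};
    columns.forEach((name, i) => {
      // A repeated column name reuses the same key,
      // so the later value replaces the earlier one.
      row[name] = values[i];
    });
    return row;
  });
  rows.columns = columns; // the header list keeps every name, duplicates included
  return rows;
}

const tsv = "Example A\tExample B\tExample A\n1\t5\t0\n2\t5\t0";
const parsed = naiveTsvParse(tsv);
console.log(parsed[0]);      // only two keys survive; 'Example A' holds '0'
console.log(parsed.columns); // all three header names are still listed
```

Because each row is a plain object, two columns with the same name cannot coexist as keys; only the `columns` array preserves the fact that the name appeared twice.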

The problem, of course, is that the data from the first Example A column is blown away during the parse. I'm not sure what the right solution might be: maybe including a note in the docs that column names need to be unique? Or maybe appending an incrementing index to the duplicated columns ('Example A-1' or something). Having recently been bitten by this, I can say it is a real hair-pulling issue to find and resolve, so any help that might be offered to other people in a similar situation would no doubt be welcomed.
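One way the "incrementing index" idea could look is sketched below. `dedupeColumns` is a hypothetical helper, and the `-1`, `-2` suffix format is only the one floated in this comment; it is not necessarily what the library ended up doing.

```javascript
// Hypothetical helper: make column names unique by appending "-1", "-2", ...
// to repeats. The suffix format is an assumption taken from the comment above.
function dedupeColumns(columns) {
  const seen = new Set();
  return columns.map((name) => {
    let candidate = name;
    let n = 0;
    // Keep incrementing until the candidate collides with nothing emitted so far.
    while (seen.has(candidate)) {
      n += 1;
      candidate = `${name}-${n}`;
    }
    seen.add(candidate);
    return candidate;
  });
}

const unique = dedupeColumns(["Example A", "Example B", "Example A"]);
console.log(unique); // [ 'Example A', 'Example B', 'Example A-1' ]
```

The first occurrence keeps its original name, so code that reads a column which was never duplicated is unaffected. Code that relied on the last duplicate winning would see different values, which is why the change is a breaking one.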

Fil commented 4 years ago

I think it's a good idea, and a possible implementation is given in https://github.com/d3/d3-dsv/pull/73

Note however that it would be a breaking change (people who already have some code running and this type of data expect it to continue working).

mcnuttandrew commented 4 years ago

I like your solution, but I don't know if it's worth issuing a breaking change. I think just including a note in the documentation would probably get most people through the hurdle of identifying this error.

Fil commented 4 years ago

I don't know… The thing is that, when the data has this shape (and when you don't control it), it's currently quite difficult to manipulate: you have to load it as text, then fiddle with the first line, then dsv.parse… I've had to do this literally last week. (Plus, we're going to issue a major version soon, so having a breaking change is not that problematic.)
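The manual workaround described here (load as text, fix the first line, then parse) might look roughly like the following sketch. `rewriteHeader` is a hypothetical helper; it assumes the header contains no quoted fields with embedded delimiters or newlines, and the suffix format is again only illustrative.

```javascript
// Hypothetical workaround: rename repeated names in the header line of raw
// delimited text before handing the text to a parser.
// Assumes the header line has no quoted fields containing the delimiter.
function rewriteHeader(text, delimiter) {
  const lineBreak = /\r?\n/.exec(text);
  const end = lineBreak ? lineBreak.index : text.length;
  const header = text.slice(0, end);
  const rest = text.slice(end); // untouched body, including its line breaks

  const counts = new Map();
  const names = header.split(delimiter).map((name) => {
    const n = counts.get(name) || 0;
    counts.set(name, n + 1);
    // The first occurrence is kept as-is; later ones get "-1", "-2", ...
    return n === 0 ? name : `${name}-${n}`;
  });
  return names.join(delimiter) + rest;
}

const raw = "Example A\tExample B\tExample A\n1\t5\t0\n2\t5\t0\n";
const fixed = rewriteHeader(raw, "\t");
console.log(fixed.split("\n")[0]); // header now has three distinct names
```

The rewritten text can then be passed to the parser as usual (for example `tsvParse(fixed)`), and each row keeps all three values.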

mcnuttandrew commented 4 years ago

Oh, I didn't know a major version was coming! This seems like a great approach, then.

Fil commented 4 years ago

I ♻️ my code into a notebook (and added "empty names" as well) https://observablehq.com/@fil/csv-duplicate-names

Fil commented 4 years ago

Fixed in 8ab1ab86899338c93b3aa07c21f5a63e1c73f37d; thank you!