A community user recently bumped into CSV data that had duplicate columns. Their specific data source was a CSV export from Jira, and looking at some Jira CSV data I have on hand I can indeed see this for myself (e.g., multiple appearances of both "Labels" and "Attachments"), but there's surely lots of similar messy data lurking out there.
There's surely multiple ways to cope with this. Looking at a mature CSV reader with lots of knobs & users, they deal with it by appending a numeric suffix to make each additional named column unique (https://github.com/pandas-dev/pandas/pull/12935).
$ cat read_csv.py
#!/usr/bin/env python3
import pandas as pd
df = pd.read_csv('dupes.csv')
print(df.to_string(index=False))
$ ./read_csv.py
a a.1 a.2 b
1 2 3 4
2 3 4 5
We've taken this approach in Zed at times in the past to deal with duplicates (appending _1 and _2 in our case), e.g., fuse at one time before it made better use of union types, and join still has some remnants of this (#4760). However, we've generally been getting away from that wherever possible.
If I could put my own proposal on the table, I'd propose that we keep the single column name with the primitive values from the many columns of the same name as elements in an array, e.g.:
{a:[1.,2.,3.],b:4.}
{a:[2.,3.,4.],b:5.}
Regardless of which approach we use, I'd also advocate emitting a warning of which duplicate column names were detected so that way the user knows which ones might need attention, e.g., if we did arrays and they wanted to write some Zed to break the elements back out into separate fields.
Repro is with Zed commit 05fdd90.
A community user recently bumped into CSV data that had duplicate columns. Their specific data source was a CSV export from Jira, and looking at some Jira CSV data I have on hand I can indeed see this for myself (e.g., multiple appearances of both "Labels" and "Attachments"), but there's surely lots of similar messy data lurking out there.
A simplified repro:
There's surely multiple ways to cope with this. Looking at a mature CSV reader with lots of knobs & users, they deal with it by appending a numeric suffix to make each additional named column unique (https://github.com/pandas-dev/pandas/pull/12935).
We've taken this approach in Zed at times in the past to deal with duplicates (appending
_1
and_2
in our case), e.g.,fuse
at one time before it made better use of union types, andjoin
still has some remnants of this (#4760). However, we've generally been getting away from that wherever possible.If I could put my own proposal on the table, I'd propose that we keep the single column name with the primitive values from the many columns of the same name as elements in an array, e.g.:
Regardless of which approach we use, I'd also advocate emitting a warning of which duplicate column names were detected so that way the user knows which ones might need attention, e.g., if we did arrays and they wanted to write some Zed to break the elements back out into separate fields.