brimdata / zed

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Reading CSV files that have duplicate columns #5090

Open philrz opened 3 months ago

philrz commented 3 months ago

Repro is with Zed commit 05fdd90.

A community user recently bumped into CSV data that had duplicate columns. Their specific data source was a CSV export from Jira, and in some Jira CSV data I have on hand I can indeed see this for myself (e.g., multiple appearances of both "Labels" and "Attachments"), but there's surely plenty of similar messy data lurking out there.

A simplified repro:

$ cat dupes.csv
a,a,a,b
1,2,3,4
2,3,4,5

$ zq -version
Version: v1.14.0-28-g05fdd90a

$ zq -i csv dupes.csv
dupes.csv: duplicate field: "a"

There are surely multiple ways to cope with this. One mature CSV reader with lots of knobs & users, pandas, deals with it by appending a numeric suffix to make each repeated column name unique (https://github.com/pandas-dev/pandas/pull/12935).

$ cat read_csv.py 
#!/usr/bin/env python3
import pandas as pd
df = pd.read_csv('dupes.csv')
print(df.to_string(index=False))

$ ./read_csv.py 
 a  a.1  a.2  b
 1    2    3  4
 2    3    4  5
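To make the pandas behavior concrete, here's a minimal sketch of that suffix-renaming strategy using only the Python standard library. The helper name `dedupe_columns` is hypothetical, chosen just for this illustration; it is not part of pandas or Zed.

```python
#!/usr/bin/env python3
# Hypothetical sketch of pandas-style duplicate-column mangling:
# repeated names get ".1", ".2", ... appended so each becomes unique.
import csv
import io

def dedupe_columns(names):
    seen = {}
    out = []
    for name in names:
        if name in seen:
            seen[name] += 1
            out.append(f"{name}.{seen[name]}")
        else:
            seen[name] = 0
            out.append(name)
    return out

data = "a,a,a,b\n1,2,3,4\n2,3,4,5\n"
rows = list(csv.reader(io.StringIO(data)))
header = dedupe_columns(rows[0])
records = [dict(zip(header, row)) for row in rows[1:]]
print(header)   # ['a', 'a.1', 'a.2', 'b']
```

Note that pandas' real implementation also guards against collisions with pre-existing columns like a literal "a.1"; this sketch skips that corner case.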

We've taken this approach in Zed at times in the past to deal with duplicates (appending _1 and _2 in our case): fuse did so at one time before it made better use of union types, and join still has some remnants of this (#4760). However, we've generally been moving away from that wherever possible.

To put my own proposal on the table: keep the single column name and gather the primitive values from the same-named columns into an array, e.g.:

{a:[1.,2.,3.],b:4.}
{a:[2.,3.,4.],b:5.}
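A rough illustration of that proposed behavior, again in plain Python rather than in Zed's actual reader. The function name `read_csv_merging_dupes` and the float conversion (mirroring how zq types these CSV values as float64 above) are assumptions for the sake of the example.

```python
#!/usr/bin/env python3
# Hypothetical sketch of the array proposal: values from same-named
# columns are gathered into a list under the single shared name,
# while unique columns stay scalar.
import csv
import io

def read_csv_merging_dupes(text):
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    records = []
    for row in body:
        rec = {}
        for name, value in zip(header, row):
            v = float(value)  # mimic zq's float64 typing of CSV numbers
            if name in rec:
                # Second occurrence: promote the scalar to a list.
                if not isinstance(rec[name], list):
                    rec[name] = [rec[name]]
                rec[name].append(v)
            else:
                rec[name] = v
        records.append(rec)
    return records

data = "a,a,a,b\n1,2,3,4\n2,3,4,5\n"
for rec in read_csv_merging_dupes(data):
    print(rec)
# {'a': [1.0, 2.0, 3.0], 'b': 4.0}
# {'a': [2.0, 3.0, 4.0], 'b': 5.0}
```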

Regardless of which approach we take, I'd also advocate emitting a warning listing the duplicate column names that were detected, so the user knows which ones might need attention, e.g., if we went with arrays and they wanted to write some Zed to break the elements back out into separate fields.
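That downstream step could be as simple as the following sketch (shown in Python for illustration; a user would presumably write the equivalent as a Zed record expression). The helper name `explode_arrays` and the `_0`, `_1` suffix convention are assumptions, not anything Zed does today.

```python
#!/usr/bin/env python3
# Hypothetical sketch: re-expand array-valued fields into separately
# named scalar fields with positional suffixes.
def explode_arrays(rec):
    out = {}
    for name, value in rec.items():
        if isinstance(value, list):
            for i, v in enumerate(value):
                out[f"{name}_{i}"] = v
        else:
            out[name] = value
    return out

print(explode_arrays({"a": [1.0, 2.0, 3.0], "b": 4.0}))
# {'a_0': 1.0, 'a_1': 2.0, 'a_2': 3.0, 'b': 4.0}
```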