brimdata / super

An analytics database that puts JSON and relational tables on equal footing
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.4k stars 67 forks source link

Auto-detect sees CSV input as a ZSON syntax error #4592

Open philrz opened 1 year ago

philrz commented 1 year ago

Repro is with Zed commit a2626a7. This issue was found with test data from a recent @jameskerr draft blog post.

Using the attached test data mint.csv, an attempt to read this data with zq via auto-detect fails.

$ zq -version
Version: v1.7.0-66-ga2626a7c

$ zq -z mint.csv 
mint.csv: ZSON syntax error

However, if I set the input format as explicitly CSV, it works fine.

$ zq -i csv -z mint.csv 
{Date:"5/12/2023",Description:"LYFT   *TEMP AUTH HOLD","Original Description":"LYFT   *TEMP AUTH HOLD",Amount:5.,"Transaction Type":"debit",Category:"Ride Share","Account Name":"CREDIT CARD",Labels:null,Notes:null}
{Date:"5/11/2023",Description:"TST* Automat","Original Description":"TST* Automat",Amount:5.22,"Transaction Type":"debit",Category:"Restaurants","Account Name":"CREDIT CARD",Labels:null,Notes:null}
{Date:"5/10/2023",Description:"Spotify USA","Original Description":"Spotify USA",Amount:12.99,"Transaction Type":"debit",Category:"Music","Account Name":"CREDIT CARD",Labels:null,Notes:null}

Can we improve on this? As a user, my expectation is that if the auto-detect fails on a particular format, it would try the others in hopes one would succeed. Since CSV was indeed successful, I was surprised at having to specify -i.

philrz commented 1 year ago

@nwt explained that this happened because the auto-detect currently locks in ZSON as the detected format once it's able to successfully parse a ZSON value. Because the lead-off text of this CSV file is "Date" that's enough to get locked into ZSON, then the , that follows causes the syntax error. Therefore we might consider revisiting the auto-detect logic there.

As long as there's some potential for this kind of exposure, @mccanne also pointed out that the error message could stand to be more verbose here. If the user believes their data to be CSV but was relying on auto-detect out of convenience, being confronted with a ZSON error is potentially confusing. Therefore, it could instead be something that expresses "the input was auto-detected and believed to be ZSON, but then there was a syntax error, so consider specifying -i with an explicit format".