brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks source link

CSV writer #1237

Closed philrz closed 4 years ago

philrz commented 4 years ago

A community user asked:

With some non-Zeek NDJSON data sources we turn into ZNG and store, we also may want to write to a SQL db (JSON works but prefer CSV) to populate visualizations of the data in say like an Apache Superset (which I actually have plans to integrate for Zeek data, probably back out in Zeek JSON from ZNG).

alfred-landrum commented 4 years ago

I think the CSV writer should handle heterogenous records, though I assume that'd require writing to a temporary file to build the union of all the record types. That may be more than what the community user asked for, if their target was a sql db import.

mccanne commented 4 years ago

@alfred-landrum, agreed. I will write up a new issue for this. There are a couple ways to do it.

mccanne commented 4 years ago

We also need to change the search endpoint with csv output (as part of the larger refactoring there) so the front-end doesn't have to have its own csv writer and record unflattener in javascript.

philrz commented 4 years ago

Verified in zq commit 4bce00d.

Output as CSV currently works for any data that can be represented by a single "descriptor" (i.e. they all have the same record type). For instance, the contents of a single Zeek log:

$ zq -f csv stats.log.gz 
_path,ts,peer,mem,pkts_proc,bytes_recv,pkts_dropped,pkts_link,pkt_lag,events_proc,events_queued,active_tcp_conns,active_udp_conns,active_icmp_conns,tcp_conns,udp_conns,icmp_conns,timers,active_timers,files,active_files,dns_requests,active_dns_requests,reassem_tcp_size,reassem_file_size,reassem_frag_size,reassem_unknown_size
stats,2018-03-24T17:15:20.600725Z,zeek,74,26,29375,-,-,-,404,11,1,0,0,1,0,0,36,32,0,0,0,0,1528,0,0,0
stats,2018-03-24T17:20:20.60102Z,zeek,281,6435390,3052165142,-,-,-,2575313,2575317,5818,187,1070,338620,5126,9490,1301330,33086,48528,27,61,0,77448,53168,0,0
stats,2018-03-24T17:25:20.601054Z,zeek,282,4556854,1920350471,-,-,-,1958516,1958507,4401,110,1010,211123,4600,2354,1006374,27351,41930,10,5,0,90752,0,1672,0
stats,2018-03-24T17:30:20.601101Z,zeek,281,8112284,3883978461,-,-,-,1586683,1586685,2795,137,648,207754,4666,31554,928834,19429,29547,26,3,0,126712,395584,1504,0
stats,2018-03-24T17:35:20.601137Z,zeek,282,5467567,3398705931,-,-,-,1535999,1535998,4239,146,305,193639,4731,2510,879701,25895,35230,88,6,0,455128,0,0,0

It's then trivial to put this output into my paste buffer and, say, enter it into a Google Sheet and select Data > Split Text To Columns.

image

If the data trying to be output requires multiple descriptors, the user will receive an error message and the output will stop when the first record of a different type is encountered. For example, if you try to output two Zeek logs at once via CSV:

$ zq -f csv stats.log.gz weird.log.gz 
_path,ts,peer,mem,pkts_proc,bytes_recv,pkts_dropped,pkts_link,pkt_lag,events_proc,events_queued,active_tcp_conns,active_udp_conns,active_icmp_conns,tcp_conns,udp_conns,icmp_conns,timers,active_timers,files,active_files,dns_requests,active_dns_requests,reassem_tcp_size,reassem_file_size,reassem_frag_size,reassem_unknown_size
stats,2018-03-24T17:15:20.600725Z,zeek,74,26,29375,-,-,-,404,11,1,0,0,1,0,0,36,32,0,0,0,0,1528,0,0,0
csv output requires uniform records but different types encountered

However, we do have a plan to address this case as well, and that's tracked in #1271.

Finally, I noticed that csv is not yet listed in the -f output formats in the zq help text, so #1300 tracks getting that added.

Thanks @mccanne!