Closed by mccanne 3 years ago
Verified in zq commit 1e501e85.
Circling back to the previous behavior, in the last GA zq release tagged v0.26.0, invoking -f csv with heterogeneous data halted the output. Using the zq-sample-data:
$ zq -version
Version: v0.26.0
$ zq -f csv pe.log.gz stats.log.gz
_path,ts,peer,mem,pkts_proc,bytes_recv,pkts_dropped,pkts_link,pkt_lag,events_proc,events_queued,active_tcp_conns,active_udp_conns,active_icmp_conns,tcp_conns,udp_conns,icmp_conns,timers,active_timers,files,active_files,dns_requests,active_dns_requests,reassem_tcp_size,reassem_file_size,reassem_frag_size,reassem_unknown_size
stats,2018-03-24T17:15:20.600725Z,zeek,74,26,29375,,,,404,11,1,0,0,1,0,0,36,32,0,0,0,0,1528,0,0,0
csv output requires uniform records but different types encountered
Now with the benefit of the enhancement, the output proceeds to completion.
$ zq -version
Version: v0.26.0-25-g1e501e85
$ zq -f csv pe.log.gz stats.log.gz
_path,ts,id,machine,compile_ts,os,subsystem,is_exe,is_64bit,uses_aslr,uses_dep,uses_code_integrity,uses_seh,has_import_table,has_export_table,has_cert_table,has_debug_data,section_names,peer,mem,pkts_proc,bytes_recv,pkts_dropped,pkts_link,pkt_lag,events_proc,events_queued,active_tcp_conns,active_udp_conns,active_icmp_conns,tcp_conns,udp_conns,icmp_conns,timers,active_timers,files,active_files,dns_requests,active_dns_requests,reassem_tcp_size,reassem_file_size,reassem_frag_size,reassem_unknown_size
stats,2018-03-24T17:15:20.600725Z,,,,,,,,,,,,,,,,,zeek,74,26,29375,,,,404,11,1,0,0,1,0,0,36,32,0,0,0,0,1528,0,0,0
pe,2018-03-24T17:15:54.475076Z,FC6cOXTjuh6OdYwu5,I386,2010-07-12T21:46:18Z,Windows 95 or NT 4.0,WINDOWS_GUI,T,F,F,F,F,T,T,F,F,F,".text,.data,.rdata,.bss,.idata",,,,,,,,,,,,,,,,,,,,,,,,,
pe,2018-03-24T17:19:37.127059Z,FBRCYv3eG0d8TEWHy9,AMD64,2011-02-10T08:03:04Z,Windows 7 or Server 2008 R2,WINDOWS_GUI,T,T,T,T,F,T,T,T,T,T,".text,.data,.pdata,.rsrc,.reloc",,,,,,,,,,,,,,,,,,,,,,,,,
pe,2018-03-24T17:19:37.068955Z,FD0dMoWO5DNexKwNb,AMD64,2010-11-20T09:45:06Z,Windows 7 or Server 2008 R2,WINDOWS_GUI,T,T,T,T,F,T,T,F,T,T,".text,.data,.pdata,.rsrc,.reloc",,,,,,,,,,,,,,,,,,,,,,,,,
pe,2018-03-24T17:19:36.94092Z,F4ehaa3sa8zKtCWcF9,AMD64,2010-11-20T09:45:00Z,Windows 7 or Server 2008 R2,WINDOWS_GUI,T,T,T,F,F,T,T,F,T,T,".text,.data,.pdata,.rsrc,.reloc",,,,,,,,,,,,,,,,,,,,,,,,,
...
stats,2018-03-24T17:35:20.601137Z,,,,,,,,,,,,,,,,,zeek,282,5467567,3398705931,,,,1535999,1535998,4239,146,305,193639,4731,2510,879701,25895,35230,88,6,0,455128,0,0,0
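To illustrate why the fused output above has blank cells, here is a minimal Python sketch of a two-pass CSV writer. It is not zq's implementation: the function name write_fused_csv is hypothetical, records are modeled as plain dicts, and columns are ordered by first appearance, which is an assumption and may differ from how zq orders fused columns. The sample values are a small subset of the fields shown in the transcript.

```python
import csv
import io


def write_fused_csv(records, out):
    """Hypothetical two-pass writer: pass 1 builds the union of field
    names, pass 2 writes every record against that single header."""
    records = list(records)  # buffer the input so a second pass is possible

    # Pass 1: collect column names in first-seen order (an assumption).
    columns = []
    seen = set()
    for rec in records:
        for name in rec:
            if name not in seen:
                seen.add(name)
                columns.append(name)

    # Pass 2: write rows, leaving cells empty for fields a record lacks.
    writer = csv.DictWriter(out, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)
    return columns


stats = {"_path": "stats", "ts": "2018-03-24T17:15:20.600725Z",
         "peer": "zeek", "mem": 74}
pe = {"_path": "pe", "ts": "2018-03-24T17:15:54.475076Z",
      "id": "FC6cOXTjuh6OdYwu5", "machine": "I386"}

buf = io.StringIO()
write_fused_csv([stats, pe], buf)
print(buf.getvalue(), end="")
```

Because the header cannot be written until every record has been seen, nothing is emitted until the first pass finishes, which is the cost the -csvfuse=false option is meant to avoid.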
And if the user is confident their data conforms to a single record definition and hence wants to avoid the two passes through the data needed to guarantee a fuse'd schema, they can invoke the -csvfuse=false option, which in this case once again halts the output as we saw before.
$ zq -f csv -csvfuse=false pe.log.gz stats.log.gz
_path,ts,peer,mem,pkts_proc,bytes_recv,pkts_dropped,pkts_link,pkt_lag,events_proc,events_queued,active_tcp_conns,active_udp_conns,active_icmp_conns,tcp_conns,udp_conns,icmp_conns,timers,active_timers,files,active_files,dns_requests,active_dns_requests,reassem_tcp_size,reassem_file_size,reassem_frag_size,reassem_unknown_size
stats,2018-03-24T17:15:20.600725Z,zeek,74,26,29375,,,,404,11,1,0,0,1,0,0,36,32,0,0,0,0,1528,0,0,0
csv output requires uniform records but different types encountered
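The single-pass behavior shown in this transcript can be sketched in Python as follows. This is an illustration under stated assumptions, not zq's code: write_strict_csv and NonUniformRecords are hypothetical names, records are modeled as dicts, and "same record type" is approximated as "same field names in the same order". Only the error text is taken from the output above.

```python
import csv
import io


class NonUniformRecords(Exception):
    """Hypothetical error raised when a record does not match the header."""


def write_strict_csv(records, out):
    """Hypothetical single-pass writer: the first record fixes the header,
    and output stops at the first record with a different shape."""
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for rec in records:
        names = list(rec)
        if header is None:
            # The header is emitted immediately, so output streams at once.
            header = names
            writer.writerow(header)
        elif names != header:
            raise NonUniformRecords(
                "csv output requires uniform records "
                "but different types encountered")
        writer.writerow([rec[name] for name in header])


stats = {"_path": "stats", "ts": "2018-03-24T17:15:20.600725Z",
         "peer": "zeek", "mem": 74}
pe = {"_path": "pe", "ts": "2018-03-24T17:15:54.475076Z",
      "id": "FC6cOXTjuh6OdYwu5", "machine": "I386"}

buf = io.StringIO()
try:
    write_strict_csv(iter([stats, pe]), buf)
except NonUniformRecords as err:
    message = str(err)
print(buf.getvalue(), end="")
print(message)
```

The input is consumed as an iterator and never buffered, so rows appear as soon as they are read, at the price of a partial result when the data turns out to be heterogeneous.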
Thanks @nwt!
The CSV writer should be able to write output zng data that comes from different record types. Including the new fuse processor at the end of the ZQL pipeline ensures this is possible today. However, fuse requires making two passes through the data, which has a performance cost and delays the immediate stream of output. Power users that are confident their data already conforms to a single record definition may want to avoid this penalty.
As a group we discussed adding a flag to determine this behavior when CSV output format is requested. In one mode it would always implicitly add fuse to the pipeline even if the user didn't request it, ensuring successful CSV output no matter what. There was consensus that this behavior would be invoked by the Brim app for CSV export. The other mode would follow the current behavior where only a single pass is made through the data and output stops as soon as a record is encountered in the stream that doesn't match the schema for the header already printed, at which point the user would see a message that effectively tells them to rework their query or explicitly add fuse. There still seemed to be some room for debate on whether zq at the command line should also default to the "always fuse" behavior planned for the Brim app or if the zq default should flip to this more "power user" mode.