brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Invoke fuse to ensure output of heterogeneous csv #1271

Closed: mccanne closed this issue 3 years ago

mccanne commented 4 years ago

The CSV writer should be able to write zng data whose records are of different types. Adding the new fuse processor to the end of the ZQL pipeline makes this possible today. However, fuse requires two passes through the data, which costs performance and delays the first line of output. Power users who are confident their data already conforms to a single record definition may want to avoid this penalty.
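To make the two-pass cost concrete, here is a minimal Python sketch of the idea. It is not zq's implementation: it assumes records are plain dicts, `fuse_to_csv` is a hypothetical helper name, and the first-seen column order it produces is an arbitrary choice that need not match the order zq emits. The first pass collects the union of field names; the second writes every record against that unified header, leaving missing fields empty.

```python
import csv
import io


def fuse_to_csv(records):
    """Write heterogeneous dict records as one CSV table (illustrative only)."""
    # Buffering the whole input is what delays the first line of output.
    records = list(records)

    # Pass 1: build the unified header, keeping first-seen field order.
    header = []
    seen = set()
    for rec in records:
        for name in rec:
            if name not in seen:
                seen.add(name)
                header.append(name)

    # Pass 2: emit each record against the unified header;
    # fields a record lacks are written as empty strings.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, restval="", lineterminator="\n")
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)
    return out.getvalue()


rows = [
    {"_path": "stats", "ts": "2018-03-24T17:15:20.600725Z", "peer": "zeek", "mem": 74},
    {"_path": "pe", "ts": "2018-03-24T17:15:54.475076Z", "machine": "I386"},
]
print(fuse_to_csv(rows), end="")
```

Nothing can be printed until the whole input has been read, because the header is not known before then. That is the streaming delay described above.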

As a group we discussed adding a flag to control this behavior when CSV output is requested. In one mode, fuse would always be implicitly added to the pipeline even if the user didn't request it, ensuring successful CSV output no matter what. There was consensus that the Brim app would use this mode for CSV export. The other mode would follow the current behavior: only a single pass is made through the data, and output stops as soon as a record is encountered that doesn't match the schema of the header already printed, at which point the user would see a message that effectively tells them to rework their query or explicitly add fuse. There still seemed to be some room for debate on whether zq at the command line should also default to the "always fuse" behavior planned for the Brim app or whether the zq default should instead be this more "power user" mode.
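The single-pass "power user" mode can be sketched in Python as follows. This is an illustration under assumptions, not zq's code: records are plain dicts, and `write_csv_strict` and `NonUniformRecordError` are hypothetical names. The header is fixed by the first record, rows stream out immediately, and the writer stops at the first record whose fields differ.

```python
import csv
import io
import sys


class NonUniformRecordError(Exception):
    """Raised when a record's fields differ from the header already written."""


def write_csv_strict(records, out=sys.stdout):
    """Stream records as CSV in one pass; stop at the first schema mismatch."""
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for rec in records:
        fields = list(rec)
        if header is None:
            # The first record decides the header, which is written right away.
            header = fields
            writer.writerow(header)
        elif fields != header:
            raise NonUniformRecordError(
                "csv output requires uniform records but different types encountered"
            )
        writer.writerow([rec[name] for name in header])


buf = io.StringIO()
try:
    write_csv_strict([{"a": 1, "b": 2}, {"a": 3, "c": 4}], out=buf)
except NonUniformRecordError as err:
    print(buf.getvalue(), end="")  # rows written before the mismatch
    print(err)
```

The trade-off is visible in the demo: output begins with the first record, but anything after the mismatching record is lost unless the user reworks the query or adds fuse explicitly.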

philrz commented 3 years ago

Verified in zq commit 1e501e85.

Circling back to the previous behavior, in the last GA zq release tagged v0.26.0, invoking -f csv with heterogeneous data halted the output. Using the zq-sample-data:

$ zq -version
Version: v0.26.0

$ zq -f csv pe.log.gz stats.log.gz 
_path,ts,peer,mem,pkts_proc,bytes_recv,pkts_dropped,pkts_link,pkt_lag,events_proc,events_queued,active_tcp_conns,active_udp_conns,active_icmp_conns,tcp_conns,udp_conns,icmp_conns,timers,active_timers,files,active_files,dns_requests,active_dns_requests,reassem_tcp_size,reassem_file_size,reassem_frag_size,reassem_unknown_size
stats,2018-03-24T17:15:20.600725Z,zeek,74,26,29375,,,,404,11,1,0,0,1,0,0,36,32,0,0,0,0,1528,0,0,0
csv output requires uniform records but different types encountered

Now with the benefit of the enhancement, the output proceeds to completion.

$ zq -version
Version: v0.26.0-25-g1e501e85

$ zq -f csv pe.log.gz stats.log.gz 
_path,ts,id,machine,compile_ts,os,subsystem,is_exe,is_64bit,uses_aslr,uses_dep,uses_code_integrity,uses_seh,has_import_table,has_export_table,has_cert_table,has_debug_data,section_names,peer,mem,pkts_proc,bytes_recv,pkts_dropped,pkts_link,pkt_lag,events_proc,events_queued,active_tcp_conns,active_udp_conns,active_icmp_conns,tcp_conns,udp_conns,icmp_conns,timers,active_timers,files,active_files,dns_requests,active_dns_requests,reassem_tcp_size,reassem_file_size,reassem_frag_size,reassem_unknown_size
stats,2018-03-24T17:15:20.600725Z,,,,,,,,,,,,,,,,,zeek,74,26,29375,,,,404,11,1,0,0,1,0,0,36,32,0,0,0,0,1528,0,0,0
pe,2018-03-24T17:15:54.475076Z,FC6cOXTjuh6OdYwu5,I386,2010-07-12T21:46:18Z,Windows 95 or NT 4.0,WINDOWS_GUI,T,F,F,F,F,T,T,F,F,F,".text,.data,.rdata,.bss,.idata",,,,,,,,,,,,,,,,,,,,,,,,,
pe,2018-03-24T17:19:37.127059Z,FBRCYv3eG0d8TEWHy9,AMD64,2011-02-10T08:03:04Z,Windows 7 or Server 2008 R2,WINDOWS_GUI,T,T,T,T,F,T,T,T,T,T,".text,.data,.pdata,.rsrc,.reloc",,,,,,,,,,,,,,,,,,,,,,,,,
pe,2018-03-24T17:19:37.068955Z,FD0dMoWO5DNexKwNb,AMD64,2010-11-20T09:45:06Z,Windows 7 or Server 2008 R2,WINDOWS_GUI,T,T,T,T,F,T,T,F,T,T,".text,.data,.pdata,.rsrc,.reloc",,,,,,,,,,,,,,,,,,,,,,,,,
pe,2018-03-24T17:19:36.94092Z,F4ehaa3sa8zKtCWcF9,AMD64,2010-11-20T09:45:00Z,Windows 7 or Server 2008 R2,WINDOWS_GUI,T,T,T,F,F,T,T,F,T,T,".text,.data,.pdata,.rsrc,.reloc",,,,,,,,,,,,,,,,,,,,,,,,,
...
stats,2018-03-24T17:35:20.601137Z,,,,,,,,,,,,,,,,,zeek,282,5467567,3398705931,,,,1535999,1535998,4239,146,305,193639,4731,2510,879701,25895,35230,88,6,0,455128,0,0,0

And if users are confident their data conforms to a single record definition and want to avoid the two passes through the data needed to guarantee a fused schema, they can invoke the -csvfuse=false option, which in this case once again halts the output as we saw before.

$ zq -f csv -csvfuse=false pe.log.gz stats.log.gz 
_path,ts,peer,mem,pkts_proc,bytes_recv,pkts_dropped,pkts_link,pkt_lag,events_proc,events_queued,active_tcp_conns,active_udp_conns,active_icmp_conns,tcp_conns,udp_conns,icmp_conns,timers,active_timers,files,active_files,dns_requests,active_dns_requests,reassem_tcp_size,reassem_file_size,reassem_frag_size,reassem_unknown_size
stats,2018-03-24T17:15:20.600725Z,zeek,74,26,29375,,,,404,11,1,0,0,1,0,0,36,32,0,0,0,0,1528,0,0,0
csv output requires uniform records but different types encountered

Thanks @nwt!