brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

dataframe processor (fuse) #1269

Closed mccanne closed 4 years ago

mccanne commented 4 years ago

zql should have a dataframe proc that turns a zng stream into a dataframe composed of a single type. This would entail creating an "uber schema" that is the union of all the record types and rewriting each value against that schema. This needs two passes: the uber schema is not known until every record has been seen, so values can only be rewritten after the full input has been read.
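To make the two passes concrete, here is a minimal sketch in Python rather than the actual zq implementation. It assumes records are flat dicts keyed by field name and ignores column types and nested records, which a real zng stream would have to handle; the function name `fuse` and the use of `None` for missing values are illustrative choices only.

```python
def fuse(records):
    """Two-pass fuse over a list of flat dicts (stand-ins for zng records)."""
    # Pass 1: build the "uber schema" as the union of all field names,
    # preserving the order in which each field is first seen.
    schema = {}
    for rec in records:
        for name in rec:
            schema.setdefault(name, None)
    columns = list(schema)

    # Pass 2: rewrite every record against the uber schema,
    # filling fields a record does not have with None.
    fused = [{col: rec.get(col) for col in columns} for rec in records]
    return columns, fused


if __name__ == "__main__":
    cols, rows = fuse([
        {"_path": "stats", "mem": 74},
        {"_path": "weird", "name": "truncated_header"},
    ])
    print(cols)
    for row in rows:
        print(row)
```

Because the column list is only complete after the first loop finishes, the rewrite cannot start until all input has been consumed.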

For an initial implementation, we can buffer all the records in memory and return an error if we hit a limit. This is what will be done here.
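The buffer-then-error behavior could look roughly like the sketch below. This is an assumption-laden illustration, not the zq code: the limit is counted in records rather than bytes, and `FuseLimitError`, `fuse_buffered`, and `max_records` are hypothetical names.

```python
class FuseLimitError(Exception):
    """Raised when the input exceeds the in-memory buffer limit (hypothetical)."""


def fuse_buffered(stream, max_records=1000):
    """Buffer the whole stream, then emit records rewritten to one schema.

    Assumes flat dict records; the limit is a record count, not a byte size.
    """
    buffered = []
    schema = {}
    for rec in stream:
        if len(buffered) >= max_records:
            # No spilling in this sketch: give up instead of going to disk.
            raise FuseLimitError(f"fuse buffer limit of {max_records} records exceeded")
        buffered.append(rec)
        for name in rec:
            schema.setdefault(name, None)

    columns = list(schema)
    for rec in buffered:
        yield {col: rec.get(col) for col in columns}


if __name__ == "__main__":
    try:
        for row in fuse_buffered(iter([{"a": 1}, {"b": 2}, {"c": 3}]), max_records=2):
            print(row)
    except FuseLimitError as err:
        print("error:", err)
```

Nothing is emitted until the input is exhausted, so the error surfaces before any output is produced.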

As a second step, we can add spilling when the memory limit is exceeded. This will be put in a separate issue.

philrz commented 4 years ago

Verified in zq commit 7f00817.

An easy way to see its benefit is to use the recently added CSV output format with diverse records.

$ zq -f csv stats.log.gz weird.log.gz 
_path,ts,peer,mem,pkts_proc,bytes_recv,pkts_dropped,pkts_link,pkt_lag,events_proc,events_queued,active_tcp_conns,active_udp_conns,active_icmp_conns,tcp_conns,udp_conns,icmp_conns,timers,active_timers,files,active_files,dns_requests,active_dns_requests,reassem_tcp_size,reassem_file_size,reassem_frag_size,reassem_unknown_size
stats,2018-03-24T17:15:20.600725Z,zeek,74,26,29375,-,-,-,404,11,1,0,0,1,0,0,36,32,0,0,0,0,1528,0,0,0
csv output requires uniform records but different types encountered

Adding the fuse processor creates a unifying schema, so all records can now be described by a single, wider CSV header.

$ zq -f csv "fuse" stats.log.gz weird.log.gz | head -10
_path,ts,peer,mem,pkts_proc,bytes_recv,pkts_dropped,pkts_link,pkt_lag,events_proc,events_queued,active_tcp_conns,active_udp_conns,active_icmp_conns,tcp_conns,udp_conns,icmp_conns,timers,active_timers,files,active_files,dns_requests,active_dns_requests,reassem_tcp_size,reassem_file_size,reassem_frag_size,reassem_unknown_size,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,name,addl,notice
stats,2018-03-24T17:15:20.600725Z,zeek,74,26,29375,-,-,-,404,11,1,0,0,1,0,0,36,32,0,0,0,0,1528,0,0,0,-,-,-,-,-,-,-,-
weird,2018-03-24T17:15:20.600843Z,zeek,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,C1zOivgBT6dBmknqk,10.47.1.152,49562,23.217.103.245,80,TCP_ack_underflow_or_misorder,-,F
weird,2018-03-24T17:15:20.608108Z,zeek,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,truncated_header,-,F
...
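The shape of that output, one wide header with `-` for fields a record lacks, can be imitated with Python's standard `csv` module. This is only a sketch of the idea under the assumption of flat dict records; `fused_csv` is a hypothetical helper and does not reflect how zq's CSV writer is implemented.

```python
import csv
import io


def fused_csv(records):
    """Render flat dict records as CSV under one unified header."""
    # Union of all field names, in first-seen order.
    schema = {}
    for rec in records:
        for name in rec:
            schema.setdefault(name, None)

    buf = io.StringIO()
    # restval supplies the placeholder for fields missing from a record.
    writer = csv.DictWriter(
        buf, fieldnames=list(schema), restval="-", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


if __name__ == "__main__":
    print(fused_csv([
        {"_path": "stats", "mem": 74},
        {"_path": "weird", "name": "truncated_header"},
    ]), end="")
```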

An addition to the ZQL processor docs is being tracked in #1324.

Thanks @mccanne!