Closed alfred-landrum closed 4 years ago
GitHub is doing its odd thing where it's not letting me pick certain PRs from the "Linked pull requests" drop-down, so I'll just note in this comment that in addition to the two already linked, #782 also applies.
Here's a record of the before/after from this effort. We had a target to turn 3 GB of snappy-compressed Parquet files into their ZNG equivalent in under 5 minutes. We had what we believed to be a representative 2.2 MB snappy-compressed Parquet file that had begun life as 60k Zeek DNS events that had been sent through a NiFi dataflow to transform the events and apply a particular schema as the records were output as Parquet.
To test, 1390 copies of the 2.2 MB file were made, then batches of them were run through parallel counts of `zq -i parquet $file > /dev/null` on a 16-core Google Cloud VM.
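The harness can be sketched as a small shell script. This is a reconstruction, not the actual script used: the sample file name, the `testdata/` layout, and the `ZQ_CMD` stand-in are my assumptions, while the counts (1390 copies, parallel batches) come from the test described above.

```shell
#!/bin/sh
# Sketch of the parallel benchmark harness. Assumptions: SAMPLE path,
# testdata/ layout, and the ZQ_CMD stand-in; the counts come from the test
# described in the comment (1390 copies of a 2.2 MB file ~= 3 GB).
set -eu

SAMPLE="${SAMPLE:-sample.parquet}"   # the 2.2 MB snappy-compressed sample
COPIES="${COPIES:-1390}"             # 1390 x 2.2 MB ~= 3 GB total
PARALLEL="${PARALLEL:-10}"           # parallel zq process count
ZQ_CMD="${ZQ_CMD:-cat}"              # stand-in; real run: ZQ_CMD='zq -i parquet'

# Use a placeholder sample when the real Parquet file isn't on hand.
[ -f "$SAMPLE" ] || printf 'placeholder' > "$SAMPLE"

# Lay out the copies.
rm -rf testdata && mkdir -p testdata
i=1
while [ "$i" -le "$COPIES" ]; do
  cp "$SAMPLE" "testdata/copy-$i.parquet"
  i=$((i + 1))
done

# Fan the copies out across $PARALLEL workers, discarding output.
# (Prefix the pipeline with `time` to measure the whole batch.)
ls testdata/*.parquet | xargs -P "$PARALLEL" -I {} \
  sh -c "$ZQ_CMD {} > /dev/null"

echo "processed $COPIES files with $PARALLEL workers"
```

With the defaults above the script exercises only the fan-out mechanics; setting `ZQ_CMD='zq -i parquet'` and pointing `SAMPLE` at the real file reproduces the actual measurement.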
As the following chart shows, with 10 parallel `zq` processes, performance originally plateaued at just over 500 seconds to make it through the whole set of files (and we'd have wanted to get below 300).
At that point the host was basically maxed out. `top` was showing <5% idle CPU, with all the `zq` processes consuming equally. There was no I/O wait registered (i.e. processes seemed able to read from the SSD without complaint).
After the iterations in the linked PRs, eventually landing at `zq` commit 51553cf, performance was significantly improved.
To cherry-pick a particular data point: with only 6 parallel `zq` processes, we're now making it through the 3 GB of Parquet files in 78 seconds. This throughput was also achieved while consuming significantly less CPU, with only 36% of the total host CPU in use.
We should investigate our Parquet reader and the parquet-go library to see what optimizations we could apply, both for Parquet generally and for S3 access.