brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/

parquet reader profiling and optimization #796

Closed alfred-landrum closed 4 years ago

alfred-landrum commented 4 years ago

We should investigate our Parquet reader and the parquet-go library to see what optimizations we could apply, both for Parquet reading generally and for S3 access.

philrz commented 4 years ago

GitHub is doing its odd thing where it's not letting me pick certain PRs from the "Linked pull requests" drop-down, so I'll just note in this comment that in addition to the two already linked, #782 also applies.

philrz commented 4 years ago

Here's a record of the before/after from this effort. We had a target to turn 3 GB of snappy-compressed Parquet files into their ZNG equivalent in under 5 minutes. We had what we believed to be a representative 2.2 MB snappy-compressed Parquet file that had begun life as 60k Zeek DNS events, which had been sent through a NiFi dataflow to transform the events and apply a particular schema as the records were output as Parquet.

To test, 1390 copies of the 2.2 MB file were made (roughly 3 GB in total), then the set was run through varying counts of parallel `zq -i parquet $file > /dev/null` processes on a 16-core Google Cloud VM.
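
For concreteness, here is a minimal sketch of that kind of harness. The sample file name, directory layout, and default parallelism are assumptions for illustration, not details from the actual test scripts.

```bash
#!/bin/bash
# Sketch of the benchmark described above: replicate the 2.2 MB sample
# 1390 times (roughly 3 GB total), then time N parallel zq conversions.
set -e

SAMPLE=dns.parquet        # the 2.2 MB snappy-compressed sample (assumed name)
COPIES=1390
PARALLEL=${1:-10}         # number of concurrent zq processes

mkdir -p copies
for i in $(seq 1 "$COPIES"); do
  cp "$SAMPLE" "copies/dns-$i.parquet"
done

# Convert every copy, $PARALLEL files at a time, discarding the ZNG output,
# and report the total elapsed time for the whole set.
time ls copies/*.parquet | xargs -P "$PARALLEL" -I{} zq -i parquet {} > /dev/null
```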

As the following chart shows, with 10 parallel zq processes, performance originally plateaued at only slightly better than 500 seconds to get through the whole set of files (and we'd have wanted to get below 300).

[chart: benchmark results before optimization]

At that point the host was basically maxed out: top showed less than 5% idle CPU, with all the zq processes consuming equal shares. There was no I/O wait registered (i.e., the processes seemed able to read from the SSD without delay).
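
For anyone reproducing this, a couple of standard Linux commands (a sketch, not taken from the original test notes) give the same picture:

```bash
# Sample CPU utilization once a second for 5 seconds; %idle near zero and
# %iowait near zero together suggest the zq processes are CPU-bound rather
# than waiting on the SSD. (mpstat comes from the sysstat package.)
mpstat 1 5

# One batch-mode snapshot of the top CPU consumers, to confirm the zq
# processes are each taking a roughly equal share.
top -b -n 1 -o %CPU | head -n 25
```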

After the iterations in the linked PRs, eventually landing at zq commit 51553cf, performance was significantly improved.

[chart: benchmark results after optimization]

To cherry-pick a particular data point: with only 6 parallel zq processes, we now make it through the 3 GB of Parquet files in 78 seconds. This throughput was also achieved with significantly less CPU, as only 36% of the total host CPU was consumed.