GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
GNU Affero General Public License v3.0

feat: Parallelize source row chunks to record batches #2095

Open scsmithr opened 1 year ago

scsmithr commented 1 year ago

Currently the SQL Server, Postgres, and MySQL data sources iterate over chunks and process rows sequentially. We could look into using rayon to parallelize this (in addition to partitioning).

I would think we would want both. Parallelizing across multiple partitions would help with larger datasets that require several partitions, and would lend itself well to distributed execution. Without parallelizing the lower-level work, we'll likely run into slower compute once we distribute the partitions, since each partition would still process each column sequentially.

This is definitely not blocking the PR approval, just something to think about (it looks like our other SQL readers follow the same pattern anyway).

_Originally posted by @universalmind303 in https://github.com/GlareDB/glaredb/pull/2018#discussion_r1391725095_
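A minimal sketch of what the column-level parallelism suggested above could look like with rayon. The `Value` enum and `rows_to_columns` are hypothetical stand-ins, not existing GlareDB types; the real conversion would write into Arrow array builders rather than plain `Vec`s, and partition-level parallelism would sit a layer above this.

```rust
// Sketch only: `Value` and `rows_to_columns` are hypothetical placeholders.
use rayon::prelude::*;

#[derive(Clone, Debug)]
enum Value {
    Int(i64),
    Text(String),
}

/// Transpose a buffered chunk of rows into columns, building each column on
/// its own rayon worker. Column construction is CPU-bound, so it can benefit
/// from parallelism even though the rows were fetched sequentially from a
/// single cursor.
fn rows_to_columns(rows: &[Vec<Value>], num_cols: usize) -> Vec<Vec<Value>> {
    (0..num_cols)
        .into_par_iter()
        .map(|col| rows.iter().map(|row| row[col].clone()).collect())
        .collect()
}

fn main() {
    let rows = vec![
        vec![Value::Int(1), Value::Text("a".into())],
        vec![Value::Int(2), Value::Text("b".into())],
    ];
    println!("{:?}", rows_to_columns(&rows, 2));
}
```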

tychoish commented 11 months ago

Rayon looks cool, and I think parallelism here would be nice.

I've been thinking about this, and I think it's worth clarifying whether we're proposing reading cursors in parallel (this seems not particularly useful given that cursors aren't typically thread safe, and it's mostly moving data around in memory, which might suffer when run in parallel). By contrast, decoupling reading from the cursor from moving data out of row tuples into record batches seems like something that could usefully be split up.

Also, if the read and write sides of this process are split up, then buffering a little (both of incoming rows from the row stores and of the assembled record batches) could have an outsized impact on performance and reduce observable latency. I'm thinking two to three record batches' worth of rows/record batches would keep things moving (unless the consumer of the record batches was wicked fast).

It'd be interesting to know if we could predict whether a consumer of rows was going to need to hold an entire result set in memory, in which case reading rows into some kind of unbounded deque might be pretty good. If the results could be consumed and discarded, then having a fixed-size buffer might make more sense.
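A minimal sketch of that split, assuming plain std threads and a small bounded channel; the capacity of 3 stands in for the "two to three batches' worth" of buffering mentioned above, and `Row` plus the transpose are placeholders for the real cursor rows and Arrow record batch assembly.

```rust
// Sketch only: `Row` and the conversion below are hypothetical placeholders.
use std::sync::mpsc::sync_channel;
use std::thread;

type Row = Vec<i64>;

fn main() {
    // Bounded buffer: the reader can run at most a few chunks ahead of the
    // converter, which caps memory use. Swapping in an unbounded channel
    // would cover the "hold the whole result set" case instead.
    let (tx, rx) = sync_channel::<Vec<Row>>(3);

    // Reader side: owns the (non-thread-safe) cursor and does only IO,
    // pushing raw row chunks into the channel.
    let reader = thread::spawn(move || {
        for chunk_idx in 0i64..5 {
            let chunk: Vec<Row> = (0..4i64).map(|i| vec![chunk_idx, i]).collect();
            if tx.send(chunk).is_err() {
                break; // converter hung up
            }
        }
    });

    // Converter side: CPU-bound work of turning row chunks into columnar
    // batches, decoupled from the cursor.
    for chunk in rx {
        let num_cols = chunk.first().map(|r| r.len()).unwrap_or(0);
        let columns: Vec<Vec<i64>> = (0..num_cols)
            .map(|c| chunk.iter().map(|r| r[c]).collect())
            .collect();
        println!("assembled batch with {} columns", columns.len());
    }

    reader.join().unwrap();
}
```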

universalmind303 commented 11 months ago

> I think it's worth clarifying whether we're proposing reading cursors in parallel (this seems not particularly useful given that cursors aren't typically thread safe

Generally the way I've seen parallelism handled for other IO connectors in polars & arrow is to try to decouple the IO-bound activities from the CPU-bound ones as much as possible. That'd likely entail reading the data as-is from the database as quickly as possible (IO), then parallelizing the conversion of the data to arrow and the creation of the columns (CPU).
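A minimal sketch of that IO/CPU split on an async runtime, assuming tokio; `fetch_chunk` and `chunk_to_columns` are hypothetical stand-ins for the datasource fetch and the Arrow conversion (the conversion itself could additionally be fanned out with rayon).

```rust
// Sketch only: the fetch and conversion functions are placeholders, not
// existing GlareDB APIs.
use tokio::task;

/// IO-bound: pull the next chunk of rows from the remote database cursor.
async fn fetch_chunk(chunk_idx: i64) -> Option<Vec<Vec<i64>>> {
    if chunk_idx < 3 {
        Some((0..4).map(|i| vec![chunk_idx, i]).collect())
    } else {
        None
    }
}

/// CPU-bound: transpose a row chunk into columns (in the real code this
/// would build Arrow arrays / a record batch).
fn chunk_to_columns(rows: Vec<Vec<i64>>) -> Vec<Vec<i64>> {
    let num_cols = rows.first().map(|r| r.len()).unwrap_or(0);
    (0..num_cols)
        .map(|c| rows.iter().map(|r| r[c]).collect())
        .collect()
}

#[tokio::main]
async fn main() {
    let mut chunk_idx = 0;
    while let Some(rows) = fetch_chunk(chunk_idx).await {
        // Run the conversion on the blocking pool so it doesn't stall the
        // async executor that's driving the IO.
        let columns = task::spawn_blocking(move || chunk_to_columns(rows))
            .await
            .expect("conversion task panicked");
        println!("chunk {chunk_idx}: {} columns", columns.len());
        chunk_idx += 1;
    }
}
```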

tychoish commented 11 months ago

> Generally the way I've seen parallelism handled for other IO connectors in polars & arrow is to try to decouple the IO-bound activities from the CPU-bound ones as much as possible. That'd likely entail reading the data as-is from the database as quickly as possible (IO), then parallelizing the conversion of the data to arrow and the creation of the columns (CPU).

Yup, that makes sense, and I think we're speaking to the same thing. I think "amount of buffering" and RAM use end up being the real variables here.