Merge the small read io

GreptimeTeam / greptimedb

An open-source, cloud-native, unified time series database for metrics, logs and events with SQL/PromQL supported. Available on GreptimeCloud.

https://greptime.com/

Apache License 2.0


Merge the small read io #3072

Closed: WenyXu closed this issue 1 month ago

WenyXu commented 8 months ago

> Too many small requests could result in an expensive bill; there is an optimization we can do in the future (maybe not in this PR) for object stores like S3. If the `ranges` are almost continuous, we can merge them into a large chunk and fetch that chunk concurrently in pieces of the preferred size.

_Originally posted by @WenyXu in https://github.com/GreptimeTeam/greptimedb/pull/2959#discussion_r1432230759_
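
Editor's illustration of the idea quoted above: a minimal sketch, not code from the greptimedb repository. It assumes byte ranges are represented as `Range<u64>`. The names `merge_ranges`, `split_range`, `max_gap`, and `chunk_size` are hypothetical and chosen only for this example. Ranges whose gap is at most `max_gap` bytes are coalesced, and each merged range is then cut into pieces of at most `chunk_size` bytes that could each be fetched by a separate concurrent request. The actual fetching is left out.

```rust
use std::ops::Range;

/// Merge ranges whose gap to the previous range is at most `max_gap` bytes.
/// `max_gap` is an illustrative parameter, not a value taken from the project.
fn merge_ranges(ranges: &[Range<u64>], max_gap: u64) -> Vec<Range<u64>> {
    // Sort by start offset so that neighbours in the list are neighbours in the file.
    let mut sorted: Vec<Range<u64>> = ranges.to_vec();
    sorted.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in sorted {
        if let Some(last) = merged.last_mut() {
            // Close enough to (or overlapping) the previous range: extend it.
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        // Too far away (or first range): start a new merged range.
        merged.push(r);
    }
    merged
}

/// Split one merged range into pieces of at most `chunk_size` bytes,
/// so that each piece can be fetched by its own request.
fn split_range(range: &Range<u64>, chunk_size: u64) -> Vec<Range<u64>> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut pieces = Vec::new();
    let mut start = range.start;
    while start < range.end {
        let end = start.saturating_add(chunk_size).min(range.end);
        pieces.push(start..end);
        start = end;
    }
    pieces
}
```

With this sketch, a caller would first run `merge_ranges` on the requested ranges, then apply `split_range` to each result and issue one read per piece. After the reads complete, the originally requested ranges still have to be sliced back out of the fetched bytes.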

WenyXu commented 8 months ago

See also:

L-Fiori commented 8 months ago

Hello, I'm new to the project and I would like to contribute. Not sure if this is a good first issue though.

As I understand it, you're concurrently fetching data from Parquet files using a function that receives a vector of ranges, and your concern is that too many small requests can lead to expensive bills, so an optimization would be to merge ranges that are close together before fetching data, is that right? In this case, what would be a reasonable distance between the ranges? As I am not familiar with the kind of data that is being fetched.

WenyXu commented 8 months ago

> Hello, I'm new to the project and I would like to contribute. Not sure if this is a good first issue though.
>
> As I understand it, you're concurrently fetching data from Parquet files using a function that receives a vector of ranges, and your concern is that too many small requests can lead to expensive bills, so an optimization would be to merge ranges that are close together before fetching data, is that right? In this case, what would be a reasonable distance between the ranges? As I am not familiar with the kind of data that is being fetched.

Thanks @L-Fiori. My bad; my colleague @QuenKar is working on this. I forgot to update this issue.

> In this case, what would be a reasonable distance between the ranges?

This is the key question for this issue, and my colleague is doing some benchmarking to figure it out. We will use the benchmark results to select an optimized range distance (the results may be posted in related PRs).
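
Editor's illustration of the trade-off behind that choice: a rough sketch, not the benchmark mentioned above and not code from the repository. A larger merge distance means fewer requests but more bytes read that nobody asked for. The hypothetical helper `merge_cost` below counts both for a given `max_gap`; the struct `MergeCost` and all names are assumptions made for this example. It ignores latency and per-request pricing, which a real benchmark would have to measure.

```rust
use std::ops::Range;

/// Cost of one merge distance: number of requests and extra bytes read.
#[derive(Debug, PartialEq)]
struct MergeCost {
    requests: usize,
    wasted_bytes: u64,
}

/// Estimate the cost of merging `ranges` when gaps of up to `max_gap`
/// bytes are bridged. Purely arithmetic; no I/O is performed.
fn merge_cost(ranges: &[Range<u64>], max_gap: u64) -> MergeCost {
    let mut sorted = ranges.to_vec();
    sorted.sort_by_key(|r| r.start);

    let mut requests = 0usize;
    let mut wasted_bytes = 0u64;
    // End offset of the merged range currently being built, if any.
    let mut current_end: Option<u64> = None;
    for r in sorted {
        match current_end {
            Some(end) if r.start <= end.saturating_add(max_gap) => {
                // Bytes in the gap are fetched but never used
                // (zero when the ranges touch or overlap).
                wasted_bytes += r.start.saturating_sub(end);
                current_end = Some(end.max(r.end));
            }
            _ => {
                // Gap too large (or first range): this costs a new request.
                requests += 1;
                current_end = Some(r.end);
            }
        }
    }
    MergeCost { requests, wasted_bytes }
}
```

For the ranges `0..10`, `12..20`, and `100..110`, a `max_gap` of 0 gives 3 requests and 0 wasted bytes, a `max_gap` of 4 gives 2 requests and 2 wasted bytes, and a `max_gap` of 100 gives 1 request and 82 wasted bytes. Sweeping `max_gap` over real access patterns in this way shows where the request count stops dropping quickly.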

L-Fiori commented 8 months ago

Interesting! I'll stay tuned for other issues I might want to tackle, thanks for the reply ;)

tisonkun commented 4 months ago

@WenyXu This issue seems stale. Do we have other updates?

Or @L-Fiori maybe you can just take over this issue and submit a patch >_<

WenyXu commented 4 months ago

> @WenyXu This issue seems stale. Do we have other updates?

We still send a lot of small IO requests.