Closed WenyXu closed 1 month ago
Hello, I'm new to the project and I would like to contribute. Not sure if this is a good first issue though.
As I understand it, you concurrently fetch data from Parquet files using a function that receives a vector of ranges. Your concern is that too many small requests can lead to expensive bills, so an optimization would be to merge ranges that are close together before fetching the data. Is that right? In this case, what would be a reasonable distance between the ranges? I ask because I am not familiar with the kind of data being fetched.
Thanks @L-Fiori. My bad; my colleague @QuenKar is working on this. I forgot to update this issue.
> In this case, what would be a reasonable distance between the ranges?
This is key to this issue, and my colleague is doing some benchmarking to figure it out. We will use these benchmark results to select an optimized range distance (the benchmark results may be posted in related PRs).
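For readers following along, the range-merging idea discussed above can be sketched roughly as below. This is only an illustration, not the project's implementation: the function name `merge_ranges` and the `max_gap` parameter are hypothetical, and the right value for `max_gap` is exactly what the benchmarks mentioned above are meant to determine.

```rust
use std::ops::Range;

/// Merge byte ranges whose gap is at most `max_gap` bytes.
///
/// Hypothetical helper for illustration only. `max_gap` is an assumed
/// tuning knob; the thread leaves its actual value to benchmarking.
fn merge_ranges(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    // Sort by start offset so that nearby ranges become adjacent.
    ranges.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<u64>> = Vec::with_capacity(ranges.len());
    for r in ranges {
        if let Some(last) = merged.last_mut() {
            // Overlapping, touching, or within `max_gap` bytes of the
            // previous range: extend the previous range instead of
            // issuing a separate request.
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        // Too far from the previous range: keep it as its own request.
        merged.push(r);
    }
    merged
}
```

For example, `merge_ranges(vec![0..10, 12..20, 100..110], 4)` would collapse the first two ranges into `0..20` and leave `100..110` alone. The trade-off is that a larger `max_gap` means fewer requests but more unneeded bytes fetched from the gaps, and the caller still has to slice each merged buffer back into the originally requested ranges.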
Interesting! I'll stay tuned for other issues I might want to tackle, thanks for the reply ;)
@WenyXu This issue seems stale. Do we have other updates?
Or @L-Fiori maybe you can just take over this issue and submit a patch >_<
> @WenyXu This issue seems stale. Do we have other updates?
We still send a lot of small IO requests.
_Originally posted by @WenyXu in https://github.com/GreptimeTeam/greptimedb/pull/2959#discussion_r1432230759_