declanvk / blart

https://declanvk.github.io/blart/
Apache License 2.0
28 stars 2 forks source link

Question on usage #40

Open ccollie opened 1 month ago

ccollie commented 1 month ago

I have another use-case i'd like to use blart for. Assume I have a series of chunks bounded by timestamp and containing samples.

pub struct Sample {
  pub timestamp: Timestamp,
  pub value: f64
}

pub struct Chunk {
  pub start: Timestamp,
  pub end: Timestamp,
  pub samples: Vec<Sample>
}

I store multiple chunks per timeseries, and later I need to upsert a Sample in a chunk. For this i need to efficiently locate a chunk into which to insert the sample.

It seems to me that chunks.range(new_sample.timestamp..) won't do the trick, but I can't see any means short of iteration from the start.

Edit: chunks would be indexed by chunk.start

declanvk commented 1 month ago

So you would be using a TreeMap<Timestamp, Chunk> as your index structure? And you mentioned that the key would be chunk.start?

I think it would possibly work if you did chunks.range(..=new_sample.timestamp).rev(). The ..=new_sample.timestamp will give you an iterator of chunks where the chunk.start <= new_sample.timestamp. The .rev() is so that the first element in the iterator will be the most likely match.

One underlying assumption I have is that the byte representation of Timestamp is ordered in the same way that Timestamp is


Somewhat off-topic questions:

For this i need to efficiently locate a chunk into which to insert the sample.

When you insert a new sample into a chunk, do you update the start and end bounds of the chunk? Or is it the case that you'd instead insert a new chunk if the new_sample.timestamp is outside the bounds of any existing chunk?

ccollie commented 1 month ago

So you would be using a TreeMap<Timestamp, Chunk> as your index structure? And you mentioned that the key would be chunk.start?

Yes.Timestamp is essentially an alias for i64

When you insert a new sample into a chunk, do you update the start and end bounds of the chunk? Or is it the case that you'd instead insert a new chunk if the new_sample.timestamp is outside the bounds of any existing chunk?

In this case I'm specifically looking at upserts. Imagine a TimeSeries struct that tracks its range..

pub struct TimeSeries {
  pub start: Timestamp,
  pub end: Timestamp,
  pub chunks: TreeMap<Timestamp, Chunk>
}

Checking an incoming Sample against the range we can determine if we need to upsert or add a new chunk. It is of course possible that a chunk may have to be re-indexed (we also want to handle the case when chunks are imbalanced w.r.t the number of samples)

declanvk commented 1 month ago

Yes.Timestamp is essentially an alias for i64

One thing to note if you're using an i64 (or any unsigned integer) as a key, is that you'll need to apply a mapping so that the ordering of the byte representation matches the ordering of the type. https://docs.rs/blart/latest/blart/struct.ToIBE.html#impl-BytesMapping%3Ci32%3E-for-ToIBE is the impl you're looking for, and you would combine it with the Mapped<_, _> struct like so:

use blart::{TreeMap, Mapped, ToIBE, BytesMapping};

// assumption
type Timestamp = i64;

type Map = TreeMap<Mapped<ToIBE, Timestamp>, Chunk>

This does make keys a little more difficult to use sadly, but not impossible. Let me know about any difficulties in this part or any proposals to make this translation more transparent. Its definitely a rough area of the crate