Open oli-w opened 1 week ago
Hello!
It's a good question, and I'm not certain there will be satisfying answers. I think the two ideas you've proposed run in to other constraints, so let me talk those through.
First, the 'a
lifetime exists to allow references in to storage to outlive the cursor itself, which will continually change as you march around in it. The aim is to allow one to pull down many keys, vals, times, diffs, etc, or references to them, even while the cursor is changing. If you allow the 'a
to depend on the cursors &mut self
lifetime, this all goes away (you can't re-borrow the cursor to mutate it as long as you hold a reference). I think this also confounds the approach of using &'a mut
for storage references, as the mutable borrow also wouldn't be available again as long as the key reference exists.
Minor, but I think the problem is probably not the key
function (which just reads) and can probably be localized into seek_key
and step_key
, which are the moments that one changes which key the cursor is looking at.
But the core problem that I see is that if you want an LRU cache, or something like this, .. it will invalidate some data in response to requests for data it needs to bring in. How does this "lifetime" get communicated to Rust? I think the problem is that it is not a lexical lifetime, which means that Rust is not obviously well equipped to express it. Hypothetically, if you have a one element cache, then every new access is going to invalidate previous elements, and none of the DD code will work anyhow.
An alternate approach, making this up ad lib, is that the Key<'_>
type should behave more like an Rc<_>
. It represents the ability to acquire the data, rather than the data itself. When you try and access it, that might cause some cache replacement work to happen behind the scenes. But as long as you hold it you are certain to be able to access through it.
This is a bit like what we've done a few times with DD, which is "let virtual memory handle it". If you treat it as just .. data that Rust has, potentially lots more than you have physical memory, there are a few controls that let you respond to "ack; I don't have the data mapped in". Virtual memory is one, and I think @antiguru may be able to speak to how lg_alloc
ties in and acts as a more intentional paging layer.
I'll keep pondering, but it does seem tricky to both 1. maintain "references" that live longer than the borrow of the cursor, and 2. adapt resources in response to the cursor position. If you see other idioms out there where folks have succeeded with this, lmk!
An alternate approach, which is a pretty substantial (but not unmotivated) re-design, would be to have DD expose the ability to chunk data more explicitly. Right now you can partition data between workers, which is useful. But perhaps each worker would further like to chunk their data down to smaller key ranges, say. This would allow them to "page in" the data for a chunk of keys, do that work without modifying the backing store, and then release those resources as it moves on to other chunks of keys.
It doesn't address the question of what if the data for one key is too large to fit, but I guess that will always exist (as long as one is allowed to say "I only have 1KB; what now").
Ahah! Thank you, this all makes sense.
An alternate approach, making this up ad lib, is that the
Key<'_>
type should behave more like anRc<_>
. It represents the ability to acquire the data, rather than the data itself.
This works for me 😃. I store data on files and can promise that it can be accessed at any time during program execution. I was originally thrown by the Copy
constraint, but realise that I can create a KeyRef<T>
which just stores a few Copy
-able properties: a file index, range index and key index. I implemented an LruCache
and file metadata with a few thread-local statics. This makes joining with these arrangements a little bit trickier since the other arrangement also needs to have the same KeyRef
type, but I've got some ideas to play around with.
Keen to hear more about lg_alloc
integration and experience using mmap
. I read some of the code in Materialize for hooking up lg_alloc
- it does seem really neat to just pretend like you've got tons of memory and let the OS page to disk.
Ah, the Copy
constraint may be negotiable. I think it emerged from the generalization from &Key
to Key<'_>
and the thought was "this will keep us on track" more than "this is a rule!". I suspect it could be relaxed to Clone
without much concern, though with the understanding that this might happen not uncommonly as buffer of these things get slung around.
I'm trying to implement disk-backed arrangements, where updates are stored as immutable batches on disk and chunks of the batch are loaded into memory as you seek around the cursor.
I have set up:
DiskBatch
that holds a path to a file containing the data, and a mapping of key ranges (chunks) to offsets in the file. TheBuilder
that outputsDiskBatch
collects and sorts the data in memory and then writes it out to disk inBuilder::done
.DiskCursor
that implementsCursor
and usesDiskBatch
as theStorage
associated type, withrange_pos
,key_pos
andval_pos
fields (similar toOrdValCursor
).The problem I have run into is trying to add the caching layer - ideally an LRU cache of chunks. The
key
method ofCursor
is this:Let's say for example that the
Key<'a>
associated type is hard-coded to&'a String
. The lifetime on&'a Self::Storage
and the return type means that the "cache" must belong to the storage. Since there isn't a mutable reference to storage, I believe the only option is to use aRefCell
(maybe around aHashMap
), but I can't return a&'a
reference to content inside aRefCell
-RefCell::borrow()
returns aRef
with a lifetime that only lasts until the end of the current block.I tried changing
Key<'a>
to an ownedString
orRef<String>
, but this is not possible because these don't implementCopy
.Ideas:
&self
to&mut self
on key and other relatedCursor
methods, and update the lifetimes to allow theKey<'a>
to be returned from the lifetime attached toself
/Cursor
. Or maybe just update the lifetimes without making the referencemut
- mutation can occur during step / seek / rewind. I instinctively thought that theCursor
would be the most appropriate place for a cache, since it already holds other mutable values, while keepingStorage
immutable at all times.&'a Self::Storage
to&'a mut Self::Storage
, allowing the cache to be implemented withHashMap
orlru::LruCache
, without theRefCell
wrapper.Please let me know if I've missed some other way to achieve this. This worked really nicely for
Copy
types likeu32
, but I've had no luck with other types.