Closed Hoverbear closed 9 years ago
Is there a reason that you want the data to be line-separated? If you treat this file as just a binary blob, you could skip the Base64 step and get good performance and compression (if you want compression).
@TyOverby There is no particular reason! One concern with the approach of a blob is you'd have to store the entire thing in memory to work with it.
If the entries that are being stored are constant in size, then you could seek to the entry_size * i
byte in the file and then read from there.
If that isn't an option, you could have two files. One with a binary blob, one with a hashmap of EntryNumber
to Position
, where position is a byte offset into the file.
With either of these, reads and writes will be absurdly fast, as you don't need to scan through the lines, and you just deserialize the entries that you want.
@TyOverby Hmm, I don't think I can make a constant size assumption, I'm trying to keep the library as generic as possible!
The hashmap idea is really clever, I like it! I guess you could presumably store the hashmap in memory even since it would be a constant (u64, u64)
, or at least a good hunk of it. Since most updates to the hashmap would be fairly small, is there a good way to "sync" the file?
Yeah, you would want to leave the "hashmap" in memory. On disk, it would probably be a list of (u64, u64)
. Because (u64, u64)
is constant size, you can use direct-indexing into that if you wanted to do an update, and appending to it would be as easy as appending to the file.
I'm not sure that this is the most elegant way to do things. A database is probably much more suited to the task.
@Hoverbear, have you looked at using a key/value storage engine like LevelDB, LMDB or WiredTiger? I'm unsure of whether those have Rust bindings in crates.io yet, but if they do I think they'd suit your use case. They're all built on more or less the same core API that lets you:
In your case, the keys would be line numbers and the values would be the line contents. Everything would be written to a single file per log with no requirement to maintain an in-memory data structure -- you could allow the log to outgrow RAM.
@TyOverby Yes I think that might be the right solution, I just need to pick something simple and embedded since I have fairly limited requirements.
@zslayton Yes this was discussed on Reddit!. It looks like there are LMDB bindings. Thank you!
I think that finding some sort of embedded database that is simple and cheap enough is the way to go.
redis-rs might be worth a look too.
There's also these bindings to LMDB https://github.com/vhbit/lmdb-rs
@danburkert had some awesome ideas here and contributed a patch.
This by @TyOverby would probably be useful instead of doing JSON -> Base64.
It might be worthwhile to look into the possibility of bincoding a
Vec<(u64, T)>
and somehow syncing it to disk instead of doing it line by line. One downside of this is that we'd need to keep the entire log in memory, and that sucks!