efficient / libcuckoo

A high-performance, concurrent hash table

Feature Request: Serialization/Persistence #58

Closed dnbaker closed 5 years ago

dnbaker commented 7 years ago

Have you considered implementing writing to and reading from disk?

I've been looking for a high-performance concurrent hash table for scientific computing applications. So far I've been using khash from https://github.com/attractivechaos/klib, which is not threadsafe, because serializing khash with fundamental key and value types is relatively trivial. The price is performance, since only one thread can update the table at a time.

Libcuckoo has been great for use in online applications, and it'd be nice to be able to use it for offline applications as well. I imagine others might find persistence useful.

manugoyal commented 7 years ago

Yeah that'd be a great feature! I'm assuming you'd want the serialization to be threadsafe, i.e. the table is locked while the ser/de is occurring? If so, I think adding it to the locked table interface would be ideal. That way you could lock the table, serialize it, and then unlock it. Does that sound like something that'd work for your use case?

dnbaker commented 7 years ago

Right. It'd definitely need to be locked for serialization. I think a LockedTable function would be perfect. I don't know if you'd want to hook it into the constructor of the SingleTable standard map. [Edit: I've spent more time in the cuckoofilter codebase. I must have scrambled the class names.]

For my cases, I'm usually only building a database once, though I would like the flexibility to load, modify, and write back to disk as well. Both of those should be fully supported by such an implementation.

manugoyal commented 7 years ago

Right. We already have a subobject called locked_table, which you can create by calling the lock_table method. So it should be simple to support loading via constructor, loading from a locked table, and writing from a locked table. Modification can be done through the locked table, which is single threaded, or through the concurrent hash map, which is multithreaded.
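As a rough illustration of the lock, serialize, unlock flow discussed above, here is a minimal sketch. It does not use libcuckoo: `ToyTable`, `LockedView`, `save`, and `load` are hypothetical names, the table is just a mutex-guarded `std::unordered_map<int, int>`, and the text format is an arbitrary choice for the example. Only the name `lock_table` is borrowed from the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <mutex>
#include <ostream>
#include <sstream>
#include <unordered_map>

// Hypothetical stand-in for a concurrent table; NOT libcuckoo's real API.
class ToyTable {
private:
    std::mutex mutex_;
    std::unordered_map<int, int> data_;

public:
    // RAII view: holds the table's mutex for its whole lifetime, so
    // save() and load() always see a consistent table.
    class LockedView {
    public:
        explicit LockedView(ToyTable &t) : table_(t), guard_(t.mutex_) {}

        // Write "count\nkey value\n..." as plain text.
        void save(std::ostream &os) const {
            os << table_.data_.size() << '\n';
            for (const auto &kv : table_.data_) {
                os << kv.first << ' ' << kv.second << '\n';
            }
        }

        // Replace the table contents with data written by save().
        bool load(std::istream &is) {
            std::size_t n = 0;
            if (!(is >> n)) return false;
            std::unordered_map<int, int> fresh;
            for (std::size_t i = 0; i < n; ++i) {
                int k = 0, v = 0;
                if (!(is >> k >> v)) return false;
                fresh[k] = v;
            }
            table_.data_.swap(fresh);
            return true;
        }

    private:
        ToyTable &table_;
        std::unique_lock<std::mutex> guard_;
    };

    // Ordinary thread-safe operations take the mutex per call.
    void insert(int key, int value) {
        std::lock_guard<std::mutex> g(mutex_);
        data_[key] = value;
    }

    bool find(int key, int &out) {
        std::lock_guard<std::mutex> g(mutex_);
        auto it = data_.find(key);
        if (it == data_.end()) return false;
        out = it->second;
        return true;
    }

    LockedView lock_table() { return LockedView(*this); }
};

// Round trip: lock and save one table, then lock and load another.
bool roundtrip_demo() {
    ToyTable a;
    a.insert(1, 10);
    a.insert(2, 20);

    std::stringstream buf;
    {
        auto view = a.lock_table();  // table is locked from here...
        view.save(buf);
    }                                // ...and unlocked here

    ToyTable b;
    {
        auto view = b.lock_table();
        if (!view.load(buf)) return false;
    }
    int v1 = 0, v2 = 0;
    return b.find(1, v1) && b.find(2, v2) && v1 == 10 && v2 == 20;
}
```

Tying the lock to the lifetime of the view object means the table cannot be written out half-updated, and it is released automatically when the view goes out of scope.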

Are you only interested in whole table serialization/deserialization and not in per operation serialization (logging)?

dnbaker commented 7 years ago

I was only interested in the table, I think. What do you mean by logging -- status reports during writing?

manugoyal commented 7 years ago

I was thinking like database style snapshotting and logging. So like if you had some strong durability requirement and didn't want to serialize the whole table every time you do some ops, you could take snapshots of the table and then log subsequent operations, then restore from a snapshot and replay operations from the log.
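The snapshot-plus-log idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: `DurableTable`, `Op`, `snapshot`, and `recover` are hypothetical names, everything stays in memory (a real implementation would write the snapshot and the log to disk), and a plain `std::map` stands in for the hash table.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical names throughout; illustrates snapshot + operation log only.
enum class OpKind { Upsert, Erase };

struct Op {
    OpKind kind;
    int key;
    int value;  // ignored for Erase
};

using Table = std::map<int, int>;

// Apply one logged operation to a table.
void apply(Table &t, const Op &op) {
    if (op.kind == OpKind::Upsert) {
        t[op.key] = op.value;
    } else {
        t.erase(op.key);
    }
}

class DurableTable {
public:
    void upsert(int key, int value) { record({OpKind::Upsert, key, value}); }
    void erase(int key) { record({OpKind::Erase, key, 0}); }

    // Take a full copy of the table and start a fresh log.
    void snapshot() {
        snapshot_ = live_;
        log_.clear();
    }

    // Rebuild the state from the last snapshot plus the logged operations.
    Table recover() const {
        Table t = snapshot_;
        for (const Op &op : log_) apply(t, op);
        return t;
    }

    const Table &live() const { return live_; }
    std::size_t log_size() const { return log_.size(); }

private:
    void record(const Op &op) {
        log_.push_back(op);  // log first, then apply to the live table
        apply(live_, op);
    }

    Table live_;
    Table snapshot_;
    std::vector<Op> log_;
};

// Three operations after the snapshot are replayed to reach the live state.
bool recover_demo() {
    DurableTable t;
    t.upsert(1, 10);
    t.upsert(2, 20);
    t.snapshot();
    t.upsert(2, 25);
    t.erase(1);
    t.upsert(3, 30);
    return t.log_size() == 3 && t.recover() == t.live();
}
```

The trade-off is the one raised in the thread: each snapshot costs a full copy of the table, while the log only grows with the number of operations since the last snapshot.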

dnbaker commented 7 years ago

I see. Tracking diffs would be much more space efficient than keeping many copies of the database. I think that would be a very appealing feature.

manugoyal commented 7 years ago

Cool! I think the serialization would be a lot easier to implement and can be done first. The logging is more complicated and can be added subsequently.

manugoyal commented 7 years ago

Sorry this took forever. I just pushed some commits (b13ba76054e4675795b961d6af26ef8d1308643a and 1113c15da582b647b1efb0d0fc42fae819383b75) which enable serialization of the table in locked_table mode, on POD types only. Since it looks like your use-case is for C types, I believe this should work for you.

Check out the end of tests/unit-tests/test_locked_table.cc for some brief examples on how to do serialization. Let me know if this works for you.

Thanks! -Manu
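To illustrate why a restriction to POD types makes serialization straightforward, here is a minimal sketch of byte-wise writing and reading. It is not taken from the commits mentioned above: `write_raw`, `read_raw`, and `Record` are hypothetical names, and `std::is_trivially_copyable` is used here as an assumed stand-in for whatever check the library actually performs.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>

// Write the raw bytes of a value. This is only sound for types whose
// object representation can be copied byte by byte.
template <typename T>
void write_raw(std::ostream &os, const T &value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "byte-wise serialization needs a trivially copyable type");
    os.write(reinterpret_cast<const char *>(&value), sizeof(T));
}

// Read the raw bytes back into an existing object of the same type.
template <typename T>
bool read_raw(std::istream &is, T &value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "byte-wise deserialization needs a trivially copyable type");
    is.read(reinterpret_cast<char *>(&value), sizeof(T));
    return static_cast<bool>(is);
}

// Hypothetical plain-data value type, as might be stored in a table.
struct Record {
    std::uint64_t id;
    double score;
};

// Calling write_raw on a std::string would be rejected at compile time,
// because its bytes contain a pointer to heap memory, not the characters.
bool pod_roundtrip_demo() {
    std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
    const Record in{42, 0.5};
    write_raw(buf, in);

    Record out{};
    if (!read_raw(buf, out)) return false;
    return out.id == in.id && out.score == in.score;
}
```

Bytes written this way are tied to the layout and endianness of the machine that produced them, so a file written on one platform is not guaranteed to load correctly on another.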

dnbaker commented 7 years ago

Wonderful -- thank you! I'll be doing some testing soon. I'm sorry for taking so long to get back to you.

This does not include snapshotting, right? I'm not in a rush for the feature, but I wanted to check while we were discussing it.

manugoyal commented 5 years ago

Hey there! No snapshotting yet. I suppose that would be a fairly involved feature to implement, and I haven't gotten much time recently. Definitely would be great to have though!

dnbaker commented 5 years ago

Fantastic! Thanks for all of your hard work. I'm happy to close the issue for now, but feel free to reopen it as a reminder for incremental changes if you'd prefer.