Draft for storing the offline messages

ghost commented 7 years ago

I just created a concept of an offline message store and want to know what you think about it and if I should keep implementing it. All data in this draft is stored in the file system, so even low power devices like the Raspberry Pi can store a lot of data. Every offline-message data is identifyed by a key with, for example, 32 bytes and has a time frame for how long it should be stored. Let the root folder be "root". The root folder contians further folders which recreate a b-tree, but only as deep as needed to make a unique path to every single offline-message file.

Example: A file with the key "abcd..." would be found by the path "root\a\b\c\d..." If there are for example only two keys "abcd..." and "aabb..." in the offline store then we only need a path to "root\a\b" and "root\a\a" because that would be enough to distinguish the two files. This avoids folders which only contains 1 folder and saves space. So every folder in the b-tree has either a file or at least two other folders (branches). Every offline message is stored with two files: "info.bin" and "data.bin". "info.bin" contains the key and the time point until when the data has to be stored and "data.bin" contains the offline-message data. So a example folder could look like this: "root\a\b\info.bin, data.bin".

How can the files be deleted efficiently at the right time so the offline-message store does not store files after their time-out ?

I think this is the most important (innovative?) part of this design. In the root folder (or maybe somewhere else, not sure yet) there are files with the date and time as file name. For example "1.January1:00", "1.January2:00", "1.January3:00", ... (The exact time gap has to be figured out). They are created if a timeout of an offline message ends in the appropriate time frame. So every offline message key will be written in the appropriate file. If the clock has passt, for example, the 1.January, 2:00 , the program will search the file with the name "1.January1:00" and deletes all the files which belong to the keys in this file. If the clock pass the 1.January, 3:00 , all files which belong to the keys in the file "1.January2:00" are going to be deleted (The old date-files with the keys will be deleted, too).

Thoughts about this design: I think it will scale very well, because there is nearly no Ram needed. There is no need for a linked list or something like that to store the timeout-order in RAM. Anyway, the performance depends on the file-system. I think for file systems like btrfs or maybe APFS the offline message-store has a complexity of O(logN) which is limited through their file-name look-up. I do not think that it is a good idea to explicitely delete files when a tox-node signals that it has received everything, because this is just no reliable way and so a "back up" solution is needed anyway.

The API in c could look like this: void initDatastore(Datastore ds); void clearDataStore(Datastore ds); bool datastore_storeData(Datastore ds, uint8_t key[DATASTORE_KEYSIZE], unsigned int seconds, size_t size, void data); bool datastore_retrieveData(Datastore ds, uint8_t key[DATASTORE_KEYSIZE], void buffer); size_t datastore_getSizeOfData(Datastore ds, uint8_t key[DATASTORE_KEYSIZE]); size_t datastore_getFreeSpace(Datastore ds);

This is a pretty early draft, what do you think about it?

anoadragon453 commented 7 years ago

Whoops, this got lost in my notifications, will read over in a bit.

GrayHatter commented 7 years ago

It took me way to long to parse the file names... I thought you were \escaping the letters, not referencing directories.

Either way, scaling really isn't the issue at hand for something like this... and SQLite has already more or less solved on disk storage for small/embedded/low-power devices. But even if they hadn't toxcore doesn't touch the disk, and never should. Which means storing on disk is, IMO, not even something we could consider.

anoadragon453 commented 7 years ago

@Crymes Yeah, not sure I'm too sold on the file-system based design as well, but that's not to say the methods of searching and retrieving data can't continue to work inside of a database.

As far as the expirey times go, we can just add a column to each row that represents at which point the message will expire, and perhaps check/remove all expired messages every minute.

Eating up RAM shouldn't be a problem even for crazy amounts of stored messages since as @GrayHatter mentioned, a database scales pretty well. Using a database also provides more consistency as you never know how a file-system is setup/will act as there's just so many different types and configurations.

I like the idea of better ways to organize the data and searching for it than simply iterating through a database table, though I think some of the biggest issues for getting v1.0 of the proposal out the door has been keeping data safe even if a storage node suddenly disappears, as well as how nodes should communicate with each other and how to prevent spam and other denial of service attacks, so ideally these are what we should focus on for the time being. There's a few ideas that were in the etherpad that haven't been put in to the proposal so I'll find some time to migrate those over and clean everything up soon.

Super grateful for doing some digging on all this though, and wouldn't mind hearing any more ideas you have.

GrayHatter commented 7 years ago

@anoadragon453 you don't lack for people willing to store huge blocks of information... In the next spec I see, I'd suggest making most nodes store only an address/pointer to the node who's actually holding the message. Then, the network as a whole can support the finding of offline messages. Then messages of any size can be stored on someone with a few extra gigs of disk space, and want's to run a offline_message_node.

anoadragon453 commented 7 years ago

The data would need to be stored across multiple nodes in case the one holding a good amount of them fails or goes offline. As far as having pointers, as long as the data has some sort of content-addressable hash that we can find by going through node after node, getting ever closer, we should be good.

Nodes that are storing the same information don't need to point to each other either, as once you hit them you'll already have the data.

I think in this spec we intended for offline nodes to have a configurable max_size of data that they can store, with ~5MB being the default. People with tons and tons of space can just raise that limit.

On 06/30/2017 09:54 PM, Gregory Mullen wrote:

@anoadragon453 https://github.com/anoadragon453 you don't lack for people willing to store huge blocks of information... In the next spec I see, I'd suggest making most nodes store only an address/pointer to the node who's actually holding the message. Then, the network as a whole can support the finding of offline messages. Then messages of any size can be stored on someone with a few extra gigs of disk space, and want's to run a offline_message_node.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/anoadragon453/tox-offline-messaging-prop/issues/3#issuecomment-312410627, or mute the thread https://github.com/notifications/unsubscribe-auth/ABR7mMBxuZG_ctG1BgV3KoKtTI46Ju6-ks5sJdD-gaJpZM4OHL7r.

ghost commented 7 years ago

Thanks for your feedback. I think the main category of devices handling the offline stores are phones which charge over night and are connected to an wifi network ? I did not know about SQLite before but is seems to be a simple and extensible solution because it also could store data exclusively in RAM. Your draft sounds like every restart of toxcore/the device would delete all offline messages. Is this for security reasons ? I expect the average storage time to be 24 hours and there likely are some restarts of devices storing the messages. If we go with SQLite will there be a file like datastore.c and datastore.h which encapsulates the database handling or is it likely to be used straight in the code of your proposed 2. DHT chord-style network ? Either way I think there is a lot less work left with SQLite.

anoadragon453 commented 7 years ago

Your draft sounds like every restart of toxcore/the device would delete all offline messages. Is this for security reasons ?

I think that was from a misunderstanding from myself early on, upon hearing that toxcore never wrote to disk, I assume that once it shut down all data should vanish. Of course that makes no sense, the clients just seed toxcore with the data on startup.

It could've also been due to clients that go offline for a while will come online with stale data, but an expiry date on messages will take care of that.

That line should probably be removed/updated.

anoadragon453 / tox-offline-messaging-prop

Draft for storing the offline messages #3