ZeroIntensity / discobase

Database library using nothing but Discord. PyDis Codejam 2024.
https://discobase.zintensity.dev/
MIT License

Storage Mechanism #4

Closed: ZeroIntensity closed this issue 1 month ago

ZeroIntensity commented 1 month ago

So far, we've only speculated about how we're actually going to store our data. We need to set this in stone before writing an actual implementation.

Storage Hierarchy

Data will be stored in the following hierarchy, based on #3 (highest to lowest level):

Entry Format

Every message/database entry should simply be (possibly compressed) URL-safe base64-encoded pickled data (serialized with dill rather than pickle).
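
For concreteness, here's a minimal sketch of what encoding and decoding could look like (the helper names and the zlib compression are illustrative; "possibly compressed" is still an open question):

import base64
import zlib

import dill  # third-party: pip install dill


def encode_entry(obj: object, compress: bool = True) -> str:
    raw = dill.dumps(obj)
    if compress:
        raw = zlib.compress(raw)
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_entry(text: str, compressed: bool = True) -> object:
    raw = base64.urlsafe_b64decode(text.encode("ascii"))
    if compressed:
        raw = zlib.decompress(raw)
    return dill.loads(raw)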

Rubiks14 commented 1 month ago

Are we going to be loading the full 'table' into memory to then query it or are we going to have some way to query the channel (table) directly?

ZeroIntensity commented 1 month ago

Probably a bit of both. We can use an in-memory cache for faster lookups, and just update the cache upon edit.

ZeroIntensity commented 1 month ago

The biggest problem with the current mechanism is that there isn't any clear way to query based on existing data. For example, if the stored content is:

{"one": 1, "two": 2, "three": 3}

What would be the ideal way to find this entry if, e.g., the query was {"one": 1}? Right now, it seems like we might have to resort to an O(n) search that deserializes every entry along the way (which will be extremely slow). I do have a solution, but we would have to force documents to follow a specific schema (which would be user-defined). This might be beneficial anyway, so it's the solution I'm leaning towards, but let me know if you think of anything else.

Gimpy3887 commented 1 month ago

> would have to force documents to follow a specific schema (which would be user-defined)

I agree with this.

ZeroIntensity commented 1 month ago

Ok. If we have a schema, then we can know each of the keys beforehand. Instead of a per-table channel, we'll have a per-column channel (e.g. user.posts would have its own channel). If we're queried with {"one": 1}, per the example above, we could use Discord's search on that channel to query it. The downside of this is that it will take up slightly more storage, as each message will have to contain the IDs of the other messages.

The entry structure would look like this, per channel:

{"ids": [], "content": "..."}
Rubiks14 commented 1 month ago

I thought of a way that we could potentially index entries.

We could have the latest message in the channel contain an index which lets us at least narrow down our search. Say we have a field called name; the index would have A-Z buckets, with the message IDs for all names that start with each letter stored under that letter. When entries are added, we modify the index, delete the old index message, and store the new one at the end of the channel.

Then we aren't deserializing everything, just a few messages. We could make an index for each field if we needed to, then consult the index message that corresponds to the field being queried.
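
As a rough in-memory sketch of that first-letter index (names here are hypothetical; the real index would be serialized into the index message):

from collections import defaultdict

# Map each first letter to the message IDs of entries whose
# "name" field starts with that letter.
index: defaultdict[str, list[int]] = defaultdict(list)

def index_entry(name: str, message_id: int) -> None:
    index[name[0].upper()].append(message_id)

def candidates(name: str) -> list[int]:
    # Only these messages need to be fetched and deserialized.
    return index.get(name[0].upper(), [])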

enskyeing commented 1 month ago

> Ok. If we have a schema, then we can know each of the keys beforehand. […] The downside of this is that it will take up slightly more storage, as each message will have to contain the IDs of the other messages.

I haven't entirely fleshed this idea out yet, but we could potentially avoid storing the IDs by using Discord's message references: we store each entry in the table channel, but use message replies to link the parts together, so all the information is connected through references. This way we can use Discord's search feature to find the entry of {"one": 1}, and then find all the data connected to it through message references. I'm not entirely sure message references work this way, but I might do some testing to see if it's applicable here.

API reference link: https://discordpy.readthedocs.io/en/stable/api.html?highlight=message#discord.Message.reference

Edit: Something like this. It's not the prettiest, though, if we want the user to be able to access the db through the channel list. [screenshot]

ZeroIntensity commented 1 month ago

> I thought of a way that we could potentially index entries. We could have the latest message in the channel contain an index which lets us at least narrow down our search. […]

I think we would still have to require a schema for that.

ZeroIntensity commented 1 month ago

> I haven't entirely fleshed this idea out yet, but we could potentially avoid storing the IDs by using Discord's message references […]

I like this idea a lot! Is there a way to see what messages are referenced by another? In your example, the first message wouldn't be able to see the other parts of the data.

enskyeing commented 1 month ago

> I like this idea a lot! Is there a way to see what messages are referenced by another? In your example, the first message wouldn't be able to see the other parts of the data.

That was a limitation I've been trying to figure out from the API, but I haven't found a solution yet. A potential workaround: once a message is found, for example the first user_id, we could check the following messages in the channel to gather the rest of the data. But I'm not sure if it's possible to locate the following messages based on a message's location in the API, and it seems like this might just cause a channel-history-limit problem once the db gets large enough. Gonna keep looking and hopefully find something that works with references; otherwise we might have to resort to the columns-as-channels idea.

Rubiks14 commented 1 month ago

> Instead of a per-table channel, we'll have a per-column channel (e.g. user.posts would have its own channel). […]

Do we have to store each field in a separate message? Couldn't it all be stored in a single message? Then we could do much like relational databases do: each message from another table that relates to the base message just stores its channel ID and message ID. The channel ID may not even need to be stored.

ZeroIntensity commented 1 month ago

> Couldn't it all be stored in a single message?

Likely not, because of querying.

Rubiks14 commented 1 month ago

> Likely not, because of querying.

What about threads? I'm not sure what the limitations are on threads, but could we utilize them to store related data?

Never mind. Threads auto-archive.

ZeroIntensity commented 1 month ago

> I'm not sure if it's possible to locate the following messages based on a message's location in the API

I think we might be able to, and it won't be terrible for performance. If we store a separate ID (generated by us) with each message, then we can search for that ID and then query all results to finalize the fetch.

For example:

123 {"one": 1}
123 {"two": 2}
123 {"three": 3}

If we search for 123, it will come up with all the results. (Note that we don't have to worry about messages that store 123 in the content, because it will be pickled and encoded anyway.)

Rubiks14 commented 1 month ago

> If we store a separate ID (generated by us) with each message, then we can search for that ID and then query all results to finalize the fetch. […]

From what I can tell, there doesn't appear to be a way to search a channel for a message containing certain text with discord.py or the Discord API.

ZeroIntensity commented 1 month ago

Oh. That's an issue.

ZeroIntensity commented 1 month ago

We'll have to make a hash table out of the channel, then. The last message will contain the length, and we'll hash IDs and turn them into indices.

Rubiks14 commented 1 month ago

If we're agreed on making a schema for each table, then the last message could contain the total number of messages in the channel and the schema for the data stored.

We could also utilize the index messages that I talked about before. The ID for each index message could be stored in the schema alongside the field being indexed. Not every field would need to be indexed, but we could allow users to create indexes for heavily searched fields.

How we determine bucket sizes for the index would depend on the type of data, but a general A-Z for text, number ranges for integers/reals, and date ranges for dates would probably suffice for this type of project.

This would mean we wouldn't have to keep the whole database in memory all the time, only grabbing it when a query runs against non-indexed data. Or we could keep it in memory anyway but only search it when the queried data is not indexed.

Relationships can be indexed as well: each message ID from Table A is stored as an indexed field in Table B, with a list of message IDs underneath it that reference the related message from Table A, like a foreign key.

Having the index messages referenced in the schema/count message means we only have to worry about leaving that one message at the end of the channel. We could potentially store the indexes in their own channel then.

Edited to attempt to fix the grammar.

Rubiks14 commented 1 month ago

We could also pin the message that contains metadata about the table. You can search for pinned messages separately.

That way we don't have to worry about constantly deleting and recreating the message to make sure it's the last one.

ZeroIntensity commented 1 month ago

> We could also utilize the index messages that I talked about before. The ID for each index message could be stored in the schema alongside the field being indexed. […]

I'm not fully understanding (partially because I'm sleepy); could you elaborate and add some examples?

Rubiks14 commented 1 month ago

I'll try to give an example to see if it helps clarify my thought process. I may be overcomplicating this, and if I am, please feel free to let me know.

Channel(Table): bookmarked_post

---------------------------------
message_id: 83234829342342 <--- the message ID generated by Discord; not actually stored in the message
user_id: 23478234823423 <---- the ID of the user who bookmarked the post; this could be the user ID created by Discord
post_author: rubiks14
post_content: blah blah blah
---------------------------------
message_id: 67495345034594
user_id: 76359374573494
post_author: zerointensity
post_content: blah blah blah
---------------------------------
message_id: 46136468476612
user_id: 23478234823423
post_author: enskyeing
post_content: blah blah blah
---------------------------------
message_id: 53745934053453
user_id: 92349234829342
post_author: enskyeing
post_content: blah blah blah

Also in the channel, we have a pinned message that contains the metadata for the table

---------------------------------
user_id: { data_type: bigint, relationship_message_ids: [5345734053407342,] } <--- stores the message ID of the relationship index
post_author: { data_type: string, index_message_ids: [45349053450343045,] } <--- author of the post that was bookmarked
post_content: {data_type: string, index_message_ids: [] } <---- not indexing message content (we'd probably need a library to properly do that)
total_entries: 4 <---- how many items are stored in this table
---------------------------------

Now in another channel we could have the messages that contain the indexes. The indexes are used to make searching possible.

This would index all posts bookmarked by each user

relationship on user_id - message_id: 5345734053407342 <--- this message ID is stored in the table metadata
---------------------------------
user_id: 23478234823423 <--- user_id of the person who bookmarked the message
  message_ids: [
       83234829342342 < ---- message_id containing entry data about a bookmarked post
       46136468476612
  ]

user_id: 76359374573494
  message_ids: [
      67495345034594
  ]

user_id: 92349234829342
  message_ids: [
    53745934053453
  ]
---------------------------------

This would index bookmarked posts by their author.

search_index on post_author - message_id: 45349053450343045 <--- this message ID is stored in the table metadata
---------------------------------
post_author: rubiks14
  message_ids: [
     83234829342342 <--- message_id of an entry where rubiks14 is the post_author
  ]

post_author: zerointensity
  message_ids: [
     67495345034594
  ]

post_author: enskyeing
  message_ids: [
    53745934053453
    46136468476612
  ]
---------------------------------

When we initialize the database, we get the metadata and the index locations from the pinned metadata post. The content of each index message is then turned into dictionaries where the field value is the key and the set of message_ids is the value.

This example has the post_author as the bucket label for that particular index, but the bucket label could be something more generic, with more values placed inside each bucket. That would require a search through the bucket, but the bucket is a subset of the whole dataset, so the search would still be faster.

So, for this example, if we're looking for all bookmarked posts authored by rubiks14 and saved by user_id 23478234823423, we do a set intersection of the user_id relationship index and the post_author index.
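
In Python terms, that query would just be a set intersection (data copied from the example above):

by_user = {
    23478234823423: {83234829342342, 46136468476612},
    76359374573494: {67495345034594},
    92349234829342: {53745934053453},
}
by_author = {
    "rubiks14": {83234829342342},
    "zerointensity": {67495345034594},
    "enskyeing": {53745934053453, 46136468476612},
}

# Bookmarked posts authored by rubiks14 AND saved by 23478234823423:
matches = by_author["rubiks14"] & by_user[23478234823423]
# -> {83234829342342}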

I don't know what the character limit is on a message, but if the index gets too big, an additional message can be created for overflow and then attached to the metadata.

I hope that makes more sense. I'm also tired so I'm not the best at explaining it right now.

I have made some edits to hopefully make this easier to read. I usually need a couple of drafts to finally get something to where I feel like it is clear.

Rubiks14 commented 1 month ago

Sleep does the body good. After sleeping I realize that this is me trying to optimize lookup time too early. This definitely uses more memory than storing the full table in a hash.

At least for initial testing and prototyping, it would be sufficient to store the whole table in a dictionary/hash with the message ID as the key and the message content as the value. Then we do an iterative search whenever we need to query specific data.

We can use discord.TextChannel.history to initialize the dictionary.
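
Something along these lines (a sketch; load_table is a hypothetical name):

import discord

async def load_table(channel: discord.TextChannel) -> dict[int, str]:
    # limit=None walks the entire channel history.
    cache: dict[int, str] = {}
    async for message in channel.history(limit=None):
        cache[message.id] = message.content
    return cache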

ZeroIntensity commented 1 month ago

I think I'm not seeing this clearly, still. The message author isn't really useful, since it will always be the bot's ID (we won't be sending messages manually).

Rubiks14 commented 1 month ago

I'm sorry, I didn't specify the table details. This table is an example that stores messages a user has bookmarked. The bookmarked data contains the user ID of the user that bookmarked the message, the author that posted the original message being bookmarked, and the content of the message being bookmarked. The table itself could be anything; the metadata and indexes refer to the fields that the table has.

ZeroIntensity commented 1 month ago

> [How] we determine bucket sizes for the index would depend on the type of data, but a general A-Z for text

FWIW, all user data will be encoded in base64, meaning you only have to worry about those characters.

ZeroIntensity commented 1 month ago

I'm not really a fan of using pinned messages for "metadata"; it should just be the most recent message that stores it (deleted and replaced with every update). Otherwise, the only way to differentiate between the metadata message and an actual entry is to check whether it's in the pinned list, which is O(n) and won't scale well.

enskyeing commented 1 month ago

We could have a separate metadata channel for each table. Assuming each table would have an ID, we could have <table-id>-metadata and <table-id>-contents as channels per table. That way the metadata never gets in the way of our table data, and we don't have to worry about fumbling with deleting and resending messages every time; we can just edit in any new information. It would also allow additional metadata messages to be sent if the original hits Discord's character limit, which IIRC is around 2,000 characters.

Rubiks14 commented 1 month ago

I thought about making the metadata post the first post created when the table is made, so that it's always the oldest, but I like the idea of a separate metadata channel better.

ZeroIntensity commented 1 month ago

You'll have to explain your idea to me over call, @Rubiks14. My solution with key-per-channel is as follows:

Metadata is undecided, the last message or another channel could work. Message entries will look like this (apart from indentation):

{
    "references": /* message ID or null */,
    "content": "base64 pickled data for 1",
    "key": "key of the data, as a double check in case of a hash collision",
    "parts": [/* message IDs of other keys, e.g. two and three if this was one */]
}

The key to O(1) lookups here is to turn this into a hash table. We take the length from the metadata, however we end up doing that, and then use a hashing algorithm on the base64 string. For Python's hash(), we could use the following formula to calculate an index: (hash & 0x7fffffff) % size. The issue with hash(), though, is that it's based on a random seed set at startup time. We can get around this with the PYTHONHASHSEED environment variable, but a fixed, publicly known seed makes our database susceptible to hash-collision attacks. Likely, we'll have to generate a hash seed when the database is initially created, store that in the metadata, and reuse it for every run.
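
As a sketch, the index calculation itself is tiny (this assumes PYTHONHASHSEED was set to the stored seed before interpreter startup, so hash() is stable across runs):

def message_index(key: str, size: int) -> int:
    # size is the number of messages in the channel, per the metadata.
    return (hash(key) & 0x7fffffff) % size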

Gimpy3887 commented 1 month ago

> We could have a separate metadata channel for each table.

I like this idea; seeing metadata in a channel as a message mixed in with your data just doesn't feel right. The drawback, though, is that we'd end up with double the channels if we create a new metadata channel for each table.

I think the best idea could be a single channel full of metadata; that channel could be a forum channel to help with organization. But I haven't checked exactly which functions we'd need in d.py to control forum channels: I think they still use thread functions, because forum posts are just threads IIRC.

ZeroIntensity commented 1 month ago

I like the metadata channel idea as well, but I don't see why we would need a metadata channel per table; a single one that contains the metadata for every table should do fine. We don't have to think about this too much: it can be an O(n) search, since the number of tables generally won't be very high.

ZeroIntensity commented 1 month ago

A format like this should suffice:

{
    "name": "table name",
    "length": 0,
    "schema": "pickled pydantic schema",
}

Note that we won't actually be using the pickled schema at runtime; it's just an extra check to make sure the stored schema matches the user's. If it doesn't, we'll either have to run a migration or throw an error.
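
A rough sketch of that check, assuming pydantic v2 (model_fields) and dill for the pickling; migration handling is left out:

import dill
from pydantic import BaseModel

def schema_matches(stored: bytes, model: type[BaseModel]) -> bool:
    old = dill.loads(stored)
    # Compare field names and annotations between the stored schema
    # and the user's current model.
    return {name: field.annotation for name, field in old.model_fields.items()} == {
        name: field.annotation for name, field in model.model_fields.items()
    }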

Rubiks14 commented 1 month ago

> My solution with key-per-channel is as follows: […] The key to O(1) lookups here is to turn this into a hash table. […]

I was just thinking about this: what happens when you update a field? We're going to have to change its location in the hash table, which means we then have to go to the other parts and update their lists of parts with the new message ID of the changed part. That's fine for a single piece of data being updated, but what if we update all parts of the entry at the same time? How do we manage the order of updates to avoid something similar to a dangling pointer?

Actually, now that I think about it, we could put the changes in some form of list and just treat them as individual changes happening in succession.

ZeroIntensity commented 1 month ago

Per-key Channels

As an overview, each message represents a single "key" -- i.e., a key-value pair in a table. The key is specific to the channel. For example, if you defined the following schema for a table:

from pydantic import BaseModel

class User(BaseModel):
    name: str
    password: str

Then we would have two channels for User: User-name and User-password, which would store their values according to the format below (there will be no indentation or newlines in production; it's just for readability here):

Record Format

{
    "content": "base64 pickled data for 1",
    "parts": [/* message IDs of other keys, e.g. two and three if this was one */]
}

Here's a visual representation:

[diagram]

Storage and Lookup

Here's where it gets tricky. In order to be speedy, we have to do some magic with lookups. Roughly speaking, most serializable types in Python are also hashable, so we can use that to our advantage.

We'll store the metadata in a separate channel (the format is currently undecided, but it will likely just be simple JSON) -- this metadata will contain something important: the length. Similar to in-memory hash map implementations, we need to "preallocate" some space. We'll do this by sending a predefined number of messages to populate the channel after creation. (Note that null entries will be indicated by a message containing only ".".)

Now, we can use Python's hash() function to generate a hash for a table record. Hashes are "unique" to the data -- this will be important later. (As an implementation detail, we'll also have to store a cryptographically secure hash seed in the metadata mentioned previously, and then set the PYTHONHASHSEED environment variable.)

OK, so what do we do with a hash? A sample hash could look like this: -8647829515386124796. Now we need to turn this into an index, which we can do via this formula: (hash & 0x7fffffff) % size. If we had 16 messages in the channel, this becomes 4. That's an index, great! We can edit out the "." from the message at that index and put in the record format from above.
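
The arithmetic checks out in plain Python:

h = -8647829515386124796
size = 16
print((h & 0x7fffffff) % size)  # 4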

As a refresher, here's how basic storing works:

[diagram]

Hash collisions

Hash collisions are basically inevitable, especially in low-size tables (or channels, in this context). If we try to store something and the message at the calculated index is not ".", then we either have a hash collision or these two keys have the same value -- we don't really care which right now.

The solution is to simply iterate until we find an open index. With the previous representation, it could be visualized as:

[diagram]

Looking up hash collisions

We follow the exact same formula as storing, but we do have to deserialize along the way to check for equality. For example, with the diagram above, if the query was peter, we would compute the index and then iterate, skipping entries that aren't peter, until we find the one that is.
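
Here's a sketch of both operations, with a plain list standing in for the channel ("." marks a null slot; in reality each slot holds the record format, and the == check is a deserialize-and-compare):

NULL = "."

def store(slots: list[str], key: str) -> int:
    i = (hash(key) & 0x7fffffff) % len(slots)
    while slots[i] != NULL:
        i = (i + 1) % len(slots)  # collision: probe the next index
    slots[i] = key
    return i

def lookup(slots: list[str], key: str) -> int | None:
    i = (hash(key) & 0x7fffffff) % len(slots)
    while slots[i] != NULL:
        if slots[i] == key:
            return i
        i = (i + 1) % len(slots)
    return None  # hit a null slot: the key isn't stored

(This assumes the table never becomes completely full, which the resizing described below guarantees.)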

Table resizing

We'll have to store an additional field in the metadata that contains the actual number of messages stored. For example, we might have 16 messages (called the "size", in technical terms), but only 4 of them are non-null (called the "length"). When the length reaches the size, we need to resize the table. Generally, we'll want to do this by doubling the size of it (e.g. 2 -> 4, 4 -> 8, 8 -> 16, and so on.)

Unfortunately, every time we resize, we have to rewrite every message, because each entry's index changes when the size changes.
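
Continuing the sketch above, a resize is just "double and re-store":

def resize(slots: list[str]) -> list[str]:
    new_slots = [NULL] * (len(slots) * 2)
    for key in slots:
        if key != NULL:
            store(new_slots, key)  # every index changes with the new size
    return new_slots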

Ratelimits are an issue here, because Discord only allows us up to 50 requests per second. How are we supposed to update a table with a thousand entries in a timely manner? We'll store ratelimited changes in memory and pretend the changes were already made; in the background, we'll wait for the changes to actually go through, then remove them from memory. This may cause memory problems on very large databases, but there really isn't any way around that. It might be worth introducing some Memray unit tests, but that would be a finishing touch.
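
A sketch of that write-behind idea (apply is a hypothetical coroutine that performs the real Discord API call):

import asyncio

pending: dict[int, str] = {}  # message ID -> new content
queue: asyncio.Queue[int] = asyncio.Queue()

async def stage(message_id: int, content: str) -> None:
    pending[message_id] = content  # reads see the change immediately
    await queue.put(message_id)

async def flush_worker(apply) -> None:
    # Drain staged changes in the background, one request at a time,
    # pausing to stay under the ratelimit (~50 requests/second).
    while True:
        message_id = await queue.get()
        content = pending.pop(message_id, None)
        if content is None:
            continue  # an earlier queue entry already flushed this ID
        await apply(message_id, content)
        await asyncio.sleep(1 / 50)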

Please let me know if you have any questions!