[Closed] ZeroIntensity closed 1 month ago
Are we going to be loading the full 'table' into memory to then query it or are we going to have some way to query the channel (table) directly?
Probably a bit of both. We can use an in-memory cache for faster lookups, and just update the cache upon edit.
The biggest problem with the current design is that there isn't any clear mechanism for querying based on existing data. For example, if the stored content is:

`{"one": 1, "two": 2, "three": 3}`

what would be the ideal way to find this entry if, e.g., the query was `{"one": 1}`? Right now, it seems like we might have to resort to an O(n) search that deserializes every entry along the way (which will be extremely slow). I do have a solution, but we would have to force documents to follow a specific schema (which would be user-defined). This might be beneficial anyway, so it's the solution I'm leaning towards, but let me know if you think of anything else.
> would have to force documents to follow a specific schema (which would be user-defined)
I agree with this.
Ok. If we have a schema, then we can know each of the keys beforehand. Instead of a per-table channel, we'll have a per-column channel (e.g. `user.posts` would have its own channel). If we're queried with `{"one": 1}`, per the example above, we could use Discord's search on that channel to query it. The downside of this is that it will take up slightly more storage, as each message will have to contain the IDs of the other messages.

The entry structure would look like this, per channel:

`{"ids": [], "content": "..."}`

- `ids` contains the message IDs of the other parts.
- `content` contains the value only, not the key. In the example above, this would be `1`, pickled and base64 encoded.

I thought of a way that we could potentially index entries. We could have the latest message in the channel contain an index which lets us at least narrow down our search. Say we have a field named `name`; then the index has A-Z buckets, with the message IDs of all names that start with each letter stored under that letter. When entries are added, we modify the index, delete the old index message, and store the new one at the end of the channel. Then we aren't deserializing everything, but instead just a few messages. We could make an index for each field if we needed to, then consult the message that corresponds to the significant field.
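As a rough sketch of the A-Z index idea (the `entries` data and field names here are made up for illustration):

```python
from collections import defaultdict

def build_name_index(entries):
    """Bucket message IDs by the first letter of each entry's "name" field,
    mirroring the proposed latest-message index."""
    index = defaultdict(list)
    for message_id, fields in entries.items():
        index[fields["name"][0].upper()].append(message_id)
    return dict(index)

# Hypothetical in-memory view of a channel's entries.
entries = {
    111: {"name": "alice"},
    222: {"name": "bob"},
    333: {"name": "amy"},
}
index = build_name_index(entries)

# A query for a name starting with "a" only needs to deserialize the
# messages listed under "A", not the whole channel.
candidates = index.get("A", [])
```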
> Ok. If we have a schema, then we can know each of the keys beforehand. Instead of a per-table channel, we'll have a per-column channel […]
I haven't entirely fleshed this idea out yet, but we could potentially avoid storing the IDs by using Discord's message references: we store each Entry in the Table channel, but use message replies to link them all together, so all the information is connected through references. This way we can use Discord's search feature to find the entry of `{"one": 1}`, and then find all the data that is connected to it through message references. I'm not entirely sure message references work this way, but I might do some testing to see if this is applicable.
API reference link: https://discordpy.readthedocs.io/en/stable/api.html?highlight=message#discord.Message.reference
Edit: Something like this. It's not the prettiest though if we want the user to be able to access the db through the channel list.
> I thought of a way that we could potentially index entries. We could have the latest message in the channel contain an index which lets us at least narrow down our search. […]
I think we would still have to require a schema for that.
> I haven't entirely fleshed this idea out yet, but we could potentially counter storing the ID's by using discord's message references […]
I like this idea a lot! Is there a way to see what messages are referenced by another? In your example, the first message wouldn't be able to see the other parts of the data.
> I like this idea a lot! Is there a way to see what messages are referenced by another? […]
That was a limitation I've been trying to figure out from the API, but I haven't found anything yet. A potential workaround: once the message is found (for example, the first `user_id`), we check the following messages in the channel to gather the rest of the data. I'm not sure it's possible to locate the following messages based on a message's position via the API, though, and it seems like this might just cause a channel message history limit problem once the db gets large enough. I'm going to keep looking and hopefully find something that works with references; otherwise we might have to resort to the columns-as-channels idea.
> Instead of a per-table channel, we'll have a per-column channel […] each message will have to contain the IDs of the other messages. […]
Do we have to store each field in a separate message? Could it not all be stored in a single message? Then we can do much like relational databases do: each message from another table that is related to the base message just stores the channel ID and message ID of it. The channel ID may not even need to be stored.
> Could it not all be stored in a single message?
Likely not, because of querying.
> Likely not, because of querying.
What about threads? I'm not sure what the limitations are on threads, but could we utilize them to store related data?
Never mind -- threads auto-archive.
> but I'm not sure if it's possible to locate the following messages based on the location of a message in the API
I think we might be able to, and it won't be terrible on performance. If we store a separate ID (generated by us) with each message, then we can search for that ID and then query all results to finalize the fetch.

For example:

```
123 {"one": 1}
123 {"two": 2}
123 {"three": 3}
```

If we search for `123`, it will come up with all the results. (Note that we don't have to worry about messages that store `123` in the content, because it will be pickled and encoded anyway.)
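Client-side, grouping the parts of an entry by that generated ID could be sketched like this (the plain strings stand in for real message contents, and the `456` entry is made up to show filtering):

```python
def find_parts(messages, record_id):
    """Return every message whose first token matches the generated ID."""
    return [m for m in messages if m.split(" ", 1)[0] == record_id]

# Hypothetical raw channel contents.
channel = [
    '123 {"one": 1}',
    '123 {"two": 2}',
    '456 {"one": 9}',
    '123 {"three": 3}',
]

parts = find_parts(channel, "123")  # all three parts of entry 123
```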
> If we store a separate ID (generated by us) with each message, then we can search for that ID and then query all results to finalize the fetch. […]
From what I can tell there doesn't appear to be a way to search a channel for a message containing certain text with discord.py or the discord API.
Oh. That's an issue.
We'll have to make a hash table out of the channel, then. The last message will contain the length, and we'll hash IDs and turn them into indices.
If we are agreed on making a schema for each table then the last message could contain the total number of messages in the channel and the schema for the data stored.
We could also utilize the index messages that I talked about before. The ID for each index message could be stored in the schema alongside the field being indexed. Not every field would need to be indexed but we could allow users to create indexes for heavily searched fields.
How we determine bucket sizes for the index depends on the type of data, but a general A-Z for text, number ranges for integers/reals, and date ranges for dates would probably suffice for this type of project.
This would mean that we wouldn't have to keep the whole database in memory all the time and only grab it when a query is being done against non indexed data. Or we could keep it in memory still but only search it when the queried data is not indexed.
Relationships can be indexed as well: each message ID from `Table A` is stored as an indexed field in `Table B`, with a list of message IDs underneath it that reference the related message from `Table A`, like a foreign key.
Having the index messages referenced in the schema/count message means we only have to worry about keeping that one message at the end of the channel. We could potentially store the indexes in their own channel then.
We could also pin the message that contains metadata about the table. You can search for pinned messages separately.
That way we don't have to worry about constantly deleting and recreating the message to make sure it is the last one.
> If we are agreed on making a schema for each table then the last message could contain the total number of messages in the channel and the schema for the data stored. […]
I'm not fully understanding (partially because I'm sleepy), could you elaborate and add some examples?
I'll try to give an example to see if it helps clarify my thought process. I may be over complicating this and if I am then please feel free to let me know.
```
Channel (Table): bookmarked_post
---------------------------------
message_id: 83234829342342    <--- the message ID generated by Discord; not actually stored in the message
user_id: 23478234823423       <--- the ID of the user who bookmarked the post; could be the Discord user ID
post_author: rubiks14
post_content: blah blah blah
---------------------------------
message_id: 67495345034594
user_id: 76359374573494
post_author: zerointensity
post_content: blah blah blah
---------------------------------
message_id: 46136468476612
user_id: 23478234823423
post_author: enskyeing
post_content: blah blah blah
---------------------------------
message_id: 53745934053453
user_id: 92349234829342
post_author: enskyeing
post_content: blah blah blah
```
Also in the channel we have a pinned message that contains the metadata for the table:
```
---------------------------------
user_id: { data_type: bigint, relationship_message_ids: [5345734053407342] }   <--- stores the message ID of the relationship index
post_author: { data_type: string, index_message_ids: [45349053450343045] }     <--- author of the post that was bookmarked
post_content: { data_type: string, index_message_ids: [] }                     <--- not indexing message content (we'd probably need a library to properly do that)
total_entries: 4                                                               <--- how many items are stored in this table
---------------------------------
```
Now in another channel we could have the messages that contain the indexes. The indexes are used to make searching possible.
This would index all posts bookmarked by each user:

```
relationship on user_id - message_id: 5345734053407342    <--- this message ID is stored in the table metadata
---------------------------------
user_id: 23478234823423    <--- user_id of the person who bookmarked the message
message_ids: [
    83234829342342    <--- message_id containing entry data about a bookmarked post
    46136468476612
]
user_id: 76359374573494
message_ids: [
    67495345034594
]
user_id: 92349234829342
message_ids: [
    53745934053453
]
---------------------------------
```
This would index bookmarked posts by their author:

```
search_index on post_author - message_id: 45349053450343045    <--- this message ID is stored in the table metadata
---------------------------------
post_author: rubiks14
message_ids: [
    83234829342342    <--- message_id of an entry where rubiks14 is the post_author
]
post_author: zerointensity
message_ids: [
    67495345034594
]
post_author: enskyeing
message_ids: [
    53745934053453
    46136468476612
]
---------------------------------
```
When we initialize the database, we get the metadata and the index locations from the pinned metadata post. The content of each index message is then turned into dictionaries where the field value is the key and the set of message_ids is the value.
This example has the post_author as the bucket label for that particular index, but the bucket label could be something more generic and more values could be placed inside the bucket. This would require a search through the bucket, but it would be a subset of the whole dataset so the search would be faster.
So for this example, if we're looking for all bookmarked posts authored by rubiks14 and saved by user_id 23478234823423, we do a set intersection of the user_id relationship index and the post_author index.
I don't know what the character limit is on a message but if the index gets too big an additional message can be created for overflow and then attached to the metadata.
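Using the example indexes above, that intersection could be sketched as (IDs copied from the sample data):

```python
# In-memory form of the two indexes from the example messages:
# field value -> set of message IDs in the table channel.
user_index = {
    23478234823423: {83234829342342, 46136468476612},
    76359374573494: {67495345034594},
    92349234829342: {53745934053453},
}
author_index = {
    "rubiks14": {83234829342342},
    "zerointensity": {67495345034594},
    "enskyeing": {53745934053453, 46136468476612},
}

# Posts authored by rubiks14 AND bookmarked by user 23478234823423:
# only the messages present in both index buckets survive.
matches = user_index[23478234823423] & author_index["rubiks14"]
```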
I hope that makes more sense. I'm also tired so I'm not the best at explaining it right now.
Sleep does the body good. After sleeping I realize that this is me trying to optimize lookup time too early. This definitely uses more memory than storing the full table in a hash.
At least for initial testing and prototyping it would be sufficient to store the whole table in a dictionary/hash with the message id as the key and the message content as the value. Then we do an iterative search whenever we need to query specific data.
We can use discord.TextChannel.history to initialize the dictionary.
I think I'm not seeing this clearly, still. The message author isn't really useful, since it will always be the bot's ID (we won't be sending messages manually).
I'm sorry, I didn't specify the table details. This table is an example that stores messages a user has bookmarked. The bookmarked data contains the user ID of the user that bookmarked the message, the author of the original message being bookmarked, and the content of that message. The table itself could be anything; the metadata and indexes refer to the fields that the table has.
> How we determine bucket sizes for the index would depend on the type of data but a general A-Z for text […]
FWIW, all user data will be encoded in base64, meaning you only have to worry about those characters.
I'm not really a fan of using the pinned messages for "metadata"; it should just be the most recent message that stores it (and gets deleted and replaced with every update). Otherwise, the only way to differentiate between a metadata message and an actual entry is by checking if it's in the pinned list, which won't scale well at `O(n)`.
We could have a separate metadata channel for each table. Assuming each table has an ID, we could have `<table-id>-metadata` and `<table-id>-contents` as channels per table. That way the metadata never gets in the way of our table data, and we don't have to worry about fumbling with deleting and resending messages every time -- just editing in any new information. It would also allow additional metadata messages to be sent if the original message hits Discord's character limit, which IIRC is around 2000 characters.
I thought about making the metadata post the first post that is created when the table is created. That way it is always the oldest but I like the idea of doing a separate channel for metadata better.
You'll have to explain your idea to me over call, @Rubiks14. My solution with key-per-channel is as follows:
Metadata is undecided, the last message or another channel could work. Message entries will look like this (apart from indentation):
```
{
    "references": /* message ID or null */,
    "content": "base64 pickled data for 1",
    "key": "key of the data, as a double check in case of a hash collision",
    "parts": [/* message IDs of other keys, e.g. two and three if this was one */]
}
```
The key to `O(1)` lookups here is to turn this into a hash table. We take the length from the metadata, however we end up doing that, and then use a hashing algorithm on the base64 string. For Python's `hash()`, we could use the following formula to calculate an index: `(hash & 0x7fffffff) % size`. The issue with `hash()`, though, is that it's based on a random seed set at startup time. We can get around this with the `PYTHONHASHSEED` environment variable, but a fixed, publicly known seed makes our hashes predictable (and the database susceptible to hash-flooding attacks). Likely, we'll have to generate a hash seed when the database is initially created, store it in the metadata, and reuse it for every run.
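As a sketch (assuming the seed handling above), the index computation could look like this; note that without a fixed `PYTHONHASHSEED`, string hashes change between interpreter runs:

```python
# PYTHONHASHSEED must be fixed before the interpreter starts (e.g. from a
# seed stored in the metadata) for these indices to be stable across runs.

def message_index(key_b64: str, size: int) -> int:
    """Map a base64-encoded key onto one of `size` message slots."""
    # Mask to a non-negative 31-bit value, then wrap onto the channel size.
    return (hash(key_b64) & 0x7FFFFFFF) % size

# "b25l" is base64 for "one"; 16 is a hypothetical channel size.
idx = message_index("b25l", 16)
```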
> We could have a separate metadata channel for each table.
I like this idea; just seeing `metadata` in a channel as a message alongside your data doesn't feel right. The drawback is that we'd end up with double the channels if we create a new metadata channel for each table. The best option might be a single channel full of metadata -- that channel could even be a forum channel to help with organization. But I haven't exactly checked which functions we'd need on d.py to control forum channels; I think they still use thread functions, because forums are just threads IIRC.
I like the metadata channel idea as well. I don't see why we would need a metadata channel per table; a single one that contains the metadata for each channel should do fine. We don't have to think about this too much -- this can be an `O(n)` search, since the number of tables generally won't be very high.
A format like this should suffice:
```
{
    "name": "table name",
    "length": 0,
    "schema": "pickled pydantic schema"
}
```
Note that we won't actually be using the pickled schema at runtime, it's just an extra check to make sure that it matches. If it doesn't, we'll either have to run a migration or throw an error.
> My solution with key-per-channel is as follows: […] The key to O(1) lookups here is to turn this into a hash table. […]
I was just thinking about this: what happens when you update a field? We're going to have to change its location in the hash table, which means we then have to go to the other parts and update their lists of parts with the new message ID of the changed part. This is fine for a single piece of data being updated, but what if we update all parts of the entry at the same time? How do we manage the order of updates to avoid something similar to a dangling pointer?
Actually, now that I think about it, we could put the changes in some form of list and just treat them as individual changes happening in succession.
As an overview, each message represents a single "key" -- i.e., a key-value pair in a table. The key is specific to the channel. For example, if you defined the following schema for a table:
```python
class User(BaseModel):
    name: str
    password: str
```
Then we would have two channels for `User`: `User-name` and `User-password`, which would store their values according to the format below (in production this won't have indentation or newlines; that's just for convenience here):
```
{
    "content": "base64 pickled data for 1",
    "parts": [/* message IDs of other keys, e.g. two and three if this was one */]
}
```
- `content` is the value of the "data" (any arbitrary Python object that can be serialized). This is a base64 encoded string containing the dill-pickled representation of the object.
- `parts` is an array containing the message IDs of the other key-value pairs of the table. For example, if we were looking at the `name` message for our `User` schema from above, then this array would contain the message ID of the `password` value, and vice versa.

Here's a visual representation (image in the original comment).
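A minimal sketch of the `content` round-trip, using stdlib `pickle` in place of dill (the message ID below is hypothetical):

```python
import base64
import pickle  # the project plans to use dill; stdlib pickle behaves the same here

def encode_content(value) -> str:
    """Serialize a value into the base64 string stored in "content"."""
    return base64.b64encode(pickle.dumps(value)).decode("ascii")

def decode_content(content: str):
    """Recover the original Python object from a message's "content"."""
    return pickle.loads(base64.b64decode(content.encode("ascii")))

# A User-name message, with a hypothetical ID for the matching password message.
entry = {"content": encode_content("peter"), "parts": [67495345034594]}
value = decode_content(entry["content"])
```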
Here's where it gets tricky. In order to be speedy, we have to do some magic with lookups. Roughly speaking, most serializable types in Python are also hashable, so we can use that to our advantage.
We'll store the metadata in a separate channel (this format is currently undecided, but it will likely just be simple JSON data) -- this metadata will contain something important: the length. Similar to in-memory hash map implementations, we need to "preallocate" some space. We'll do this by sending a predefined number of messages to populate the channel after creation. (Note that null entries will be indicated by a message containing only `.`)
Now, we can use Python's `hash()` function to generate a hash for a table record. Hashes are "unique" to the data -- this will be important later. (As an implementation detail, we'll also have to store a cryptographically secure hash seed in the metadata mentioned previously, and then set the `PYTHONHASHSEED` environment variable.)
OK, so what do we do with a hash? A sample hash could look like this: `-8647829515386124796`. Now, we need to somehow turn this into an index. We can do that via this formula: `(hash & 0x7fffffff) % size`. If we had 16 messages in the channel, then this becomes `4`. That's an index, great! We can edit out the `.` from the message there, and put in the table entry format from above.
As a refresher, here's how basic storing works (image in the original comment).
Hash collisions are basically inevitable, especially on low-size tables (or channels, in this context). If we try to store something, and the calculated index is not `.`, then we either have a hash collision, or these two keys have the same value -- we don't really care which right now.
The solution is to simply iterate until we find an open index (visualized with the previous representation in an image in the original comment).
We follow the exact same formula as storing, but we do have to deserialize along the way to check for equality. For example, with the above image, if the requested query was `peter`, then we would compute the index, and then iterate until we find the entry that is `peter` (or an empty slot).
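Both storing and lookup can be sketched as linear probing over an in-memory stand-in for the channel, where a `None` slot plays the role of a `.` placeholder message:

```python
def slot_index(key, size):
    return (hash(key) & 0x7FFFFFFF) % size

def probe_store(slots, key, value):
    """Place (key, value) at its hashed slot, probing past collisions.
    Assumes the table always has at least one free slot."""
    i = slot_index(key, len(slots))
    while slots[i] is not None and slots[i][0] != key:
        i = (i + 1) % len(slots)  # collision: try the next message
    slots[i] = (key, value)

def probe_lookup(slots, key):
    i = slot_index(key, len(slots))
    for _ in range(len(slots)):
        if slots[i] is None:
            return None            # hit a "." message: key isn't stored
        if slots[i][0] == key:
            return slots[i][1]     # keys match after the deserialize step
        i = (i + 1) % len(slots)
    return None

slots = [None] * 16  # a freshly "preallocated" 16-message channel
probe_store(slots, "peter", {"one": 1})
probe_store(slots, "mary", {"two": 2})
found = probe_lookup(slots, "peter")
```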
We'll have to store an additional field in the metadata that contains the actual number of messages stored. For example, we might have 16 messages (called the "size", in technical terms), but only 4 of them are non-null (called the "length"). When the length reaches the size, we need to resize the table. Generally, we'll want to do this by doubling the size of it (e.g. 2 -> 4, 4 -> 8, 8 -> 16, and so on.)
Unfortunately, every time we resize, we have to rewrite every message, because each entry's index changes when the size changes.
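A sketch of why resizing forces a full rewrite: every entry has to be re-placed against the doubled size, which generally changes its index:

```python
def slot_index(key, size):
    return (hash(key) & 0x7FFFFFFF) % size

def place(slots, key, value):
    i = slot_index(key, len(slots))
    while slots[i] is not None:
        i = (i + 1) % len(slots)  # linear probe past collisions
    slots[i] = (key, value)

def resized(slots):
    # Doubling invalidates every index, so each entry is re-placed
    # (i.e. every message would be rewritten in the channel).
    bigger = [None] * (len(slots) * 2)
    for slot in slots:
        if slot is not None:
            place(bigger, *slot)
    return bigger

slots = [None] * 2
place(slots, "a", 1)
place(slots, "b", 2)    # length == size: time to resize
slots = resized(slots)  # 2 -> 4 slots, both entries rehashed
```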
Ratelimits are an issue here, because Discord only allows us up to 50 requests per second. How are we supposed to update a table that has a thousand entries in a timely manner? We'll do this by storing ratelimited changes in-memory, and pretending as if it really made the changes. In the background, we'll be waiting for the changes to actually be made, and then we'll remove them from memory. This may cause memory problems on very large databases, but there really isn't any way around this. It might be worth introducing some Memray unit tests, but that would be a finishing touch.
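The write-behind idea could be sketched like this; `send` is a hypothetical coroutine standing in for the real (ratelimited) Discord API call:

```python
import asyncio

class WriteBehind:
    """Pretend writes happened instantly; apply them to Discord in the
    background at a ratelimit-friendly pace."""

    def __init__(self, send, per_second=40):
        self._send = send          # coroutine performing the real API call
        self._delay = 1 / per_second
        self._pending = []

    def write(self, change):
        # Instant from the caller's perspective; the change only lives
        # in memory until the background flush catches up.
        self._pending.append(change)

    async def flush(self):
        while self._pending:
            change = self._pending.pop(0)
            await self._send(change)          # the actual (slow) request
            await asyncio.sleep(self._delay)  # stay under the ratelimit

sent = []
async def fake_send(change):
    sent.append(change)

cache = WriteBehind(fake_send, per_second=1000)
cache.write({"id": 1})
cache.write({"id": 2})
asyncio.run(cache.flush())
```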
Please let me know if you have any questions!
So far, we've only speculatively thought about how we're going to actually store our data. We need to set this in stone to write an actual implementation.
Storage Hierarchy
Data will be stored in the following hierarchy, based on #3 (highest to lowest level):
- The `Database` class in the current prototype. In theory, all a bot will need to operate is a token and server ID. This should be "independent," as in the same bot can be run on different servers using multiple `Database` instances.
- The `Table` class in the current prototype. Each table should be, more or less, independent from one another. For example, there could be a table/category for `users`, and then another for `posts`.
- An `Entry`. This will be stored as JSON, perhaps compressed.

Entry Format
Every message/database entry should simply be (possibly compressed) URL-safe base64 pickled data (though, done with dill, not pickle).
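A sketch of that entry format, with stdlib `pickle` standing in for dill and `zlib` for the optional compression:

```python
import base64
import pickle  # stand-in for dill in this sketch
import zlib

def pack(obj) -> str:
    """Pickle, compress, and URL-safe base64 encode an entry."""
    return base64.urlsafe_b64encode(zlib.compress(pickle.dumps(obj))).decode("ascii")

def unpack(text: str):
    """Reverse pack(): decode, decompress, and unpickle."""
    return pickle.loads(zlib.decompress(base64.urlsafe_b64decode(text)))

packed = pack({"one": 1, "two": 2, "three": 3})
restored = unpack(packed)  # round-trips back to the original dict
```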