Arke: New storage system for Red, relational databases and ORM!

Hey, I'm writing this issue after a conversation in #advanced-coding where I initially suggested the following change.

Config is great, allows anyone to store stuff easily within Red, but it has a lot of limitations too. After playing a lot with django recently, I kept thinking about integrating an ORM in Red, for cog developers, as an additional tool, which I will refer to as "Arke"

What will Arke be?

Arke will be like Config : providing tools to cog creators for storing data without worrying about creating files or setting up a database. However, the stored data can be relational, which means no more dicts and hierarchy, only models and relations!

The models will be defined using an ORM, and Red will handle the rest (registering the models, preventing conflicts, handling the database based on the user's choice).

What will Arke not be?

It is NOT a replacement for Config, it is an optional addition. Anyone should be free of using Config or Arke.

Relational models? What is that?

Unlike Config, the storing schema will not be dictionaries, but relational models. You have to describe what a model should look like, and what it should be linked to.

Dictionaries are limited to two relations: parent and children, while relational models have no such limit, and do not need to worry about hierarchy. Take the following example :

class ModLog:
    id: int  # this will be the primary key
    # basic fields
    reason: str
    time: datetime.datetime
    type: Literal["warn", "mute", "kick", "ban"]

    # relational fields
    guild: discord.Guild
    member: discord.User
    moderator: discord.User

With a model like that, you can easily query and filter all modlogs for a guild, a member, or a moderator. From that modlog object, you can access the other linked attributes back.

>>> all_modlogs = ModLog.objects.all()
>>> guild_modlogs = ModLog.objects.filter(guild=red)
>>> all_reasons = ModLog.objects.column(reason)

>>> my_modlogs = ModLog.objects.filter(member__id=348415857728159745)
>>> first_modlog = my_modlogs[0]
>>> first_modlog.moderator
<discord.Member(Kowlin)>

(This is absolutely not valid Python code, just showing examples)

All of the fields described above are columns, which means you can for example query all reason objects of modlogs, without querying the other stuff. That kind of operations make everything not only more optimised for both time and storage space, it also makes programming a lot more convenient.

What is an ORM?

ORM means Object relational mapping.

Databases like that need SQL operations, and it's annoying to deal with, but thanks to ORMs, you can simply build an object in a language, and it will be translated to SQL queries.

There are multiple ORMs for Python, the two most popular ones are SQLAlchemy and Django ORM.

I personally find the latter easier to use, but there's no standalone version, plus SQLAlchemy has some great features (support for more dbs, async operations, advanced SQL support...), so I guess we'll use that.

Let me show some examples from their docs:

Creating models

class User(Base):
    __tablename__ = 'user_account'

    id = Column(Integer, primary_key=True)
    name = Column(String(30))
    fullname = Column(String)

    addresses = relationship("Address", back_populates="user")

    def __repr__(self):
       return f"User(id={self.id!r}, name={self.name!r}, fullname={self.fullname!r})"

class Address(Base):
    __tablename__ = 'address'

    id = Column(Integer, primary_key=True)
    email_address = Column(String, nullable=False)
    user_id = Column(Integer, ForeignKey('user_account.id'))

    user = relationship("User", back_populates="addresses")

    def __repr__(self):
        return f"Address(id={self.id!r}, email_address={self.email_address!r})"

Manipulating data

# Taking the first user
>>> session.scalars(select(User)).first()
User(id=1, name='spongebob', fullname='Spongebob Squarepants')

# Taking only some keys from the first user
>>> session.execute(select(User.name, User.fullname)).first()
('spongebob', 'Spongebob Squarepants')

# Picking a specific object
>>> session.execute(select(User).where(User.name == 'spongebob'))
User(id=1, name='spongebob', fullname='Spongebob Squarepants')

# Progressively fetch users ordered by ID
>>> result = session.execute(select(User).order_by(User.id))
>>> result.fetchone()
(User(id=1, name='spongebob', fullname='Spongebob Squarepants'),)
>>> result.scalars().all()
[User(id=2, name='sandy', fullname='Sandy Cheeks'),
 User(id=3, name='patrick', fullname='Patrick Star'),
 User(id=4, name='squidward', fullname='Squidward Tentacles'),
 User(id=5, name='ehkrabs', fullname='Eugene H. Krabs')]

Of course this needs more explanations, this is just a quick overview of what you can do. I will write detailed and comprehensive docs for this module.

What will this look like in Red?

Just showing some draft ideas, nothing is done yet, I'm giving stuff I have in mind and I'm very open to suggestions.

Developer side

I need to play more with SQLAlchemy before being sure of how this is going to work, but ideally, I want something that will be very similar to config :

When loading the cog, you initialise an Arke object, which will create a unique database for your cog, so you're free to do all the tables you want.
The Arke object will be used for running the queries. That, or do it the django way and only rely on the object type (which means typevars for us)
The cog's metadata should tell if it is using Arke or not, and eventually some recommendations on the database to use (for example, postgresql supports fuzzy search, and SQLAlchemy supports db-specific operations to allow this, or fall back to the default).

User side

There are multiple ways of doing this, and I want it to be as seamless as possible. Still unsure on what to do so please comment on this 👀

Here's something I had in mind:

When setting up a new instance, the user can add the --with-arke flag, which will add another step to the process: asking what db to use for Arke-compatible cogs. The default option should be SQLite3.
If an instance is already setup, the Arke setup applied on top of the existing instance with --setup-arke?
When the user is installing a cog that uses Arke while it is not setup, there should be a message explaining that Arke needs to be initialised on this instance, using the step above.

That, or simply setup Arke on all instances by default, but for something optional, not sure if it's a great thing.

A quick FAQ

Why "Arke"

Had to come up with a name other than BetterConfig :kappa:

Searched for a nice name, and found Arke from the Greek mythology. During the Titanomachy, she was a messenger from the Titans, sending messages to the Olympian gods. You can read more about her here.

Since the goal of the application is to provide an interface to cog developers, and then handle saving the data on our side, I think that's a cool and fitting name :D

Why should this be part of core Red?

Cogs have already been using databases other than Config like leveler, without any tool like that from Red. However, learning how to configure a database, connect to it and send queries is hard. Even harder for the users who have a hard time setting up that database themselves (see Fixator's channel 👀).

On the other hand, Config is great: support for different drivers, easy interface for all cog devs, and easy to use for the end user. I want to bring that to relational databases too.

If this module is independent from Red, it's not going to change a thing: each cog will have its own database system, without any coordination. In addition, new devs will simply use the existing system, and not bother adding a whole layer to use a specific database.

I never wanted myself to learn SQL, JSON is way easier to use. However, the way django works with its ORM made everything so much easier, made me love relational database schemas, and now I'm having a hard time using JSON. I think if the tool is there, ready to use, more developers will want to try it.

Wouldn't that make the setup process harder for end users?

Sure, setting up a database can be hard: installing additional programs, setting up hosts and passwords, making sure the connection stays alive... except for SQLite!

Config has complex driver options which requires that kind of setup too, but there's always one easy and straightforward option: JSON. Well in our case, that easy and straightforward option is SQLite3, it's just a text file, doesn't require any server! Sure you will be losing performance, but for small sized bots, that will be more than enough. Large bots have already switched to Redis or other drivers than JSON.

What are the other advantages of using a db

Omitting the ORM, tables and relational stuff here.

It's blazing fast, and storage efficient
Migrations! No need to hand write your JSON anymore, they can be automatically builded for everyone!

I think I explained everything I had in mind. As you can see, it's just ideas in my head, but they're pretty clear. I will start working on some prototypes very soon, and want to hear your feedback!

Update (less than 30 min after yes): Of course we're not limited to SQLAlchemy as an ORM! Just talked to Preda which mentioned piccolo ORM, and that looks a lot better! This issue is for sorting the main points, including this one

Hey, first real update, I looked at multiple Python ORMs and wanted to present a list of potential candidates so we could discuss about it, but there are very few solutions that fit our requirements. Most notably, the following ORMs won't fit for Arke:

peewee, SQLObject, PonyORM: No async support (among other issues for some)
SQLAlchemy: Insanely slow, hard to use and unfamiliar syntax, poor async support
GINO: It's basically SQLAlchemy with better async support, but still the unfamiliar syntax
peewee-async: Only supports postgresql, seems new and incomplete yet
django: ORM is deeply integrated to other Django components, so either we add django in its entirety as a new dependency (no), or we use unstable projects that try to make it standalone (also no because we'd have to combine that to other extensions for async support)
piccolo: Looks super nice, recommended by Preda and Draper, BUT there's no (or hazardous) SQLite support, which I consider critical for Arke
ormar: That one was very hard to cross out (and would like to discuss about it if you have more info). Has everything we need (async, django-like models, modern), but it's still in development and seems to go through multiple breaking changes.

That leaves us to Tortoise ORM: it's async, its models are comprehensive and intuitive (django style), it's active, frequently updated, has support for SQLite, postgresql, MySQL, MariaDB, and most importantly, it's fast, like really fast compared to other ORMs:

See benchmarks here.

Examples

So I think I'll go for that! I'm really open to discussion on this subject (or everything about arke), for now I will mostly play with the tools, see what's possible and think about the design before doing actual work in the repo.

Therefore, time to throw more examples, but this time using Tortoise specifically.

Simple model: adventure player

This is a small example of a model without relations. ### Writing the model ```py import enums from tortoise import fields from tortoise.models import Model class PlayerClass(Enum): wizard = 0 bard = 1 ranger = 2 class Player(Model): name = fields.CharField() health = fields.IntegerField() level = fields.IntegerField(default=1) class = fields.CharEnumField(null=True) # can be empty ``` ### Creating players ```py >>> p1 = Player(name="Laggron", health=100) >>> await p.save() # or directly >>> await Player.create(name="aika", health=200, class=PlayerClass.bard) ``` ### Accessing players ```py >>> await Player.all() [, ] >>> await Player.get(name="aika") >>> await Player.filter(health__lte=150) # players with health <= 150 [] ``` ### Reading or setting attributes ```py >>> aika = await Player.get(name="aika") >>> aika.health 200 >>> aika.health -= 40 >>> await aika.save() ```

Relations: ForeignKey

Let's see the most basic relation, the foreign key (or OneToMany). Take for example a ticket system. There are tickets and members, members create tickets. ### Writing models ```py from tortoise import fields from tortoise.models import Model class Member(Model): discord_id = fields.IntField() guild_id = fields.IntField() class Ticket(Model): time = fields.DateTimeField(auto_now_add=True) # will be automatically set to utcnow() on creation message = fields.CharField() member = fields.ForeignKey(Member, on_delete=fields.CASCADE) ``` Each ticket must be linked to **one** member, a member can have **many** tickets (OneToMany). `on_delete=fields.CASCADE` means that when the member object is destroyed, all of his tickets are also destroyed. ### Creating the objects ```py >>> m = await Member.create(discord_id=260, guild_id=42) >>> await Ticket.create(message="plz add martine", member=m) ``` ### Accessing objects ```py >>> m = await Member.get(discord_id=260) >>> t = await m.tickets.first() >>> t.member == m # reverse access True >>> all = await Ticket.filter(member__discord_id=260) # all tickets from one member [] ```

Relations: ManyToMany

Another type of relation is ManyToMany: An object A can be linked to multiple objects B, and objects B can also be linked to multiple objects A. A side note: Config has built-in relations for the most common models, Arke could have the same and provide pre-built models? (Would also make GDPR easier.) Let's suppose that for the following example. ```py from tortoise import fields from tortoise.models import Model from redbot.core.arke.models import Member, TextChannel class Event(Model): channel = fields.ForeignKey(TextChannel, on_delete=models.CASCADE) admins = fields.ManyToMany(Member) # an event can have many admins, an admin can have many events description = fields.CharField() start_time = fields.DateTimeField() ``` Then, when creating objects: ```py >>> event = await Event.create(channel=ctx.channel, description="Fall Guys party", start_time=some_datetime_object) # notice how admins is omitted, it's initially empty >>> await event.add(ctx.author, guild.get_member(some_id)) # adding two members as admins >>> async for admin in event.admins(): ... print(admin.name) Kowlin JennJenn ```

Relations: OneToOne

Last relation, you probably guessed it, OneToOne relations link two unique objects together. ```py from tortoise import fields from tortoise.models import Model from redbot.core.arke.models import Role class AutoRole(Model): role = fields.OneToOneField(Role, on_delete=models.CASCADE) command_name = fields.CharField(null=True) button_id = fields.IntField(null=True) reaction_id = fields.IntField(null=True) ``` That way, a role cannot be linked to multiple auto roles.

Will write a lot of docs, with better examples to make it easy for beginners!

Integration with Red

Initializing Tortoise is quite easy, and most importantly doesn't require an instance of something:

from tortoise import Tortoise

await Tortoise.init(
    db_url='sqlite://db.sqlite3',
    modules={'models': ['app.models']}
)
await Tortoise.generate_schemas()

This means there is no instance to pass around like Config. There is still stuff to do on our end, which will likely need additional lines on cog start to register the models and generate the schemas.

Something I'm not sure of is how to handle different cogs. Of course you can't put everything in the same database, there will be conflicts everywhere, so basically there are two solutions:

Add a prefix on table names (something like Adventure__weapons). Allows relations between cogs' models if specified, and unique Discord models for everything
Create a unique database for each cog. This makes it impossible to have relations from one cog to another, but at least isolates everything and prevents collateral damages.
- In addition, we could allow cogs to have specific requirements for backends. Suppose you're recoding Leveler and need a super efficient backend, you don't want SQLite because postgresql has some sweet features, then you could ask for this requirement, and Red will prompt the user to install the dependencies and setup the db.

So, yeah, think ideas are starting to get more precise, will start working on stuff, show it in #testing and ask for feedback, don't hesitate to leave some here too!

I had this exact same idea myself, but I've been told that it wouldn't be a great idea to integrate it right into Red itself (I cannot hand out the conversation nor remember much of what I was told), but rather as a dependency for cogs. This is my way of thinking too, I see Config as enough/necessary for Red & devs. Introducing such a framework (And I'd make the difference between a library and a framework) into Red directly, is not what I think is a great idea.

Ok, now let me tell you I love the idea (Most especially since I had myself), I'd like to make it a reality too for devs, but I think you've been more into the developer-side problem-solving rather than the user-side impact that may come with it, so I'll try to explain a bit more below. Feel free to contact me on Discord (Capt. Pred#0495) after reading my comment if you'd like, I'd love to have a conversation with you if you have anything to say that doesn't require a full comment here. :)

Why Config and not Arke into Red?

I see Config as part of Red because Config is a direct relation with the raw data and the cog, which is, well, extremely needed for creating cogs in Red today, anyhow, I think I don't need to debate on why Config is needed in Red's core.

Now, I think that Arke shouldn't be part of Red for some reasons, and to make it easier to read, I'll make a list:

You're forcing users to use Arke, because if one cog they want to install does require Arke, they have to install Arke.
Users don't really care about how you ship your product, they care that it works right away (That's a general sentence, not just with Red, and I'm not criticizing the work you're trying to add), so having another installation step for users might not be the right idea.
Users may not want to migrate to a relational DB for many cases.
If users decide to go for PSQL/MySQL/MariaDB, they need to know how to manage it, and I'm not quite sure everyone knows how to do that (I still hate it myself...).
~~SQLite3 is mostly made for development purposes IMO, but I guess it's fine for little bots like Red instances.~~

In top of all of these, I personally think it doesn't fit Red's core itself, but I don't really have words to describe it, sadly.

How I approached the idea myself

To refer to "my ORM framework", I'll give it the name of "RORM" (As a short of RedORM, a temporary name I had)

When doing research on how to integrate an ORM-like experience for developers, here's what I ended up with:

RORM is a supplement of Config, it uses config, and does not ship anything else (DB) with it to store data.
RORM may not support relational data (At least, not at first when creating the framework) due to how Config works.
RORM will probably not support foreign key either.

In the end, I've found out that Pydantic might be what I was looking for (Using ORM mode), since there's no apparent other library like Pydantic, AND that it seems it would support Config properly, in top of providing a great type hinting + clean code for the end developer.

As of today, I still have made no code regarding RORM, and I am not planning to code it anymore, so I'm sharing the knowledge I was able to gather so far, so I could help you too.

I hope my comment will help you to have more guidance and understand my thought on your idea, hoping you'll find it instructive. I'd love to hear other's thoughts too here, as I've mainly had thought from 2 or 3 core developers.

Personal inspiration

I mostly had that idea myself when I was working with AdonisJS with it's ORM, it might be interesting to see how that one works by itself, even if that's a whole different language.

Hey, thanks for your comment! While I understand your view, I think there are solutions to make Arke fit well into Red. Let me respond to your points.

You're forcing users to use Arke, because if one cog they want to install does require Arke, they have to install Arke.

Cog creators are free to enforce whatever they want for they cog. Install Leveler from fixator, you need MongoDB. Install notsobot from Trusty, you need to setup imagemagick. I'm not sure why a requirement is a problem for cogs? Especially when the setup will be built into Red, the only thing they'll have to do is another pip install.

Users don't really care about how you ship your product, they care that it works right away (That's a general sentence, not just with Red, and I'm not criticizing the work you're trying to add), so having another installation step for users might not be the right idea.

I agree, but once again, that installation step is just a pip install. Maybe in the future, if arke is widely used, its requirements will be installed by default. Also, I think users will enjoy the speed performance, and being able to install complex cogs without having to setup a database themselves is also valuable.

Users may not want to migrate to a relational DB for many cases.

Not sure what you mean here. What migrations? The developer should handle them, and more precisely using the tool provided by arke for generating migrations (like django does). Migration from config to arke is also on the dev.

If you meant end users having to use a relational DB, also not sure why it's a problem. Just like everything else, they won't even notice what happens under the hood and have a program that works out of the box.

If users decide to go for PSQL/MySQL/MariaDB, they need to know how to manage it, and I'm not quite sure everyone knows how to do that (I still hate it myself...).

If users decide to go for postgresql for Config, they need to know how to manage it. How is that a change? SQLite's config on the other hand is seamless, just like bare JSON files.

About your idea, well the first con is that it's based on Config. Basically, you're just bringing oriented object programming to config, without the benefits of relations or the speed increase.

Arke is not just about bringing an ORM, it's about setting up the environment around it to make it easy to use for both developers and users. If I wanted to use an ORM just for my cog, I would have just used Tortoise. On the other hand, bringing it to Red with an easy interface encourages devs to look into it. Not only will they learn interesting stuff, they will also write more efficient cogs. It's also easier, more intuitive to use and has better readability. If it wasn't for Django, I would never have used an ORM, and I want to bring the same chance to Red devs, especially since a database is not a light requirement in a standalone cog.

I am -1 on this proposal as it is currently written for the following reasons:

The name "Arke", while nice and pretty, does not do a good job of communicating what this system actually does. When a user is told to use Red's "Config", it is pretty clear what sort of stuff they are going to see. If a user is told to use Red's "Arke", they will have no idea what that means without additional explanation. I understand the Greek meaning, however most people do not have the background to understand the connection at first glance. I agree BetterConfig is not a great name, but there is probably a name for it which still adequately describes what it does (RelationalConfig?)
I don't see any reason why putting this in core is helpful. Like Pred said, some kind of basic data storage is necessary for Red as it needs to be able to store data in an easy and safe way for core reasons as well as for Cog Creators without experience working with JSON files safely. However, once a Cog Creator has gotten to the point of needing a relational db for performance or structural reasons, I don't believe they need to be hand held by a core system anymore. I believe that using a system like config which bundles together lots of DB backends in a not entirely efficient way is not solving the actual problem that needs to be solved for those cases where performance or structure is truly important to the function of the cog. Currently, if a Cog Creator believes that Config is too restrictive or too underperforming for their use case, they are able to create their own custom data storage system (even choosing to use a relational db like SQLite or Postgres) and use it to its full potential. While Config meets the needs of the very low impact and simple cases it is designed to handle, I don't think this system is able to meet the needs of the kinds of cases it is designed for.
It is another thing that needs to be maintained in Core. Going back to what I said earlier, if a Cog Creator is using a relational db, they are likely doing advanced things which require they understand how a relational db works and how to use it. They should be knowledgeable enough to know how to run and manage the db they are using. With that said, I don't think the benefit of making it easier to use outweighs the effort it will take in Core to create, maintain, and support this feature. Config is important because pretty much any cog needs some kind of persistent storage, but only a small percentage of cogs choose to use a db system outside of Config. I do not think something like this is necessary or worth having in Core.

However, learning how to configure a database, connect to it and send queries is hard. Even harder for the users who have a hard time setting up that database themselves

In this particular case, I believe it is the Cog Creator's job to decide the benefits and drawbacks of choosing a particular db. If they decide they need more performance than something simple like SQLite can provide, to the point where requiring users to set up a db is something they are OK with, then that is what they can do under the current system. The proposed system would remove the ability to make this decision from the Cog Creator (assuming they want to or choose to use this system) and would as such render this system unusable for anyone who needs more power than the lowest denominator SQLite can provide (similar to how config is rendered unusable for anything that needs more than its lowest common denominator JSON can provide). If even less Cog Creators would have a use for this system, it makes even less sense to add it to Core.

That being said, I am only one person. If the other Org Members believe this is a good idea, it can still be added. My opinion is also able to be changed if these points are addressed in a way that satisfies my concerns. I did not feel my concerns were satisfied in the conversation in #advanced-coding, however I may have failed to properly articulate them.

That, or simply setup Arke on all instances by default, but for something optional, not sure if it's a great thing.

If this is something added to core, then it should be treated identical to Config in the setup process. It should have one question asking what backend to use, with the recommended default being SQLite. If the SQLite version of this is as seamless for users to use as JSON is for config, then there is no reason to hide this behind a flag and complicate usage in cogs.

It's a +1 from me, and definitely needed in core.

Addressing @Flame442 concern from my perspective of what I understand Arke will likely look like:

It's an ORM, similar to django models. As shown by the initial proposal, we're aren't dealing with raw sql queries.
It's ideal for storing lists of dicts (or lists of anything really), which config currently is very bad at.
This may later facilitate rewrite of core cogs like alias, customcom to use arke, instead of config. Cause alias lists at scale (And I know red isn't meant for bigger bots, but still) with linear search O(n) takes quite a while, while ORMs use heapsearch(atleast definitely not linear search) speeds are faster in magnitudes.
It's not just for the cog creators, It's for core. Cleaner, smaller and efficient code base at areas. Example: alias code the whole file can be reduced to a simple datamodel + one line queries. AliasEntry becomes the model itself, 0 serializations. and this part as mentioned in the previous point, can be sped up. There's no need to maintain a manual cache internally.
Deeper nested accurate query searches are bonus points, saving IO hits.
Another Config issue is that there's no way to specify types under nested attributes, which is easily solved in config. My linter can be at peace at last.
Also migrations are automated!. Changing each value in every guild config, becomes so easy
The benefits are too good to pass on from not using it in core and 3rd party.

Config as it's meant, solves the problem of basic setting of key values and things necessary to restore the state of the bot. Changing a single value in a huge list, causes to rewrite the whole list again in the case of config, which didn't sit well with me. IO bottlenecks111.

Arke solves the need of incremental storage, when data of similar types are grouped and edits happens frequently. Arke can't survive alone, config is still needed for simple key value pair settings. both can glue, overcoming each other's deficiencies.

Ofc, similar to config, the learning curve is a little steep for those who have no idea about ORMs, but again the goodies later are too lucrative imo. Totally optional to be used by CCs, same as how initally when config was made, people were still using dataIO and manual storage into json files in cog directory.

I'm going to say +1 myself because it does indeed sounds intersting and would definitely use this myself.

But i do have to agree with one part from flame here.

The name Arke does indeed sounds pretty but it would require an additional explanation of what arke really means when we tell them to use arke and they going "i'm so confused, what does that mean?". it's going to confuse the user/cog developer of what it means. We're all used to the word Config. A different name that wouldn't require alot additional explanation would've been better to this. I see flame already added one example word RelationalConfig, it doesn't sounds bad for this but would still need some additional explanation because user/cog developer may ask "what is different between RelationalConfig and Config?".

Cog-Creators / Red-DiscordBot