Concept: Cache module - Githubissues

Bastian commented 3 years ago

I've already suggested a design for an external cache module in https://github.com/Javacord/Javacord/issues/454#issuecomment-832957623. However, since this suggestion was not well received because of the immutability, this issue proposes a mutable external cache that also allows sharing data between multiple shards and allows for different data stores like Redis.

Similar to the immutable cache, the mutable cache holds simple immutable and serializable POJOs that do not contain references to other entities except by their ids.

Since the module is independent of Javacord, we can use another JVM language like Kotlin to reduce the necessary boilerplate.

Storing data objects

The DataEntity interface that every data object has to extend:

interface DataEntity {
    val indices: Array<Index>
}

data class Index(
    val name: String,
    val value: String
)

The data entities itself

@Serializable
data class ServerData(
    val id: Long,
    val name: String,
    // More fields ...
): DataEntity {
    override val indices: Array<Index>
        get() = arrayOf(Index("ID", id.toString()))
}

@Serializable
data class ServerTextChannelData(
    val id: Long,
    val name: String,
    val serverId: Long,
    // More fields ...
): DataEntity {
    override val indices: Array<Index>
        get() = arrayOf(
            Index("ID", id.toString()),
            Index("SERVER", serverId.toString()),
        )
}

A DataStore that hold the data entities and manages their indices

interface DataStore {
    fun insert(value: DataEntity)
    fun get(clazz: Class<out DataEntity>, indexName: String, indexValue: String): List<DataEntity>
    fun delete(clazz: Class<out DataEntity>, indexName: String, indexValue: String)
}

The implementation of the data store is extremely simple and can look like this for a classic in-memory store:

class InMemoryDataStore: DataStore {

    val entities = HashMap<String, MutableList<DataEntity>>()
    val reverseIndex = HashMap<DataEntity, Array<String>>()

    override fun insert(value: DataEntity) {
        val indexKeys = value.indices.map {
            value::class.simpleName + "-" + it.name + "-" + it.value
        }.toTypedArray()
        indexKeys.forEach {
            val list = entities[it] ?: mutableListOf()
            list.add(value)
        }
        reverseIndex[value] = indexKeys
    }

    override fun get(clazz: Class<out DataEntity>, indexName: String, indexValue: String): List<DataEntity> {
        return entities[clazz.simpleName + "-" + indexName + "-" + indexValue] ?: listOf()
    }

    override fun delete(clazz: Class<out DataEntity>, indexName: String, indexValue: String) {
        val results = get(clazz, indexName, indexValue)
        results.forEach { entity ->
            val indexKeys = reverseIndex[entity]
            reverseIndex.remove(entity)
            indexKeys?.forEach {
                val list = entities[it] ?: mutableListOf()
                list.remove(entity)
                if (list.isEmpty()) {
                    entities.remove(it)
                }
            }
        }

    }
}

Since all entities can be serialized (e.g. to JSON), it's easy to create other data stores like Redis. A mixture of different data stores for different data classes (e.g. only users/members in Redis) is also possible and easy to implement.

Using the data objects

We don't want to expose the data objects themselves but keep our current API. The current entities can be constructed as needed:

public class Server {

    private final DataStore dataStore;
    private ServerData data;

    public Server(DataStore dataStore, ServerData data) {
        this.dataStore = dataStore;
        this.data = data;
    }

    public String getName() {
        refreshData(data.getId());
        return data.getName();
    }

    public List<ServerTextChannel> getTextChannels() {
        return dataStore.get(ServerTextChannelData.class, "SERVER_ID", String.valueOf(data.getId()))
            .stream()
            .map(dataEntity -> new ServerTextChannel(dataStore, (ServerTextChannelData) dataEntity))
            .collect(Collectors.toList());
    }

    private void refreshData(long id) {
        List<DataEntity> entities = dataStore.get(
            ServerData.class, "ID", String.valueOf(id)
        );
        if (!entities.isEmpty()) {
            data = (ServerData) entities.get(0);
        }
    }

}

public class ServerTextChannel {

    private final DataStore dataStore;
    private ServerTextChannelData data;
    private Server server;

    public ServerTextChannel(DataStore dataStore, ServerTextChannelData data) {
        this.dataStore = dataStore;
        this.data = data;
        this.server = new Server(dataStore, getServerData());
    }

    public String getName() {
        refreshData(data.getId());
        return data.getName();
    }

    public Server getServer() {
        return this.server;
    }

    private ServerData getServerData() {
        List<DataEntity> entities = dataStore.get(
            ServerData.class, "ID", String.valueOf(data.getServerId())
        );
        if (entities.isEmpty()) {
            return null;
        }
        return (ServerData) entities.get(0);
    }

    private void refreshData(long id) {
        List<DataEntity> entities = dataStore.get(
            ServerTextChannelData.class, "ID", String.valueOf(id)
        );
        if (!entities.isEmpty()) {
            data = (ServerTextChannelData) entities.get(0);
        }
    }

}

The two important differences to the previously suggested immutable cache are, that

Children keep references to their parents. This is necessary because elements might no longer be in the data store when methods like #getServer() are called. The other direction is no problem. For example, the deletion of a server would only result in an empty list when #getChannels() is called.
Before returning data for method calls like #getName(), we refresh the cache. This makes sure that we always return the latest data. For data that cannot change (e.g. the id), a refresh is not necessary.

Updating the cache

Updating the cache is very straightforward:

Delete the old entity from the cache
Insert the new one

Since entities do not store references to other objects, it should be possible to simply map Discord's gateway entities to the data objects and replace them in our cache with very little effort. E.g. for a channel creation or deletion, we can simply add/remove the channel from the cache without having to update any references in the server object (like we currently have to).

Considerations

The code above only exists to visualize the concept. There is a lot of space for improvements, e.g. to simplify the code and get rid of some race-conditions
Always refreshing the cache can be expensive when using any other data store than the in-memory one (traffic + deserialization). There are multiple ways to compensate for this. One option would be to not do the refresh automatically when using a non-in-memory cache but provide a manual #refresh() method that can be called. Another option that would not require any actions from the user is by introducing a caching layer in the store implementation that keeps frequently used entities in local memory.
Keeping references to entities is still a bad practice. However, the effects are now different. During normal operation, keeping entities for an extended period is now slightly worse because we regularly create new entities and do not re-use old ones. However, entities can now "survive" reconnects and keeping "parent" instances (e.g. servers) is much cheaper since they do not have references to all their "children" (e.g., channels and members).
Hybrid caches are possible and easy to implement (e.g., a cache that only stores members and users in an external database and the remaining ones in-memory).
Indices are simply created by concatenating entity name, key name, and key with a - in-between. This is probably secure (since we control both the entity name and the key name) but it might be worth giving it a second thought.
Using external databases (especially non-in-memory ones like MySQL) would result in many methods to become blocking without returning a CompletableFuture. I believe that this is an acceptable compromise. Especially, if you consider that in practice, you don't want every entity to be in a non-memory cache but probably only users and maybe messages. Users that require an external cache should be able to deal with these trade-offs.
This implementation would hardly change anything for the user-facing API. This would mainly be an internal change + the possibility to store entities in external databases
It would be possible to gradually migrate to the new cache
It would be possible to share entities across shards.
The entities (i.e., Server and ServerTextChannel) in the examples do not use interfaces. This would not be the case in the final implementation, but I wanted to keep the examples simple for demonstration purposes
Many things are not covered by the example implementation above (e.g. what happens on reconnects), but I don't think there's anything that is a deal-breaker that prevents this concept from working

Bastian commented 3 years ago

From https://github.com/discord/discord-api-docs/issues/2184#issuecomment-816353472:

[...] It is not feasible to serve the entire member list of every guild to bots, and as guild sizes continue to grow it is unlikely we will support this operation over gateway forever. If you haven't already, I would recommend taking steps to move to persisting and caching data. Having stale data is generally considered OK, and you can update data as you need to instead.

This makes this issue even more important. We should expect, that one day, Discord will remove or at least severely limit the options to request all members of a server. When this happens, we should provide a way for users to have a (semi-)persistent cache for users.

felldo commented 3 years ago

However, since this suggestion was not well received because of the immutability, this issue proposes a mutable external cache that also allows sharing data between multiple shards and allows for different data stores like Redis.

I think an immutable cache is still possible. If I am not wrong with this cache design it should be possible to have a mutable or immutable cache without the need to have a more complex code base:

private void refreshData(long id) {
       if(api.isImmutableCache){
           return;
       }

        List<DataEntity> entities = dataStore.get(
            ServerTextChannelData.class, "ID", String.valueOf(id)
        );
        if (!entities.isEmpty()) {
            data = (ServerTextChannelData) entities.get(0);
        }
    }

With a simple flag that can be set we can decide whether we want to actually refresh the data from the cache or not. This results in either a mutable or immutable enity. Having this immutable behaviour should also be very performant as we do not refresh the data from the cache everytime a method is called and at the same time still allows us to share the cache among other bot instances.

Bastian commented 3 years ago

This results in either a mutable or immutable enity.

This will not work for collections (e.g., Server#getChannels()) though. These are always fetched from the cache and the cache itself is mutable.

felldo commented 3 years ago

Then this will result in a sort of semi immutable. But once you get the channels from the cache they are immutable entities which you can store in a variable to process them in your task. I think that's a good trade off though.

Javacord / Javacord

Concept: Cache module #805

Storing data objects

Using the data objects

Updating the cache

Considerations