Overview
Autonomous applications need a way to store application data within the Apocryph network. Naturally, this data needs to be available only to the application itself; in addition, it should be able to survive provider failures and similar extreme events. In this spike, we will run a database inside Apocryph in order to pave the way for future applications, in particular the Marketplace autonomous application.
Implementation
After researching and discarding PostgreSQL (due to the complexity of maintaining a globally-distributed cluster without benefiting at all from PostgreSQL's strong consistency in the case of the marketplace), it seems best to go with a suitable NoSQL database; in particular, ScyllaDB looks like a good candidate.
In this case, the storage for an application will be a single, massively-distributed ScyllaDB cluster. ScyllaDB will automatically manage replication across regions and should be able to pull data from regions that have it when it's not present locally - thus allowing an application to scale beyond the storage capacity of any individual server.
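As a rough illustration of how that cross-region replication would be configured - a sketch only; the keyspace name, datacenter names, and replication factors are placeholders rather than decisions made in this spike - the application could create its keyspace with NetworkTopologyStrategy through the gocql driver:

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Contact the locally-running Scylla node; the address is an assumption.
	cluster := gocql.NewCluster("127.0.0.1")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// NetworkTopologyStrategy keeps a configurable number of replicas per
	// datacenter, which is what lets the data survive the loss of an entire
	// provider/region. Datacenter names and factors here are placeholders.
	err = session.Query(`
		CREATE KEYSPACE IF NOT EXISTS marketplace
		WITH replication = {
			'class': 'NetworkTopologyStrategy',
			'provider_eu': 3,
			'provider_us': 3
		}`).Exec()
	if err != nil {
		log.Fatal(err)
	}
}
```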
On top of that, we will have scripts that manage joining new nodes and establishing secure connections to them. We will make sure these scripts hook into the Autoscaler autonomous application, allowing them to know where the other nodes are.
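A minimal sketch of such a hook, assuming a hypothetical HTTP endpoint on the Autoscaler that returns the currently-known instances (the endpoint path and response shape below are made up for illustration only):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// peerList is the assumed shape of the autoscaler's response.
type peerList struct {
	Addresses []string `json:"addresses"`
}

// fetchSeeds asks the (hypothetical) autoscaler endpoint for the currently
// known instances and turns them into a value usable as Scylla's --seeds flag.
func fetchSeeds(autoscalerURL string) (string, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(autoscalerURL + "/peers")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var peers peerList
	if err := json.NewDecoder(resp.Body).Decode(&peers); err != nil {
		return "", err
	}
	return strings.Join(peers.Addresses, ","), nil
}

func main() {
	seeds, err := fetchSeeds("http://autoscaler.local")
	if err != nil {
		panic(err)
	}
	// The resulting list would be handed to the Scylla container on startup,
	// e.g. as --seeds=<list>.
	fmt.Println("--seeds=" + seeds)
}
```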
For the application itself (not part of this spike), we will bundle it together with the database as one pod, thus enabling the application to directly connect to the locally-running database node and use it as it sees fit. At some point, we might have two autoscaled deployments, one for the frontend/application and the other for the database (or one deployment for DB+application and one for just the DB), but that will come later.
To give the database instances storage space, we could simply have each pod request a fixed amount of disk space (say, ~50 GB), possibly allowing multiple pods per node. Alternatively, we could extend the core protocol to allow either resizing volumes dynamically or attaching extra storage, so that the database can better scale to the available and needed capacity.
Implementation steps
Steps to get a DB cluster running within Apocryph:
[x] 1. Get the DB cluster running within either Docker Compose or multiple minikube instances
For ScyllaDB, read the relevant documentation on installing and running Scylla. One caveat is that we must avoid going through k8s operators (except perhaps as inspiration), since we don't have operators within Apocryph.
The main objectives of this step are getting information on the DB's requirements and deployment procedures.
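Once the step-1 cluster is up, a quick way to confirm that the nodes have actually joined each other is to connect to one of them and list its peers. A minimal sketch using the gocql driver (the contact address is an assumption):

```go
package main

import (
	"fmt"
	"log"
	"net"

	"github.com/gocql/gocql"
)

func main() {
	// Connect to one node of the test cluster; the address is an assumption.
	cluster := gocql.NewCluster("127.0.0.1")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// system.peers lists every other node this node has gossiped with,
	// along with the datacenter it was placed in.
	iter := session.Query(`SELECT peer, data_center FROM system.peers`).Iter()
	var peer net.IP
	var dc string
	for iter.Scan(&peer, &dc) {
		fmt.Printf("peer %s in datacenter %s\n", peer, dc)
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}
```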
[x] 2. Move the DB cluster to run as one/two Apocryph deployments - on one or two providers
In this step, we ideally want to make the deployment as automatic as possible - to the point of disregarding problems related to security and access control. Note that we would most likely have to make a deployment without the HTTP-based autoscaling from the core protocol, as the database is likely to be always up.
The main objective of this step is automating the DB's deployment procedure.
[ ] 3a. Make the DB cluster use the autoscaler to launch additional instances
The autoscaler's data on currently-active instances will have to be integrated with the cluster, so that the cluster can automatically rebalance around or deactivate failed instances. Depending on how the autoscaler keeps track of live instances, we might need to provide feedback to the autoscaler, so that it can officially retire instances that we can no longer reach. In addition, new instances launched by the autoscaler should be integrated into the cluster.
The main objective of this step is enhancing the autoscaler with the necessary hooks for the database, and integrating the two.
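A sketch of what that feedback could look like, assuming a hypothetical reporting endpoint on the autoscaler (the endpoint and payload are made up for illustration; only the TCP probe of Scylla's CQL port 9042 is concrete):

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net"
	"net/http"
	"time"
)

// reachable probes a peer's CQL port (9042) with a short timeout.
func reachable(addr string) bool {
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(addr, "9042"), 3*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	// In the real setup this list would come from the autoscaler itself.
	peers := []string{"10.0.0.2", "10.0.0.3"}

	var dead []string
	for _, p := range peers {
		if !reachable(p) {
			dead = append(dead, p)
		}
	}
	if len(dead) == 0 {
		return
	}

	// Report the unreachable instances back, so the autoscaler can officially
	// retire them and the cluster can rebalance onto the remaining nodes.
	body, err := json.Marshal(map[string][]string{"unreachable": dead})
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("http://autoscaler.local/report-unreachable", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```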
[-] 3b. Make the DB cluster use proper aTLS for communications (Partially done: using Nebula; not fully working yet)
While in step 2 we just used raw TCP sockets, now it's time to secure all communications. At first, this can probably happen at the DB container level, but in general this is something we would like the autoscaler or the core protocol to manage if possible. Ideally, we will use aTLS, which attests that the peer has a valid TEE and is running the same container image; as an extra step, we would then check that it's part of the same application cluster / has the same identity. Step 3a should provide us with a dynamically-changing list of peers that are part of our cluster and that we can connect to. If the DB component is used as part of an aApp which has its own wallet, we could, in theory, also accept TLS connections signed with our own wallet's key, since anyone who has the wallet's key has either already compromised the aApp or is the owner of the aApp.
The main objective of this step is figuring out how to build clusters in Apocryph - which, regardless of what happens in the other steps, is probably the most impactful part of the overall spike/task.
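On the Go side, a sketch of where such peer verification could be wired in, assuming a hypothetical verifyAttestation helper that checks the peer's TEE evidence and container image measurement (that helper is the actual hard part and is not shown); the wallet-key acceptance path would slot into the same callback:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
)

// verifyAttestation stands in for the real aTLS check: does the certificate
// carry valid TEE evidence for the expected container image? Not implemented
// here - it is the actual hard part of this step.
func verifyAttestation(cert *x509.Certificate) error {
	return errors.New("attestation verification not implemented in this sketch")
}

// clusterTLSConfig builds a TLS config that accepts a peer only if its
// certificate passes the attestation check above.
func clusterTLSConfig(ourCert tls.Certificate) *tls.Config {
	return &tls.Config{
		Certificates: []tls.Certificate{ourCert},
		// Standard CA-based verification is skipped; the custom callback
		// below is the only check that matters.
		InsecureSkipVerify: true,
		VerifyPeerCertificate: func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
			if len(rawCerts) == 0 {
				return errors.New("peer presented no certificate")
			}
			cert, err := x509.ParseCertificate(rawCerts[0])
			if err != nil {
				return err
			}
			// A further check that the peer belongs to the same application
			// cluster - or that its certificate is signed with the aApp's
			// own wallet key - would go right here.
			return verifyAttestation(cert)
		},
	}
}

func main() {
	// Loading the node's own certificate/key is out of scope for this sketch.
	_ = clusterTLSConfig(tls.Certificate{})
}
```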
[ ] 4. Deploy a DB cluster using the autoscaler on a pair of Apocryph nodes
The final step would be testing the whole setup, and making sure that it works even as nodes die and are recreated.