atsign-foundation / at_server

The software implementation of Atsign's core technology
https://docs.atsign.com
BSD 3-Clause "New" or "Revised" License

Come up with architecture for a scalable and highly available secondary server #1733

Open VJag opened 5 months ago

VJag commented 5 months ago

Is your feature request related to a problem? Please describe.

The secondary server as it is today is not scalable; the only scaling option available is vertical scaling. Because of the way our persistence works, it is not possible to run a secondary server per region and honor data locality, replication, etc.

Describe the solution you'd like

Come up with a design for the problem described above. The task will have the following sub-tasks:

Requirements Analysis:

Understand the current and anticipated future requirements: data volume, traffic patterns, and performance expectations.

Scalability Considerations:

Horizontal Scaling: Plan for distributing the load across multiple servers or instances, and implement load-balancing mechanisms to distribute incoming traffic evenly.

Vertical Scaling: Consider scaling up resources (CPU, RAM) on individual servers if needed, although horizontal scaling often provides better long-term scalability.

High Availability Design:

Redundancy and Failover: Design the system with redundancy in mind to mitigate single points of failure, and implement failover mechanisms to ensure continuous service in case of server failures.

Replication: Employ data replication strategies to duplicate data across multiple servers or regions for resilience and data availability.

Fault-Tolerant Architecture: Use fault-tolerant technologies and practices to handle failures without service disruptions.

Database Considerations:

Scalable Database: Choose a database system that can scale horizontally.

Replication and Backups: Implement database replication for data redundancy and backups to prevent data loss in case of failures.

Load Balancing and Traffic Management:

Implement load balancers to distribute incoming traffic evenly across multiple servers or regions.
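For illustration only, here is a minimal sketch (in Python, with hypothetical host names and ports) of the kind of health-checked, round-robin routing a load-balancing layer in front of secondary replicas could perform; an actual design would more likely rely on an off-the-shelf load balancer or service mesh:

```python
import itertools
import socket

# Hypothetical replica endpoints for one atSign's secondary server; a real
# deployment would discover these dynamically rather than hard-code them.
REPLICAS = [
    ("secondary-us.example.com", 6464),
    ("secondary-eu.example.com", 6464),
]


def is_healthy(host, port, timeout=2.0):
    """Crude health check: can we open a TCP connection within the timeout?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_replicas_round_robin(replicas):
    """Yield healthy replicas in round-robin order, skipping failed ones.
    Note: this loops forever if every replica is down; a real balancer
    would surface that condition instead."""
    for host, port in itertools.cycle(replicas):
        if is_healthy(host, port):
            yield host, port


# Usage: pick the next healthy replica for each incoming client connection.
picker = healthy_replicas_round_robin(REPLICAS)
host, port = next(picker)
print(f"routing client to {host}:{port}")
```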

Describe alternatives you've considered

No response

Additional context

No response

VJag commented 5 months ago

Here is the document that captures aspects related to this ticket : https://docs.google.com/document/d/1UNgcTBlvDSCqX5N-ai05vl6vrGBfTtgdPh-j50TibkU/edit?usp=sharing

VJag commented 4 months ago

As part of the ongoing ticket, our team has initiated efforts to benchmark server performance. The key objectives for this task include:

  1. Designing and implementing a robust framework for stress testing. The code should be optimized to seamlessly run within a virtual machine (VM) environment.
  2. Developing the actual stress testing code, running it locally, and ensuring that it not only accomplishes its intended purposes but also operates efficiently within a VM setting.
  3. Presenting and demonstrating the outcomes of this work to a broader audience through an architecture call or stand-up meeting.
  4. Collaborating with Chris to execute the benchmarking against the production environment atSigns.

The timeline for achieving objectives 3 and 4 is within the current sprint (PR 81). The results of this benchmarking exercise will play a crucial role in shaping our subsequent actions and decisions.
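As a rough illustration of the shape such a stress-testing harness could take (this is not the team's actual framework; the host, port, verb, and client counts below are placeholders, and the example assumes a local non-TLS test instance), here is a minimal concurrent-client driver that records per-request latency:

```python
import socket
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

HOST, PORT = "localhost", 6464   # placeholder test-server endpoint (non-TLS)
CLIENTS = 50                     # number of concurrent simulated clients
REQUESTS_PER_CLIENT = 100


def run_client(_):
    """Open one connection and time a batch of simple requests."""
    latencies = []
    with socket.create_connection((HOST, PORT)) as sock:
        sock.recv(1024)                      # consume the server's initial prompt
        for _ in range(REQUESTS_PER_CLIENT):
            start = time.perf_counter()
            sock.sendall(b"scan\n")          # unauthenticated verb, for illustration
            sock.recv(65536)                 # good enough for a sketch; a real
                                             # harness would read to the next prompt
            latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CLIENTS) as pool:
        results = [t for batch in pool.map(run_client, range(CLIENTS)) for t in batch]
    print(f"requests:    {len(results)}")
    print(f"p50 latency: {statistics.median(results) * 1000:.1f} ms")
    print(f"p95 latency: {statistics.quantiles(results, n=20)[18] * 1000:.1f} ms")
```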

purnimavenkatasubbu commented 4 months ago

In PR 81, we wrote tests for the following scenarios, along with documentation, and demonstrated them in the architecture call:

  1. Parallel_put_sync test
  2. sync_pull_load test
  3. parallel_notify_same_atsign test
  4. monitor_test

The goal in this sprint is to expand the tests to cover all the notification scenarios and work with Chris to execute the benchmarking against the production environment atSigns.

The documentation of the tests completed so far can be found in the branch.

We are also planning to explore Locust for load testing: https://locust.io/

purnimavenkatasubbu commented 3 months ago

Phase 1 of writing scripts for the above-mentioned scenarios is done, and we have moved on to a Locust script that runs multiple clients performing unauthenticated scan/info requests. The Locust script can be found at locust_test_script.
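For context, here is a hedged sketch of the general shape such a Locust script can take when the target speaks a raw socket protocol rather than HTTP (this is not necessarily the linked locust_test_script; the host, port, and TLS settings are placeholders for a test setup):

```python
import socket
import ssl
import time

from locust import User, task, between

SECONDARY_HOST = "localhost"   # placeholder secondary server host
SECONDARY_PORT = 6464          # placeholder secondary server port


class ScanUser(User):
    """Simulated client that opens a TLS socket to a secondary server and
    repeatedly issues unauthenticated scan requests."""

    wait_time = between(1, 2)

    def on_start(self):
        raw = socket.create_connection((SECONDARY_HOST, SECONDARY_PORT))
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE        # test setup only
        self.sock = ctx.wrap_socket(raw, server_hostname=SECONDARY_HOST)
        self.sock.recv(1024)                   # consume the initial prompt

    @task
    def scan(self):
        start = time.perf_counter()
        exception, data = None, b""
        try:
            self.sock.sendall(b"scan\n")
            data = self.sock.recv(65536)
        except OSError as e:
            exception = e
        # Report the measurement to Locust's statistics engine.
        self.environment.events.request.fire(
            request_type="atprotocol",
            name="scan",
            response_time=(time.perf_counter() - start) * 1000,
            response_length=len(data),
            response=None,
            context={},
            exception=exception,
        )

    def on_stop(self):
        self.sock.close()
```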

The next goal is to narrow down on performance, i.e., to collect metrics for the following scenarios and be able to predict the point at which the server breaks down.

Details collected so far can be seen in the following sheet: performance_metrics

purnimavenkatasubbu commented 3 months ago

During this sprint, we used the Locust script to run a series of performance tests aimed at evaluating the scalability and resilience of our server infrastructure. Specifically, we focused on lookup tests in which we systematically increased both the number of client connections and the number of keys stored within the server.

Test Conditions:

Number of Keys: We systematically increased the quantity of keys stored within the server. We initiated the test with 5 unique keys and incrementally expanded it to 10, 100, 1000, and eventually 10,000 keys.

Number of Clients: Simultaneously, we varied the number of client connections accessing the server. Beginning with a single client, we progressively scaled up the load to 10, 100, 200, 500, 1000, and ultimately 10,000 concurrent clients.
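A sweep over these client counts can be automated around the Locust command line; a rough sketch (the Locust file name, spawn rates, and run time are illustrative, not the exact values used):

```python
import subprocess

CLIENT_COUNTS = [1, 10, 100, 200, 500, 1000, 10000]

for users in CLIENT_COUNTS:
    # Run Locust headless for a fixed duration and write per-run CSV stats.
    subprocess.run(
        [
            "locust",
            "-f", "locustfile.py",        # hypothetical Locust script
            "--headless",
            "-u", str(users),             # number of concurrent users
            "-r", str(min(users, 100)),   # spawn rate (users started per second)
            "--run-time", "2m",
            "--csv", f"results_{users}_users",
        ],
        check=True,
    )
```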

All the collected performance test metrics can be found in the following sheet: Load_testing_metrics

purnimavenkatasubbu commented 2 months ago

We collected metrics by running both the client and the server on the same VM. The metrics can be found in the following sheet: Same_VM_Metrics

Next, we aim to run the server and client on separate virtual machines (VMs) to ensure that running both on the same machine does not skew the measured performance.