chanzuckerberg / single-cell-data-portal

The data portal supporting the submission, exploration, and management of projects and datasets to cellxgene.
MIT License

[WMG] Load testing #5134

Open tihuan opened 1 year ago

tihuan commented 1 year ago

Context: Slack

Currently we are unsure how many concurrent users we can handle in WMG. Amanda pointed out that our concurrent user peak was 8 users / min in the last marketing push, so if we could at least handle 8 concurrent users, we'd probably be fine.

For this ticket, let's reach an understanding of how many concurrent calls we can make before performance degrades, and determine whether any configuration tweaks can give us some easy wins.

Definition of Done:

  1. Ensure we can handle 30 concurrent users - make config adjustments as necessary
  2. Find out the max number of concurrent users we can currently support
  3. Define a happy path user journey (proposed below) for load testing.
  4. Observe API calls throughout the user journey and simulate the query sequences at the scale of 30 concurrent users.
  5. PLEASE TEST ON STAGING: Scale up staging env temporarily to match prod config to simulate prod performance
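One way to approach items 1 and 2 is to step through increasing concurrency levels and stop at the first level where latency leaves an agreed budget. The sketch below is only an illustration of that idea: `measure`, the level list, and the p95 budget are all assumptions, not values agreed in this ticket. `measure(users)` is a hypothetical callable that would run the journey at the given concurrency and return the observed request latencies in seconds.

```python
from typing import Callable, Iterable, List, Optional


def p95(latencies: List[float]) -> float:
    """Return the 95th-percentile latency using the nearest-rank method."""
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def max_supported_users(
    measure: Callable[[int], List[float]],
    levels: Iterable[int],
    p95_budget_s: float,
) -> Optional[int]:
    """Return the highest tested level whose p95 latency stays within budget.

    Levels are tried in the order given; the search stops at the first
    level that breaches the budget (or returns no samples).
    Returns None if even the first level fails.
    """
    best = None
    for users in levels:
        latencies = measure(users)
        if not latencies or p95(latencies) > p95_budget_s:
            break
        best = users
    return best


if __name__ == "__main__":
    # Stand-in for a real measurement: latency grows linearly with load.
    def fake_measure(users: int) -> List[float]:
        return [users / 10.0] * 20

    print(max_supported_users(fake_measure, [8, 16, 30, 60], p95_budget_s=3.5))
```

With the stand-in measurement above, the search passes 8, 16, and 30 users and stops at 60. A real run would replace `fake_measure` with something that drives traffic against the scaled-up staging env.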

Feel free to write a script and make assumptions about the number of calls each user makes to the backend. For instance, 1 user on the site = calling, in conjunction: /collections/index, /datasets/index, /primary_filter_dimensions, /filters, and /query with 3 tissues, 3 genes, and one of each of the 5 filters.
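A minimal sketch of such a script, assuming each simulated user makes the five calls listed above in order and that users run in parallel threads. The `send` callable is hypothetical: it stands for whatever HTTP client wrapper is used, takes a path, and returns a status code. The base URL, HTTP methods, and the /query payload are not specified in this ticket, so they are left to `send`.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Paths one simulated user calls, taken from the example above.
USER_PATHS: List[str] = [
    "/collections/index",
    "/datasets/index",
    "/primary_filter_dimensions",
    "/filters",
    "/query",  # assumed: 3 tissues, 3 genes, one of each of the 5 filters
]

Row = Tuple[str, int, float]  # (path, status code, elapsed seconds)


def run_user(send: Callable[[str], int]) -> List[Row]:
    """Issue one user's calls sequentially and time each one."""
    rows = []
    for path in USER_PATHS:
        start = time.perf_counter()
        status = send(path)
        rows.append((path, status, time.perf_counter() - start))
    return rows


def run_load(send: Callable[[str], int], users: int) -> List[Row]:
    """Run `users` simulated users concurrently, one thread per user."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(run_user, send) for _ in range(users)]
        return [row for future in futures for row in future.result()]


if __name__ == "__main__":
    # Stand-in sender so the sketch runs without a network.
    rows = run_load(lambda path: 200, users=30)
    failures = [row for row in rows if row[1] != 200]
    print(f"{len(rows)} calls, {len(failures)} failures")
```

Passing `send` in keeps the driver independent of any particular HTTP library, so the same loop can be dry-run locally before pointing it at staging.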


Current proposed happy path user journey (prior to Caroline's adding publications to the filter endpoint):

Step 1. Land on WMG

Step 2. Select 3 tissues, 3 genes, one of each 5 filters (dataset, disease, ethnicity, publication, and sex)

Step 3. Add groupBy disease

Number of queries:

Step 1:

  1. First hits /collections/index and /datasets/index
  2. Then hits N collection endpoints (N being the number of collections in the env. For dev, it’s around 150) - browser sends ~6 requests concurrently
  3. After that, /primary_filter_dimensions and /filters

Step 2:

  1. Hit /query with 3 tissues, 3 genes, and one of each of the 5 filters (dataset, disease, ethnicity, publication, and sex)

Step 3:

  1. Hit /query with the previous query + groupBy disease
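As a rough sizing aid, the per-step counts above can be totalled. This sketch assumes exactly one request per endpoint listed and one request per collection in Step 1; the function names are illustrative only.

```python
from typing import Dict


def requests_per_journey(num_collections: int) -> Dict[str, int]:
    """Count requests per step of the proposed happy path."""
    return {
        # /collections/index + /datasets/index, then N collection
        # endpoints, then /primary_filter_dimensions + /filters
        "step1": 2 + num_collections + 2,
        "step2": 1,  # /query with tissues, genes, and filters
        "step3": 1,  # same /query plus groupBy disease
    }


def total_requests(num_collections: int, users: int) -> int:
    """Total requests if every user completes the full journey once."""
    return users * sum(requests_per_journey(num_collections).values())


if __name__ == "__main__":
    print(total_requests(num_collections=150, users=1))   # 156
    print(total_requests(num_collections=150, users=30))  # 4680
```

With the ~150 collections in dev, one journey is about 156 requests, so 30 users completing it once would issue roughly 4,680 requests, almost all of them in Step 1.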

tihuan commented 1 year ago

Saving old descriptions for future use:

👇

Problem:

  1. WMG endpoints are not able to handle a large number of requests and are being DDoSed by e2e tests
  2. Data Portal should handle peak load at 8K per historical data from Nik
  3. Gene Expression should handle peak load at 2K

Solution:

  1. Set up container to manage WMG endpoints
  2. Set up stress testing framework to ensure we can handle the loads above
  3. Also consider wider pre-marketing prep work in an epic to answer the relevant questions generated by CGPT below:
     1. What is the projected increase in the number of users per day? Determine the expected surge in user activity to estimate the scale of the impact on your systems.
     2. Have you conducted load testing? Perform rigorous load testing to simulate high user loads and identify any bottlenecks or limitations in your infrastructure. This will help you understand how your systems will handle the anticipated surge.
     3. Is your infrastructure scalable? Assess your infrastructure’s scalability and ensure it can handle increased traffic. Consider using cloud-based solutions that allow for easy scaling of resources to accommodate growing demands.
     4. What are the performance benchmarks? Establish performance benchmarks for your systems under regular conditions and determine acceptable response times. Compare these benchmarks against the projected surge to identify any potential issues.
     5. Have you optimized your code and databases? Review your codebase and database architecture to identify any potential performance optimizations. Implement caching mechanisms, query optimizations, and efficient algorithms to handle increased user loads.
     6. Do you have a caching strategy? Implementing an effective caching strategy can help reduce the load on your servers. Consider using caching mechanisms for frequently accessed data or content to improve response times and overall system performance.
     7. How will you handle user authentication and authorization? Ensure your authentication and authorization mechanisms can handle the increased load. Consider using scalable authentication services or implementing distributed session management techniques.
     8. Do you have a monitoring and alerting system in place? Implement a robust monitoring system to track key performance metrics, server health, and user activity. Configure alerts to notify you of any anomalies or potential issues during the surge.
     9. Have you planned for increased storage requirements? Estimate the additional storage requirements resulting from the surge in user activity. Ensure your infrastructure has enough storage capacity to handle increased data volumes.
     10. What is your disaster recovery plan? Prepare a disaster recovery plan to mitigate the impact of any potential failures or outages. Implement redundancy measures, backup systems, and ensure you have a plan for rapid recovery in case of any disruptions.
     11. Are your support and customer service teams prepared? Ensure your support and customer service teams are adequately trained and prepared to handle increased user inquiries, issues, and feedback during the surge.

dsadgat commented 1 year ago

@atarashansky could you post some of the load testing results? Seems this ticket is still unassigned, was this probing done? ty

atarashansky commented 1 year ago

@signechambers1 required for refinement based on recent anecdotal observations about WMG outages as a result of the recent marketing push