elastic / apm

Elastic Application Performance Monitoring - resources and general issue tracking for Elastic APM.
https://www.elastic.co/apm
Apache License 2.0

Improve scalability on ESS #184

Open bradennapier opened 4 years ago

bradennapier commented 4 years ago

Hey y'all! As always, thanks for your hard work on this product. I am a shareholder and user and love what you guys do! Although I do have the occasional little issue for which I like to write you guys novels to read, because reading words can keep your brain from going crazy after reading code all day (so, in a way, I am helping you! :-P)

Since I am referencing a previous issue I am going to just use that as a base for the template - see elastic/apm#151 & https://github.com/elastic/apm-agent-nodejs/issues/1385 - please don't hurt me! or worse... close my issue!

I am using 7.5.0 across the board and the nodejs agent.

Summary

We sample 0.1% of our requests and have a filter that ensures only sampled requests are sent to APM at all. I'm not sure if that filter broke, as this appears to have become an issue again with the latest versions.

So we are sending 0.1% of requests AND no metrics for the other 99.9% -- and even with an Elastic Cloud bill of around $1,000 a month, APM apparently needs MUCH more to keep up. As stated in the previous posts, the total cost to run APM is 2-100x the cost of running our API itself.
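For reference, the "only send sampled requests" filter described above can be sketched as a standalone function. This is a simplified illustration, not our exact implementation: in elastic-apm-node the real version is registered via `apm.addFilter()`, where returning a falsy value drops the payload, and the payload shape here is reduced to the fields that matter.

```javascript
// Standalone sketch of a filter that drops unsampled transaction payloads.
// In elastic-apm-node this would be registered with apm.addFilter();
// returning a falsy value tells the agent to discard the payload.
// The payload shape is simplified for illustration.
function dropUnsampled(payload) {
  if (payload.transaction && payload.transaction.sampled === false) {
    return null; // discard unsampled transaction payloads entirely
  }
  return payload; // keep sampled transactions, errors, metrics, ...
}
```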

I am also posting a screenshot of metrics - but keep in mind there were some deploys and reconfigurations in that time trying to figure this all out.

[screenshot: ingest volume metrics]

Configuration (Elastic Cloud)

The current Elastic Cloud configuration for Elasticsearch itself looks like the screenshots below; we just scaled it up. For now we are using 1 APM instance with 4GB -- scaling to 2 instances with 8GB does not appear to have any effect on this issue.

[screenshots: Elastic Cloud deployment configuration]

The Problem

At times when there does not appear to be any change in traffic, requests, or any other known metric, the queue gets full -- the problem is that this means I'm missing tons of requests and not getting a real picture of our traffic during these periods. When this happens, all the instances start spamming non-stop 503 errors.

Note: Yes, I have read the info on the 503 in the documentation (along with every other word in the documentation).

6|api  | APM Server transport error (503): Unexpected APM Server response
6|api  | APM Server accepted 40 events in the last request
6|api  | Error: queue is full
8|api  | APM Server transport error (503): Unexpected APM Server response
8|api  | APM Server accepted 0 events in the last request
8|api  | Error: queue is full
1|api  | APM Server transport error (503): Unexpected APM Server response
1|api  | APM Server accepted 0 events in the last request
1|api  | Error: queue is full
13|api | APM Server transport error (503): Unexpected APM Server response
13|api | APM Server accepted 10 events in the last request
13|api | Error: queue is full
10|api | APM Server transport error (503): Unexpected APM Server response
10|api | APM Server accepted 10 events in the last request
10|api | Error: queue is full
[... the same "APM Server transport error (503)" / "Error: queue is full" messages repeat continuously across all 16 api worker processes]
roncohen commented 4 years ago

Thanks for reaching out @bradennapier. The sampling rate you've set looks like 10%, not 0.1% :)

10% of 129k TPM is ~13k TPM, which is close to what you're seeing.

Remember, 0.1 is 10%, 0.01 is 1% and 0.001 is 0.1%. See more here: https://www.elastic.co/guide/en/apm/agent/nodejs/master/performance-tuning.html#performance-sampling.
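In code terms, the relationship between the configured rate and the resulting volume looks like this (the 129k TPM figure is taken from this thread; `transactionSampleRate` is the agent option the fractions map to):

```javascript
// transactionSampleRate is a fraction of 1, not a percentage.
const totalTpm = 129000; // observed transactions per minute (from the thread)

const tenPercent = totalTpm * 0.1;        // ~12900 -- roughly the ~13k TPM seen above
const onePercent = totalTpm * 0.01;       // ~1290
const pointOnePercent = totalTpm * 0.001; // ~129  -- what a true 0.1% rate would send

console.log(tenPercent, onePercent, pointOnePercent);
```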

roncohen commented 4 years ago

as an aside, it's interesting that you ingest almost as many "metric" documents as sampled transaction documents. Could you share some numbers on how many transaction groups you have? This roughly corresponds to the number of endpoints/routes you have.
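One way to get a rough answer is a cardinality aggregation on `transaction.name` (the field the APM UI groups transactions by in 7.x) over the transaction indices. A hedged sketch of the query body -- the index pattern mentioned in the comment is an assumption based on the default 7.x APM index naming:

```javascript
// Sketch: approximate the number of transaction groups by counting
// distinct transaction.name values with a cardinality aggregation.
function transactionGroupsQuery() {
  return {
    size: 0, // we only want the aggregation, not the hits
    aggs: {
      transaction_groups: {
        cardinality: { field: 'transaction.name' }, // approximate distinct count
      },
    },
  };
}

// With the @elastic/elasticsearch client, this body would be passed as e.g.:
//   client.search({ index: 'apm-*-transaction-*', body: transactionGroupsQuery() })
```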

bradennapier commented 4 years ago

Yeah, sorry, that is what I meant. I was writing out quite a bit there and copying over params, and got confused for a second :-)

We ingest a similar number because we had to filter out all documents that were not sampled -- your system could not handle them without our Elastic Cloud bill ramping up well into the $5-10k a month range. You can see the details in the links at the top, where we implemented that.

roncohen commented 4 years ago

OK I understand.

As @axw mentioned in https://github.com/elastic/apm/issues/151#issuecomment-534831316 we're working on giving you more options and better trade-offs when it comes to storage, ingestion and sampling. While we work on the improvements, my suggestion would be to lower the sample rate an additional notch. For busy sites, it's not uncommon to run at 0.1% for example.

You also have the option to ignore specific URLs if you're not getting much value from those being traced: https://www.elastic.co/guide/en/apm/agent/nodejs/master/configuration.html#ignore-urls
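Combining both suggestions, the agent setup might look like the minimal sketch below. The service name and URL patterns are placeholders, not anything from this thread; `ignoreUrls` accepts both strings and RegExps per the linked docs:

```javascript
// Hypothetical agent setup: lower the sample rate and exclude
// health checks and static assets from tracing via ignoreUrls.
const apm = require('elastic-apm-node').start({
  serviceName: 'api',                           // placeholder service name
  transactionSampleRate: 0.001,                 // 0.1% of requests, per the advice above
  ignoreUrls: ['/healthcheck', /^\/static\//],  // placeholder URL patterns
});
```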

teyou commented 4 years ago

For now using 1 instance of APM with 4GB -- scaling to 2 instances with 8GB does not appear to have any effect on this issue.

I am having the same issue with my elastic cloud cluster. No matter how I scale APM up, the APM servers are never able to process all the traffic from all of my agents.

Attached is a screenshot from when I had 3 x 4GB APM instances, configured as below:

[screenshots: stack monitoring view and APM instance configuration]

graphaelli commented 4 years ago

@teyou Thanks for chiming in. We understand the issues and are actively working on them, as mentioned in https://github.com/elastic/apm/issues/184#issuecomment-566554443. There are a number of overlaps here, but #104 will have the biggest impact. In addition, we're working on improving stack monitoring to more accurately reflect the limits being hit and to provide guidance on how and when to scale up APM Server nodes.

I'm going to rename this issue to more accurately reflect what we're after.

teyou commented 4 years ago

@graphaelli Thanks, looking forward to the improvements! It would be great if the team could introduce an auto-scaling feature to scale out during peak hours and scale in APM servers during off-peak hours. It would help reduce customers' bills.