bradennapier opened this issue 4 years ago
thanks for reaching out @bradennapier. The sampling rate you've set looks like 10% and not 0.1% :)
10% of 129k TPM is 13k TPM which is close to what you're seeing.
Remember, `0.1` is 10%, `0.01` is 1% and `0.001` is 0.1%. See more here: https://www.elastic.co/guide/en/apm/agent/nodejs/master/performance-tuning.html#performance-sampling.
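For reference, a minimal sketch of setting the rate when starting the Node.js agent programmatically might look like this (the `transactionSampleRate` option is the one covered in the performance-tuning link above; `serviceName` and `serverUrl` are placeholder values):

```js
// Minimal sketch: start the Elastic APM Node.js agent with a 0.1% sample rate.
// serviceName and serverUrl are placeholders for your own values.
const apm = require('elastic-apm-node').start({
  serviceName: 'my-service',            // placeholder
  serverUrl: 'https://apm.example.com', // placeholder
  transactionSampleRate: 0.001          // 0.001 = 0.1%, 0.01 = 1%, 0.1 = 10%
})
```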
as an aside, it's interesting that you ingest almost as many "metric" documents as sampled transaction documents. Could you share some numbers on how many transaction groups you have? This roughly corresponds to the number of endpoints/routes you have.
Yeah, sorry, that is what I meant. Was writing out quite a bit there and copying over params and got confused for a second :-)
We ingest similar numbers because we had to filter out all documents that were not sampled; your system could not handle it without ramping up the Elastic Cloud bill well into the $5-10k a month range -- you can see info on that in the links at the top, where we implemented that.
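(For context, this kind of filtering can be done with the agent's `addFilter()` hook, which drops an event when the callback returns a falsy value. The sketch below is only an approximation of such a filter; the `sampled` field check is an assumption about the serialized payload, not our exact production code.)

```js
// Sketch only: drop unsampled transactions before they are sent to APM Server.
// addFilter() drops an event when the callback returns a falsy value.
// The `sampled` check is an assumption about the payload shape, for illustration.
const apm = require('elastic-apm-node')

apm.addFilter((payload) => {
  if (payload.sampled === false) {
    return false // returning falsy drops the event
  }
  return payload
})
```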
OK I understand.
As @axw mentioned in https://github.com/elastic/apm/issues/151#issuecomment-534831316 we're working on giving you more options and better trade-offs when it comes to storage, ingestion and sampling. While we work on the improvements, my suggestion would be to lower the sample rate an additional notch. For busy sites, it's not uncommon to run at 0.1% for example.
You also have the option to ignore specific URLs if you're not getting much value from those being traced: https://www.elastic.co/guide/en/apm/agent/nodejs/master/configuration.html#ignore-urls
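A rough sketch of that option (the URL patterns below are placeholders, not recommendations):

```js
// Sketch: skip tracing for low-value endpoints via the ignoreUrls option
// documented at the link above. The patterns are placeholders.
require('elastic-apm-node').start({
  ignoreUrls: [
    '/healthcheck',   // string pattern (placeholder)
    /^\/internal\//   // regular expression (placeholder)
  ]
})
```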
I am having the same issue with my elastic cloud cluster. No matter how I scale APM up, the APM servers are never able to process all the traffic from all of my agents.
Attached is the screenshot when I have 3 x 4GB APM instances, configuration as below:
@teyou Thanks for chiming in, we understand the issues and are actively working on them as mentioned in https://github.com/elastic/apm/issues/184#issuecomment-566554443. There are a number of overlaps here, but #104 will have the biggest impact. In addition, we're working on improving stack monitoring to more accurately reflect the limits hit and to provide guidance on how and when to scale up APM Server nodes.
I'm going to rename this issue to more accurately reflect what we're after.
@graphaelli Thanks, looking forward to the improvements! It would be great if the team could introduce an auto-scaling feature to scale out APM servers during peak hours and scale them back in during off-peak hours; it would help keep customers' bills down.
Hey y'all! As always, thanks for your hard work on this product. I am a shareholder and user and love what you guys do! Although I do have the occasional little issue for which I like to write you guys novels to read, because reading words can keep your brain from going crazy from just reading code all day (so, in a way, I am helping you! :-P)
Since I am referencing a previous issue I am going to just use that as a base for the template - see elastic/apm#151 & https://github.com/elastic/apm-agent-nodejs/issues/1385 - please don't hurt me! or worse... close my issue!
I am using `7.5.0` across the board and the `nodejs` agent.

Summary

We sample `0.1%` of our requests and have a filter to make sure we only send sampled requests in any way to APM. Not sure if this was broken, as this appears to have become an issue again with the latest versions. So sending 0.1% AND not sending metrics for the other 99.9% of our requests -- and with around a $1,000 a month Elastic Cloud bill, APM apparently needs MUCH more to work. As stated in previous posts, the total cost to run APM is 2-100x more than to run our API itself.
I am also posting a screenshot of metrics - but keep in mind there were some deploys and reconfigurations in that time trying to figure this all out.
Configuration (Elastic Cloud)
Current Elastic Cloud configuration looks like below for the ES itself. Just scaled that up. For now using 1 instance of APM with 4GB -- scaling to 2 instances with 8GB does not appear to have any effect on this issue.
15 instances * 3 (base servers) + n (auto scale servers - currently 0)
So a minimum of 45 APM agents are running in prod. We also have other environments, which would just be 15 agents each with very few events, so 75 agents are running in total.

The Problem
At times when there does not appear to be any change in traffic, requests, or any other known metric, the queue gets full -- the problem is that this means I'm missing tons of requests and not getting a real picture of our traffic during these times. When this happens, all the instances start spamming non-stop 503 errors.