+1 on this symptom - since we upgraded to PCF 2.2.5 and enabled GrootFS, the DCR/RCR values in cf are often inaccurate. It would be awesome if anyone has any ideas on how to fix this!
@JonRavenscraft We started seeing this as well after a recent upgrade to PCF 2.2.
Seeing similar behavior in PCF 2.3.5 (and 2.3.8) with GrootFS and the same differences in DCR/RCR. One item to note: when we go into the Apps Manager UI, we see the expected number of instances (DCR), except that one instance shows no CPU/memory etc. even though it says it's up. Could this be what cf top is reporting as not running?
"RCR" is "Reporting Container" -- cf top
uses stats from the metrics of each container. If a container does not report in within 60 seconds (it should emit metrics every 30 seconds), cf top
will consider the container to be down.
So one possibility is if cf top
is running across a network that can't keep up with the firehose network traffic such that its dropping firehose events. This should be flagged within cf top
with a noticed about inaccurate information.
I have not done a lot of testing in newer versions of PCF -- I'm wondering if the container metrics are being emitted at a different frequency now.
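To make that "not reporting" logic concrete, here is a minimal Go sketch of how a firehose consumer like cf top could track reporting containers. The type and function names (`ReportTracker`, `MarkReported`, `CountStale`) are hypothetical, and the 60-second cutoff is just the window described above -- this is not the plugin's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// instanceKey identifies one app container instance.
type instanceKey struct {
	appGUID string
	index   int32
}

// ReportTracker remembers when each container instance last emitted a
// container metric, so stale instances can be counted as "not reporting".
type ReportTracker struct {
	lastSeen map[instanceKey]time.Time
	cutoff   time.Duration
}

func NewReportTracker(cutoff time.Duration) *ReportTracker {
	return &ReportTracker{lastSeen: make(map[instanceKey]time.Time), cutoff: cutoff}
}

// MarkReported records a metric arrival for one instance.
func (t *ReportTracker) MarkReported(appGUID string, index int32, at time.Time) {
	t.lastSeen[instanceKey{appGUID, index}] = at
}

// CountStale returns how many tracked instances have not reported within
// the cutoff window -- roughly the DCR != RCR situation discussed above.
func (t *ReportTracker) CountStale(now time.Time) int {
	stale := 0
	for _, seen := range t.lastSeen {
		if now.Sub(seen) > t.cutoff {
			stale++
		}
	}
	return stale
}

func main() {
	tr := NewReportTracker(60 * time.Second)
	tr.MarkReported("app-guid-1", 0, time.Now().Add(-90*time.Second)) // stale
	tr.MarkReported("app-guid-1", 1, time.Now())                      // fresh
	fmt.Println("not reporting:", tr.CountStale(time.Now()))          // prints 1
}
```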
@JonRavenscraft and @skertz you both mentioned "GrootFS". What PCF settings do I need to modify to enable this? Any setup/config help you can provide to try and replicate this issue would be appreciated.
Seems like starting in PCF 2.4 GrootFS is not configurable -- it is always used.
Having said that, I'm testing against PCF 2.4 and have not been able to replicate this issue. Any help in providing a scenario that leads to incorrect DCR/RCR values would be appreciated.
Regarding cf top and RCR -- RCR is defined as reporting containers, which top gets from the firehose data. Each container is intended to report in every 15-30 seconds with its current telemetry (CPU, memory, etc). If cf top does not hear from a given container for more than 2 minutes (which means it has missed several firehose container updates), then it considers the container as "not reporting". This can happen for a number of reasons.
1) The firehose is dropping messages -- which can happen if the firehose is very busy and the location where you are running cf top is not local to the platform (e.g., you are running it on a laptop, which means all firehose data has to navigate from the platform through lots of network hops to get to your laptop, vs. running it on a VM that is close to the platform from a network perspective).
or
2) I have seen this when a Diego cell is running at such high CPU that the containers on that Diego cell do not have enough CPU cycles to report their telemetry. In that case, PCF says everything is good, but technically the containers running on that Diego cell really are not running well. One way to check what containers are actually emitting is sketched below.
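To see for yourself whether containers on a suspect cell are emitting metrics, you can subscribe to the firehose directly. Below is a minimal sketch, assuming the `github.com/cloudfoundry/noaa/consumer` and `github.com/cloudfoundry/sonde-go/events` packages; the traffic controller URL, auth token, and subscription ID are placeholders you'd fill in for your own foundation, and the skip-SSL setting is only for a lab environment.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"time"

	"github.com/cloudfoundry/noaa/consumer"
	"github.com/cloudfoundry/sonde-go/events"
)

func main() {
	// Placeholders -- point these at your own foundation.
	trafficControllerURL := "wss://doppler.sys.example.com:443"
	authToken := "bearer <oauth-token>" // e.g. output of `cf oauth-token`

	c := consumer.New(trafficControllerURL, &tls.Config{InsecureSkipVerify: true}, nil)
	msgChan, errChan := c.Firehose("metric-check", authToken)

	// Tally ContainerMetric envelopes per app instance for one minute.
	counts := map[string]int{}
	timeout := time.After(60 * time.Second)

	for {
		select {
		case env := <-msgChan:
			if env.GetEventType() == events.Envelope_ContainerMetric {
				cm := env.GetContainerMetric()
				key := fmt.Sprintf("%s/%d", cm.GetApplicationId(), cm.GetInstanceIndex())
				counts[key]++
			}
		case err := <-errChan:
			if err != nil {
				fmt.Println("firehose error:", err)
			}
		case <-timeout:
			// An instance that shows up with only 0-1 envelopes (or not at all)
			// is the kind of container cf top would flag as not reporting.
			for key, n := range counts {
				fmt.Printf("%s reported %d times in 60s\n", key, n)
			}
			return
		}
	}
}
```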
I'm closing this issue as not reproducible (when running cf top local to the PCF platform).
First, thanks for the work on this plugin, it's been tremendously helpful for troubleshooting our PCF development deployment!
In a recent troubleshooting session we had apps crashing due to the Diego cells being low on memory. I stopped a bunch of apps to free up some memory, things settled down, and there were only 3 apps left crashing (for other reasons). However, cf top still continued to display an alert like this: ALERT: N applications not in desired state (DCR != RCR column), where N jumped all over the place between 100-300. Manually going through orgs and doing cf apps didn't reflect that; everything seemed to be fine.

I then tried the firehose plugin and the LRP numbers also didn't match what cf top was telling me. For example, below it shows 538 for LRPsDesired and 535 for LRPsRunning, so I'm not sure why there's a discrepancy between the firehose metrics and what cf top is reporting. Also, when I drill down to a particular org and then to an app that's supposed to be crashing, I can see its RCR total alternating between 0 and 1, even though doing a cf logs on the app in parallel shows no crashing happening.

We've had a lot of apps crashing in the last few days -- is it possible "old" events/historical data are being used somehow?
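On the "old events" question: whether cf top actually does this is not confirmed here, but a firehose consumer can guard against stale or replayed data by checking each envelope's timestamp (nanoseconds since the Unix epoch) before counting it. A minimal sketch, with the freshness window chosen arbitrarily:

```go
package main

import (
	"fmt"
	"time"
)

// isFresh reports whether a firehose envelope timestamp (nanoseconds since
// the Unix epoch, as carried in events.Envelope.Timestamp) falls within the
// given freshness window. A consumer could skip stale envelopes rather than
// count them toward reporting-container totals.
func isFresh(envelopeTimestampNanos int64, window time.Duration, now time.Time) bool {
	emitted := time.Unix(0, envelopeTimestampNanos)
	return now.Sub(emitted) <= window
}

func main() {
	now := time.Now()
	recent := now.Add(-20 * time.Second).UnixNano()
	old := now.Add(-10 * time.Minute).UnixNano()

	fmt.Println(isFresh(recent, time.Minute, now)) // true  -- count it
	fmt.Println(isFresh(old, time.Minute, now))    // false -- skip it
}
```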