ECSTeam / cloudfoundry-top-plugin

Cloud Foundry CF CLI plugin - show top stats
Apache License 2.0

app crash stats can be inaccurate #14

Closed: drennalls closed this issue 5 years ago

drennalls commented 5 years ago

First, thanks for the work on this plugin, it's been tremendously helpful for troubleshooting our PCF development deployment!

In a recent troubleshooting session we had apps crashing due to the Diego cells being low on memory. I stopped a bunch of apps to free up some memory, things settled down, and there were only 3 apps left crashing (for other reasons). However, cf top still continued to display an alert like this: ALERT: N applications not in desired state (DCR != RCR column), where N jumped all over the place between 100 and 300. Manually going through orgs and running cf apps didn't reflect that; everything seemed to be fine. I then tried the firehose plugin, and the LRP numbers also didn't match what cf top was telling me.

For example, below it shows 538 for LRPsDesired and 535 for LRPsRunning:

08:23 $ cf nozzle -n | grep -i 'LRPsDesired\|LRPsRunning'
origin:"bbs" eventType:ValueMetric timestamp:1543497812788903677 deployment:"cf" job:"diego_database" index:"f6737d5e-ff3d-48cd-8ea3-69190cf14d1a" ip:"172.19.1.11" tags:<key:"instance_id" value:"f6737d5e-ff3d-48cd-8ea3-69190cf14d1a" > tags:<key:"source_id" value:"bbs" > valueMetric:<name:"LRPsRunning" value:535 unit:"Metric" >
origin:"bbs" eventType:ValueMetric timestamp:1543497812788915438 deployment:"cf" job:"diego_database" index:"f6737d5e-ff3d-48cd-8ea3-69190cf14d1a" ip:"172.19.1.11" tags:<key:"instance_id" value:"f6737d5e-ff3d-48cd-8ea3-69190cf14d1a" > tags:<key:"source_id" value:"bbs" > valueMetric:<name:"LRPsDesired" value:538 unit:"Metric" >

So I'm not sure why there's a discrepancy between the firehose metrics and what cf top is reporting. Also, when I drill down to a particular org and then to an app that's supposed to be crashing, I can see its RCR total alternating between 0 and 1, even though running cf logs on the app in parallel shows no crashing happening.

We've had a lot of apps crashing in the last few days; is it possible "old" events/historical data are being used somehow?

JonRavenscraft commented 5 years ago

+1 on this symptom - since we upgraded to PCF 2.2.5 and enabled GrootFS, the DCR/RCR values in cf top are often inaccurate. It would be awesome if anyone has any ideas on how to fix this!

drennalls commented 5 years ago

@JonRavenscraft We started seeing this as well after a recent upgrade to PCF 2.2.

skertz commented 5 years ago

Seeing something similar in PCF 2.3.5 (2.3.8) with GrootFS and the same differences in DCR/RCR. One item to note: when we go into the Apps Manager UI, we see the desired number of instances (DCR), except that one instance shows no CPU/memory etc. even though it says it's up. Could this be what cf top is reporting as not running?

kkellner-cl commented 5 years ago

"RCR" is "Reporting Container" -- cf top uses stats from the metrics of each container. If a container does not report in within 60 seconds (it should emit metrics every 30 seconds), cf top will consider the container to be down.

So one possibility is that cf top is running across a network that can't keep up with the firehose traffic, such that it's dropping firehose events. This should be flagged within cf top with a notice about inaccurate information.

I have not done a lot of testing in newer versions of PCF -- I'm wondering if the container metrics are being emitted at a different frequency now.
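
One way to check the emission interval is with the same firehose plugin used above: filter for ContainerMetric envelopes for a single app and compare successive timestamp values (nanoseconds since the epoch). The GUID below is a placeholder, and this assumes the nozzle prints ContainerMetric envelopes in the same style as the ValueMetric lines shown earlier:

cf nozzle -n | grep 'ContainerMetric' | grep '<your-app-guid>'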

kkellner-cl commented 5 years ago

@JonRavenscraft and @skertz, you both mentioned "GrootFS". What PCF settings do I need to modify to enable this? Any setup/config help you can provide to help me try to replicate this issue would be appreciated.

kkellner-cl commented 5 years ago

Seems like starting in PCF 2.4, GrootFS is not configurable -- it is always used.

Having said that, I'm testing against PCF 2.4 and have not been able to replicate this issue. Any help in providing a scenario that leads to incorrect DCR/RCR values would be appreciated.

kkellner commented 5 years ago

Regarding cf top and RCR -- RCR is defined as "reporting containers", which top derives from the firehose data. Each container is intended to report in every 15-30 seconds with its current telemetry (CPU, memory, etc.). If cf top does not hear from a given container for more than 2 minutes (which means it has missed several firehose container updates), then it considers the container "not reporting". This can happen for a number of reasons:

1) The firehose is dropping messages, which can happen if the firehose is very busy and the location where you are running cf top is not local to the platform (e.g., you are running it on a laptop, which means all firehose data has to navigate lots of network hops from the platform to your laptop, versus running it on a VM that is close to the platform from a network perspective).

or

2) I have seen this when a Diego cell is running at such high CPU that the containers on that cell do not have enough CPU cycles to report their telemetry. In this second case, PCF says everything is good, but technically the containers running on that Diego cell really are not running well.
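
For anyone curious about the mechanics, below is a minimal, self-contained Go sketch of the kind of bookkeeping described above (hypothetical names, not cf top's actual code): each container metric refreshes a per-instance timestamp, and a sweep counts the instances heard from within the 2-minute window. Note that both failure modes above look identical to this logic, since a dropped envelope and a never-emitted envelope both leave the timestamp stale.

package main

import (
	"fmt"
	"time"
)

// staleAfter mirrors the ~2-minute window described above: a container that
// has missed several 30-second metric updates is treated as not reporting.
const staleAfter = 2 * time.Minute

// instanceKey identifies one container instance of an app.
type instanceKey struct {
	appGUID string
	index   int
}

// lastSeen maps each instance to the timestamp of its latest container
// metric (in cf top these arrive as firehose envelopes; here we simulate).
var lastSeen = map[instanceKey]time.Time{}

// record notes that a container metric was received for an instance.
func record(appGUID string, index int, ts time.Time) {
	lastSeen[instanceKey{appGUID, index}] = ts
}

// reportingCount returns the RCR value: instances heard from recently.
func reportingCount(now time.Time) int {
	n := 0
	for _, ts := range lastSeen {
		if now.Sub(ts) <= staleAfter {
			n++
		}
	}
	return n
}

func main() {
	now := time.Now()
	record("app-guid", 0, now.Add(-20*time.Second)) // fresh: reported 20s ago
	record("app-guid", 1, now.Add(-3*time.Minute))  // stale: missed several updates
	desired := 2 // DCR: the desired instance count
	if rcr := reportingCount(now); rcr != desired {
		fmt.Printf("ALERT: %d applications not in desired state (DCR=%d, RCR=%d)\n",
			desired-rcr, desired, rcr)
	}
}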

I'm closing this issue as not reproducible (when running cf top local to the PCF platform).