cryostatio / cryostat


[Bug] Agent refresh can cause target loss notification and removal of active recordings #543

Open andrewazores opened 5 days ago

andrewazores commented 5 days ago

Current Behavior

When Cryostat is deployed alongside applications that use the Cryostat Agent, the Agent registration refresh system has a bug.

https://github.com/cryostatio/cryostat/blob/727d14bbed2131eccd44f01dd23306d01e49279d/src/main/resources/application.properties#L7

https://github.com/cryostatio/cryostat/blob/727d14bbed2131eccd44f01dd23306d01e49279d/src/main/java/io/cryostat/discovery/Discovery.java#L275

When the discovery ping period elapses, Cryostat sends a request to the Agent instance, informing it that its token is about to expire and that it should refresh its registration to acquire a new one.

https://github.com/cryostatio/cryostat/blob/727d14bbed2131eccd44f01dd23306d01e49279d/src/main/java/io/cryostat/discovery/Discovery.java#L432

When the Agent does this, Cryostat seems to respond by dropping the Agent (plugin) registration from the database, which cascades to removing its child Target nodes, and then creating a new entry for the plugin and restoring its child nodes (the Target instance(s)). Removing the Target nodes also cascades to deleting any Active Recordings associated with each of those nodes. The result is an interruption to JFR recording and a gap in the captured data.
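
To make the suspected cascade concrete, here is a minimal in-memory sketch in plain Java. It is not Cryostat's code: the `Plugin` and `Target` classes, the realm key, the token strings, and the sample URL are all hypothetical stand-ins, and the real entities are persisted in a database rather than a map. It only illustrates why restoring the Target nodes after a delete does not bring the recordings back.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CascadeSketch {
    // Hypothetical stand-in for a discovered Target node and its Active Recordings.
    static class Target {
        final String connectUrl;
        final List<String> activeRecordings = new ArrayList<>();

        Target(String connectUrl) {
            this.connectUrl = connectUrl;
        }
    }

    // Hypothetical stand-in for an Agent (plugin) registration.
    static class Plugin {
        final String token;
        final List<Target> targets = new ArrayList<>();

        Plugin(String token) {
            this.token = token;
        }
    }

    // Registrations keyed by a hypothetical realm name.
    static final Map<String, Plugin> plugins = new HashMap<>();

    // Models the suspected behavior: the old entry is dropped, taking its
    // targets and their recordings with it, then a new entry is built.
    static void refreshByDeleteAndRecreate(String realm, String newToken, List<String> urls) {
        plugins.remove(realm);
        Plugin fresh = new Plugin(newToken);
        for (String url : urls) {
            fresh.targets.add(new Target(url)); // same URL, but a brand-new node
        }
        plugins.put(realm, fresh);
    }

    // Registers one target with one manual recording, refreshes, and counts
    // the recordings that remain on the restored target.
    static int recordingsAfterRefresh() {
        plugins.clear();
        Plugin plugin = new Plugin("token-1");
        Target target = new Target("http://sample-app.example:9977/");
        target.activeRecordings.add("manual-recording");
        plugin.targets.add(target);
        plugins.put("agent-realm", plugin);

        refreshByDeleteAndRecreate(
                "agent-realm", "token-2", List.of("http://sample-app.example:9977/"));
        return plugins.get("agent-realm").targets.get(0).activeRecordings.size();
    }

    public static void main(String[] args) {
        System.out.println("recordings after refresh: " + recordingsAfterRefresh());
    }
}
```

In this model the restored target has the same URL as before but is a different object, so the recording that was attached to the old node is gone.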

Expected Behavior

The registration flow should generally work in the same manner, but the refresh step should not cause the plugin entity to be deleted and recreated. This should preserve the Target node definitions and therefore the associated Active Recording definitions, so that there is no interruption to JFR data capture.
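
One way to picture the expected behavior is a refresh that only swaps the token on the existing plugin entry and leaves its children untouched. The sketch below is a simplified in-memory model under that assumption, not a proposed patch: `Plugin`, `Target`, `refreshInPlace`, the token strings, and the sample URL are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class InPlaceRefreshSketch {
    // Hypothetical stand-in for a discovered Target node and its Active Recordings.
    static class Target {
        final String connectUrl;
        final List<String> activeRecordings = new ArrayList<>();

        Target(String connectUrl) {
            this.connectUrl = connectUrl;
        }
    }

    // Hypothetical stand-in for an Agent (plugin) registration with a replaceable token.
    static class Plugin {
        String token;
        final List<Target> targets = new ArrayList<>();

        Plugin(String token) {
            this.token = token;
        }
    }

    // Refresh by updating the existing entry; child targets are never detached.
    static void refreshInPlace(Plugin plugin, String newToken) {
        plugin.token = newToken;
    }

    // Builds a plugin with one target and one manual recording.
    static Plugin samplePlugin() {
        Plugin plugin = new Plugin("token-1");
        Target target = new Target("http://sample-app.example:9977/");
        target.activeRecordings.add("manual-recording");
        plugin.targets.add(target);
        return plugin;
    }

    // Counts the recordings still attached to the target after a refresh.
    static int recordingsAfterRefresh() {
        Plugin plugin = samplePlugin();
        refreshInPlace(plugin, "token-2");
        return plugin.targets.get(0).activeRecordings.size();
    }

    // Returns the token held by the plugin after a refresh.
    static String tokenAfterRefresh() {
        Plugin plugin = samplePlugin();
        refreshInPlace(plugin, "token-2");
        return plugin.token;
    }

    public static void main(String[] args) {
        System.out.println("token after refresh: " + tokenAfterRefresh());
        System.out.println("recordings after refresh: " + recordingsAfterRefresh());
    }
}
```

Here the token changes while the target and its recording stay attached to the same plugin entry, which is the outcome the expected behavior asks for.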

Steps To Reproduce

  1. Deploy Cryostat and an Agent-equipped sample app. For example, ./smoketest.bash -Ot.
  2. Manually start a recording on the Agent-equipped sample app using the http:// target definition.
  3. Wait for the discovery ping period to elapse.
  4. Observe a matching pair of target discovery lost/found notifications, caused by the Agent plugin refresh step.
  5. Revisit the Active Recordings table with the Agent HTTP target definition selected. Notice that the manually-started recording is no longer present.

Environment

No response

Anything else?

No response

andrewazores commented 5 days ago

Hmm. I observe this in one particular OpenShift deployment, but it doesn't reproduce using the smoketest.bash setup outlined above. Digging deeper to see what the difference is.