cryostatio / cryostat-web

Web front-end for Cryostat: Secure JDK Flight Recorder management for containerized JVMs
https://cryostat.io/
Other
10 stars 20 forks source link

[Bug] Topology start recording action should not silently restart all recordings #1035

Open tthvo opened 1 year ago

tthvo commented 1 year ago

Current Behavior:

When I start a recording in the topology view it appears to overwrite the previous recording from the Topolgy view. It is not a good idea to just assume overwriting is fine. This ultimately leads to data loss to the un-informed customer. Data Loss is never good in IT.

Commented by Scott.

Expected Behavior:

I think the best thing to do here is actually just ignore the request on any targets that already have such a recording running, since starting a new parallel active recording in the same running state doesn't actually capture any additional data. If the recording doesn't exist, it should be created and started. If it exists but has been stopped, it should be restarted. If it is already running just do nothing.

And maybe in the UI we can disable the action to start a recording if all of the child nodes already have the recording in the running state

Suggested by Andrew.

Anything else:

Reference: #906

Depends on: https://github.com/cryostatio/cryostat/issues/1492

andrewazores commented 1 year ago

And maybe in the UI we can disable the action to start a recording if all of the child nodes already have the recording in the running state

This would be a -web feature since it involves a UI component state,

I think the best thing to do here is actually just ignore the request on any targets that already have such a recording running, since starting a new parallel active recording in the same running state doesn't actually capture any additional data. If the recording doesn't exist, it should be created and started. If it exists but has been stopped, it should be restarted. If it is already running just do nothing.

This I think probably can and should be done server-side.

https://github.com/cryostatio/cryostat/blob/2b0cd5adacf8f928f8666f36663b0a5b31e0e76b/src/main/java/io/cryostat/net/web/http/api/v2/graph/StartRecordingOnTargetMutator.java#L181

https://github.com/cryostatio/cryostat/blob/2b0cd5adacf8f928f8666f36663b0a5b31e0e76b/src/main/java/io/cryostat/recordings/RecordingTargetHelper.java#L157

Currently, when requesting to start a recording, the client can use a query parameter or GraphQL argument restart to ask the server to close and re-open any recordings that already exist with the same name as in the new request. If this parameter is not provided or is false then the server will either create the new recording or will fail with an exception. If the parameter is true then the server will either create the new recording, or will stop the existing recording and start a new one with the other request parameters (event template, maxAge/maxSize settings, etc.).

I think this behaviour makes sense to retain, and it is already part of published API so it would be nice to be able to enhance this while keeping the same request format. I would propose that the query parameter ?restart=[true|false] be kept but documented as deprecated, and a new replacement like ?replace=[always,stopped,never] introduced. ?replace=always would behave the same as ?restart=true, ?replace=never would behave the same as ?restart=false, and ?replace=stopped would implement the quoted behaviour. If replace is not specified then it should behave like ?replace=never, so that this also maintains consistency and compatibility with the existing ?restart=false default behaviour.

tthvo commented 1 year ago

That would be a much better idea! Open backend issue: https://github.com/cryostatio/cryostat/issues/1492

aali309 commented 1 year ago

@andrewazores @tthvo I think this has been solved? I tried reproducing and all seems good. Recording with the same name wont start and it will produce and error saying recording with than name exists. if restart is always it restarts the recording in stopped or running mode. ?? Any suggestions?

andrewazores commented 1 year ago

I think this probably is solved - your https://github.com/cryostatio/cryostat/pull/1587 would have been the backend fix.

andrewazores commented 1 year ago

To clarify though, it sounds like the workflow you're describing is based on the Recordings > Create Recording view? This issue refers to the bulk creation action on the Topology view:

Screenshot_2023-10-05_12-32-36

tthvo commented 1 year ago

Can we do something like:

If there is at least 1 target that has the active recording in RUNNING state, open a modal to ask the user if they want to continue the action.

For this action, we can reuse the following with additional filter (i.e. state == RUNNING).

https://github.com/cryostatio/cryostat-web/blob/ed7c2011bcd0d2c551309af6a235145dfdf7cb1b/src/app/Topology/Actions/utils.tsx#L42

Of course, if allowed, we can do the same for Delete Recording just to keep it consistent with the rest of the "dangerous" deletion actions elsewhere.

andrewazores commented 1 year ago

If there is at least 1 target that has the active recording in RUNNING state, open a modal to ask the user if they want to continue the action.

And not prompt if all the targets have no recordings... ? I think if a prompt is shown at all on this action, it should always be shown (until the user checks the "don't ask again" box), and the request to the backend should use ?replace=stopped. This way it is a non-destructive action and will never cause JFR data loss.

If we're displaying a modal we could also ask the user which event template they want to use...

Of course, if allowed, we can do the same for Delete Recording just to keep it consistent with the rest of the "dangerous" deletion actions elsewhere.

Yes, I think the Stop and Delete actions here should have a confirmation modal.

tthvo commented 1 year ago

And not prompt if all the targets have no recordings... ? I think if a prompt is shown at all on this action, it should always be shown (until the user checks the "don't ask again" box), and the request to the backend should use ?replace=stopped. This way it is a non-destructive action and will never cause JFR data loss.

Oh yess this sounds good!! Was just concerned about a bunch of error notifications when recordings exist.

I am thinking if we go in this direction, why not make it into a modal for interactive recording creation. Basically, CreateRecording view, so users can create any recording, not just a special predefine one. As for other actions, there can be a list of recordings to choose from. All should be supported neatly with GraphQl?

It should fit our goal for multi target design later on. Wdyt?

andrewazores commented 1 year ago

Oh yess this sounds good!! Was just concerned about a bunch of error notifications when recordings exist.

I don't think there should be error notifications if we use replace=stopped.

I am thinking if we go in this direction, why not make it into a modal for interactive recording creation. Basically, CreateRecording view, so users can create any recording, not just a special predefine one.

See last point below for more detail - but I think this could be a good idea, with some refinement. My first thought is that this might just feel like it gets in the way of achieving the user's goal of just getting some JFR data from their applications. Maybe instead of launching a full custom dialog every time, this uses a configuration that is tucked away in the client-side settings, similar to what the Automated Analysis Dashboard card does.

As for other actions, there can be a list of recordings to choose from. All should be supported neatly with GraphQl?

You mean for the stop/delete/archive actions? I think just always (re)starting a recording with a common name/label and then doing the stop/delete/archive actions based on that name/label is good enough for this case. Otherwise, there's a whole lot of extra complexity about which recordings show up in that list since it has to reflect the various options available across all of the targets in the selection, and how those actions behave depending on which recordings we display and if the chosen recording is actually present in all of the targets, etc. I think this is already a bit of a pain point in our Automated Rules UX when the user has to enter a matchExpression that matches a set of targets in order to select a Target-provided Event Template, but it isn't necessarily clear what is happening there, and JMX auth/SSL misconfigurations also get in the way, etc.

It should fit our goal for multi target design later on. Wdyt?

I think so, but we need to be careful about our designs so that the UX isn't just throwing more and more endless possibilities and options at the user. It would probably be better to have two or three different simpler workflows all built around this idea and on top of these backend capabilities, rather than to build one giant generic interface that lets the user do everything but requires them to deeply study every option to figure out what makes sense.

tthvo commented 1 year ago

I don't think there should be error notifications if we use replace=stopped.

Ah sorry I was looking at this:

https://github.com/cryostatio/cryostat/blob/e9ddcaf6503df2e7bc0ef35cc418d0cb01730c7e/src/main/java/io/cryostat/recordings/RecordingTargetHelper.java#L207

Looks like it errors out if return false on line 151. And error will be extracted and notifications will send as here.

https://github.com/cryostatio/cryostat-web/blob/ed7c2011bcd0d2c551309af6a235145dfdf7cb1b/src/app/Topology/Actions/utils.tsx#L296

The function can just be refined to recognize "recording exists" error and silent it. What do u think @andrewazores?

I think so, but we need to be careful about our designs so that the UX isn't just throwing more and more endless possibilities and options at the user. It would probably be better to have two or three different simpler workflows all built around this idea and on top of these backend capabilities, rather than to build one giant generic interface that lets the user do everything but requires them to deeply study every option to figure out what makes sense.

Right! I didn't think through enough. Looks likes the idea complicated things even more, especially the archive/stop/delete action.

Maybe instead of launching a full custom dialog every time, this uses a configuration that is tucked away in the client-side settings, similar to what the Automated Analysis Dashboard card does.

The idea with AA Card sounds fit for this case.

tthvo commented 1 year ago

See last point below for more detail - but I think this could be a good idea, with some refinement.

Another simpler way to achieve this could be after we have a design for multi-target UI: The group action can just redirect to corresponding view (i.e. create recording) with descendant targets pre-selected.

andrewazores commented 1 year ago

Ah sorry I was looking at this:

https://github.com/cryostatio/cryostat/blob/e9ddcaf6503df2e7bc0ef35cc418d0cb01730c7e/src/main/java/io/cryostat/recordings/RecordingTargetHelper.java#L207

Looks like it errors out if return false on line 151. And error will be extracted and notifications will send as here.

cryostat-web/src/app/Topology/Actions/utils.tsx

Line 296 in ed7c201 export const notifyGroupActionErrors = (

The function can just be refined to recognize "recording exists" error and silent it. What do u think @andrewazores?

Better to have the server return some distinct status code or perhaps use a response header to indicate this particular "failure" case then, and check those conditions on the frontend rather than relying on parsing the message string.

tthvo commented 1 year ago

Sounds good! Tho, we are using GraphQL in this case so status code will still be 200? Looks like we can refine the error field of the GraphQL response to include failure case, status, and resolution steps?

andrewazores commented 1 year ago

Hmm. Yes, something like that makes sense, but I'd need to look into it more deeply. Could you or @aali309 do some researching on GraphQL error statuses and see if there is a best practice around this before we dive into code changes?