elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Infra UI] Error handling in anomaly detection job does not give user any meaningful guidance #170500

Closed roshan-elastic closed 1 month ago

roshan-elastic commented 10 months ago

Description

When there is an error creating a host anomaly detection job, the error messaging presents no information to the user to help them resolve the issue.

For example, the following error is presented when there aren't enough shards to create the job:

Example generic error message (screenshot)

API response (screenshot)

Example video of workflow

https://github.com/elastic/kibana/assets/117740680/26e1daa1-a7c1-4d30-a181-4ff80d883989

Expectation

A user-friendly error state should be returned, with directions on how to resolve the issue.

elasticmachine commented 10 months ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

roshan-elastic commented 10 months ago

@vinaychandrasekhar FYI (cc @grabowskit)

roshan-elastic commented 10 months ago

Hey @smith,

I captured this as a separate issue - I don't think it's as high a priority as the serverless issue, but I wanted to capture it on the backlog.

As it's related, I wanted to share it.

botelastic[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

iblancof commented 1 month ago

While testing to reproduce the error, I encountered two different issues:

https://github.com/user-attachments/assets/4ef66f93-ae25-46a3-b15a-616bc11a1c81

1 - The error message received is not displayed correctly (the issue described in this ticket)

It appears that the UI expects the error in the response to have this shape so it can parse and display it:

"error": { "msg": "error message here" }

However, the error arrives in a different format, so no message is displayed because the UI cannot parse it:

"error": {
    "error": {
        "root_cause": [
            {
                "type": "status_exception",
                "reason": "[datafeed-kibana-metrics-ui-default-default-hosts_memory_usage] cannot retrieve field [system.memory.actual.used.pct] because it has no mappings"
            }
        ],
        "type": "status_exception",
        "reason": "[datafeed-kibana-metrics-ui-default-default-hosts_memory_usage] cannot retrieve field [system.memory.actual.used.pct] because it has no mappings"
    },
    "status": 400
}
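
For illustration, a minimal sketch (names are hypothetical, not the actual Infra UI code) of a parser that would tolerate both shapes:

```ts
// Illustrative only: tolerate both the shape the UI expects ({ msg }) and the
// nested Elasticsearch error body that is actually returned.
interface MlSetupError {
  msg?: string; // expected shape: { "error": { "msg": "..." } }
  error?: {
    reason?: string;
    root_cause?: Array<{ reason?: string }>;
  }; // observed shape: nested Elasticsearch error with status code
  status?: number;
}

function extractErrorMessage(error?: MlSetupError): string | undefined {
  if (!error) return undefined;
  if (error.msg) return error.msg;
  return error.error?.reason ?? error.error?.root_cause?.[0]?.reason;
}
```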

Questions

2 - It shows an error even though everything has been created correctly

From what I see, all jobs come back with success: true, and so do the datafeeds, but one of them - the one with the error - has started: false.

It seems the UI reports an error because it finds one in the response (with started: false), but apparently everything has been executed correctly. If you reopen the view or visit the list of jobs, they all appear to be created.

This is the response:

{
    "jobs": [
        {
            "id": "kibana-metrics-ui-default-default-hosts_memory_usage",
            "success": true
        },
        {
            "id": "kibana-metrics-ui-default-default-hosts_network_in",
            "success": true
        },
        {
            "id": "kibana-metrics-ui-default-default-hosts_network_out",
            "success": true
        }
    ],
    "datafeeds": [
        {
            "id": "datafeed-kibana-metrics-ui-default-default-hosts_memory_usage",
            "success": true,
            "started": false,
            "error": {
                "error": {
                    "root_cause": [
                        {
                            "type": "status_exception",
                            "reason": "[datafeed-kibana-metrics-ui-default-default-hosts_memory_usage] cannot retrieve field [system.memory.actual.used.pct] because it has no mappings"
                        }
                    ],
                    "type": "status_exception",
                    "reason": "[datafeed-kibana-metrics-ui-default-default-hosts_memory_usage] cannot retrieve field [system.memory.actual.used.pct] because it has no mappings"
                },
                "status": 400
            }
        },
        {
            "id": "datafeed-kibana-metrics-ui-default-default-hosts_network_in",
            "success": true,
            "started": true,
            "awaitingMlNodeAllocation": false
        },
        {
            "id": "datafeed-kibana-metrics-ui-default-default-hosts_network_out",
            "success": true,
            "started": true,
            "awaitingMlNodeAllocation": false
        }
    ],
    "kibana": {}
}
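
To illustrate the second point (types and function below are hypothetical, not the actual handler), the UI could fail only when something was not created, and report datafeeds that were created but not started separately:

```ts
// Hypothetical types mirroring the setup response shown above.
interface SetupResult {
  id: string;
  success: boolean;
  started?: boolean;
  error?: unknown;
}

interface SetupResponse {
  jobs: SetupResult[];
  datafeeds: SetupResult[];
}

// Treat the setup as failed only if a job or datafeed was not created;
// datafeeds that were created but did not start become warnings instead.
function summarizeSetup(res: SetupResponse) {
  const failed = [...res.jobs, ...res.datafeeds].filter((r) => !r.success);
  const notStarted = res.datafeeds.filter((r) => r.success && r.started === false);
  return { hasFailures: failed.length > 0, failed, notStarted };
}
```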

Questions

iblancof commented 1 month ago

@roshan-elastic I forgot to mention you in the previous comment.

I shared two situations I came across while trying to reproduce the bug and I've left a few questions that I think you might be able to answer to help us move forward with the issue.

crespocarlos commented 1 month ago

@iblancof thanks for the detailed evidence.

> Which kind of message would we like to show? The raw one in the reason field? Do we use the type field to have user-friendly messages? In that case, what should it be?

I think it should be a friendlier message that still presents the error details, so that users can understand what happened and try to solve the problem. wdyt @roshan-elastic?

> It shows an error even though everything has been created correctly

This UI is very simple. It could also tell users if one or more jobs had problems or are stopped. It looks like in your recording the memory datafeed is stopped. I don't know if users can investigate the root cause from the ML UI.

Here is the doc explaining ML jobs and datafeeds https://www.elastic.co/guide/en/machine-learning/8.14/ml-ad-run-jobs.html#ml-ad-datafeeds

iblancof commented 1 month ago

@crespocarlos thanks for your answer!

For the first point let's wait for Roshan's answer ✅

For the second point, I realized it's exactly the same thing that happens in the issue mentioned in a previous comment here. Users get an error even though everything is created as expected.

I think we can leave that out of the scope since there's already an issue created for that specific case.

roshan-elastic commented 1 month ago

Thanks @iblancof - great flag!

Is there any telemetry data emitted when the error shows? I'm trying to figure out how to prioritise based on how often this happens.

You can check the 'kibana-browser' XHR request for telemetry to see if there are any markers in the data we could look for to help me get numbers on this:

https://github.com/user-attachments/assets/048dd63b-3e34-4d4f-abf4-9aa1ae7a00cd.mp4

I've come across this before but it seemed to be an edge case.

It needs to be fixed though - I'm trying to figure out if this is a bug fix we can prioritise later or if this is affecting a significant number of users' ability to onboard.

roshan-elastic commented 1 month ago

FYI I'd imagine a fix for this would probably be two parts:

  1. Surface the error message we get from the ML response directly in the error message
  2. Ask the ML team to return user-friendly error messages so we don't have to deal with the burden of translating messages into actionable user-friendly messages

roshan-elastic commented 1 month ago

Oh wait @iblancof - looks like this work is being picked up by you already as part of the maintenance backlog?

In which case, don't worry about telemetry! We can think about a fix now. Gimme a few mins

roshan-elastic commented 1 month ago

OK - had time to think.

I don't think we're equipped to create custom error messages for every possible reason an ML job would fail - I think the ML team should own that. I'm not sure we even have any documentation we could point towards.

What do you both think about presenting the error messages returned by the ML job API and then asking the ML team to make these user friendly?

From a user-experience point of view, here's what I'd expect to see in the error message:

Error: {category of error}
{high-level explanation} {granular error details}
{link to docs where they can learn more about what to do}

e.g.

Error: Could not create ML job due to lack of data
The job you are trying to create does not have the relevant data. The fields fieldA and fieldB that the job relies on are not present. Please ensure you are collecting this data.

error: field A does not exist in index

Learn more

With the current error message, I feel like the best we can do is:

Error There was an error creating your ML job: {exact error message} learn more

(where 'learn more' would point to a documentation page owned by the ML team explaining the various potential error messages that users could have).
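
As a rough sketch only (the helper below is hypothetical, not existing Kibana code), that interim message could be assembled like so:

```ts
// Hypothetical formatter for the interim solution: a short generic explanation
// followed by the exact error returned by the ML APIs, plus an optional docs link.
function formatMlJobError(exactErrorMessage: string, docsUrl?: string): string {
  const base = `There was an error creating your ML job: ${exactErrorMessage}`;
  return docsUrl ? `${base} Learn more: ${docsUrl}` : base;
}
```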

At least then the user would be able to search online for the error message or raise a support ticket.

WDYT?

Hey @arisonl - see the video above.

Do you think your team has the appetite to help present user-friendly error messages in the ML job API response that we could send to users?

iblancof commented 1 month ago

Thank you for taking the time to look into this @roshan-elastic!

I apologize for not being clearer about the priorities; I picked up the task to work on it as part of my onboarding, as Nathan suggested.

For my part, I'm fine with splitting the solution into two deliverables to mitigate the error in a more practical way and in a shorter time (does this work for you too, @crespocarlos ?).

I want to clarify that there could be more than one error since we're creating several jobs, and errors could occur in any of them. Therefore, we would display a list of errors.

Besides that, I'm wondering who I can ask about the link we want to put in "learn more." I understand it's a page that already exists. Sorry for the lack of context; I'm gradually gathering information.

crespocarlos commented 1 month ago

hey @iblancof

> For my part, I'm fine with splitting the solution into two deliverables to mitigate the error in a more practical way and in a shorter time (does this work for you too, @crespocarlos?).

Sure :).

IMO this will be easier to do:

> Error There was an error creating your ML job: {exact error message} learn more

About the Learn more link, this is the best I could find https://www.elastic.co/guide/en/machine-learning/8.14/ml-ad-overview.html. Perhaps this is enough?

iblancof commented 1 month ago

For clarity regarding the list of errors I mentioned, here's an example.

This is how it would look while keeping the same UI we currently have. Does it make sense to you?
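
For reference, a rough sketch of how such a list could be rendered (component usage is illustrative; this is not the real Infra UI code):

```tsx
import React from 'react';
import { EuiCallOut, EuiSpacer } from '@elastic/eui';

interface MlSetupErrorItem {
  id: string; // job or datafeed id
  message: string; // message extracted from the ML API response
}

// One callout per failed job/datafeed, keeping the existing error UI style.
export const MlSetupErrorList: React.FC<{ errors: MlSetupErrorItem[] }> = ({ errors }) => (
  <>
    {errors.map(({ id, message }) => (
      <React.Fragment key={id}>
        <EuiCallOut title={`Error creating ${id}`} color="danger">
          <p>{message}</p>
        </EuiCallOut>
        <EuiSpacer size="s" />
      </React.Fragment>
    ))}
  </>
);
```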

(Screenshot: proposed list of error messages in the current UI)

Regarding the link, I need to investigate a bit how we are dealing with similar situations in other scenarios to understand where to place it.

crespocarlos commented 1 month ago

Hey @iblancof, that looks much better! I honestly don't think there is a need for a Learn more link in this case. From what I've seen, the docs don't have anything related to troubleshooting, but I'll leave that to @roshan-elastic.

iblancof commented 1 month ago

While reviewing the internal API docs for ML, I noticed that the example success response with errors in the Set up module endpoint doesn’t match the actual response. This mismatch is the root cause of the issue.

The code is set up to handle the data structure specified in the documentation, but that’s not what we’re receiving. That’s why we need to make adjustments on the UI side.

My point is that if we decide to fix it in the UI, we should also update the documentation to ensure it matches the actual data.

The only difference I see is that the docs show success: false with an error, while my example has success: true with an error. I'm not sure whether the error data structure could differ depending on that. Never mind - the screenshot in the initial description of the issue has success: false with the "real" structure.

Docs vs. Reality (screenshots comparing the documented example and the actual response)
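
To make the mismatch concrete, here is a sketch of the two shapes (illustrative TypeScript; field names taken from the docs example and the response above):

```ts
// Shape suggested by the internal docs example: a flat message on the error.
interface DocumentedDatafeedResult {
  id: string;
  success: boolean;
  started?: boolean;
  error?: { msg: string };
}

// Shape actually observed: a nested Elasticsearch error body plus a status code.
interface ObservedDatafeedResult {
  id: string;
  success: boolean;
  started: boolean;
  awaitingMlNodeAllocation?: boolean;
  error?: {
    error: {
      type: string;
      reason: string;
      root_cause: Array<{ type: string; reason: string }>;
    };
    status: number;
  };
}
```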

crespocarlos commented 1 month ago

@iblancof Nice findings. Could you please link the documentation you're referring to?

roshan-elastic commented 1 month ago

Hey @iblancof - no need to apologise, completely my fault!

(screenshot of the proposed error list from the comment above)

Love this! For now, without the ML team's involvement, we're best off at least transparently providing the user with the messages so they can share the error with support or Google it. I think this is a good minimum solution for now, so we should do this if we don't hear back from the ML team quickly.

> I honestly don't think there is a need for a Learn more link in this case. From what I've seen, the docs don't have anything related to troubleshooting

Yes - unless the ML team provides us with a help doc that we can link to which will help the user, let's work without pointing towards docs.

Let me reach out to the ML team to see if there's something else we could do on this now or in the longer term.

FYI @arisonl is on PTO (PM for ML) but @peteharverson might be able to point us to a contact to discuss with on the ML team.

@peteharverson - Do you know of anyone in your team who might be able to advise us on this issue?

Quick recap

What are we after? We'd like:

iblancof commented 1 month ago

> @iblancof Nice findings. Could you please link the documentation you're referring to?

Sure @crespocarlos!

I hadn't shared it before because it's internal. Here it is: https://docs.elastic.dev/ml-team/docs/ui/rest-api/ml-api#-set-up-module (the anchor does not work, you'll need to scroll a bit).

iblancof commented 1 month ago

Hey @roshan-elastic!

Since we haven't gotten a response about the link, how about we move forward with displaying the errors as shown in the screenshot? With that solution, we will be providing the user with more information than they are currently getting.

We can create separate issues for the link and for clearer user messages if the ML team decides to make any changes. I'm not sure who should be responsible for creating those issues if needed.

Wdyt?

roshan-elastic commented 1 month ago

Hey @iblancof - sounds good to me!

The PM we need is off, so it sounds sensible to proceed with the solution as proposed (i.e. forward the errors).

FYI @peteharverson @arisonl - just to make you aware we're going to present the error messages to the user instead of showing them 'unknown'.

In the future, perhaps the API could return user-friendly messages they can take action on?

At least for now, they can ask support or try to Google/GPT to find the problem.

iblancof commented 1 month ago

> While reviewing the internal API docs for ML, I noticed that the example success response with errors in the Set up module endpoint doesn’t match the actual response. This mismatch is the root cause of the issue.

> The code is set up to handle the data structure specified in the documentation, but that’s not what we’re receiving. That’s why we need to make adjustments on the UI side.

> My point is that if we decide to fix it in the UI, we should also update the documentation to ensure it matches the actual data.

Yesterday, it was confirmed that the ML team is actively working on updating their documentation to integrate with OpenAPI. This will eliminate the need for manual updates, ensuring everything stays up to date 🚀