Closed by roshan-elastic 1 month ago
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)
@vinaychandrasekhar FYI (cc @grabowskit)
Hey @smith,
I captured this as a separate issue - I don't think it's as high priority as the serverless issue, but I wanted to capture it on the backlog.
As it's related, I wanted to share it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
While testing to reproduce the error, I encountered two different issues:
https://github.com/user-attachments/assets/4ef66f93-ae25-46a3-b15a-616bc11a1c81
1 - The error message that is received is not displayed correctly (the issue specified in this ticket)
It appears that the UI expects the error in the response to have this shape so it can be parsed and displayed:

```json
"error": { "msg": "error message here" }
```

However, it is received in a different format, so no message is displayed because the UI cannot parse it:
"error": {
"error": {
"root_cause": [
{
"type": "status_exception",
"reason": "[datafeed-kibana-metrics-ui-default-default-hosts_memory_usage] cannot retrieve field [system.memory.actual.used.pct] because it has no mappings"
}
],
"type": "status_exception",
"reason": "[datafeed-kibana-metrics-ui-default-default-hosts_memory_usage] cannot retrieve field [system.memory.actual.used.pct] because it has no mappings"
},
"status": 400
}
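To make the mismatch concrete, here is a minimal sketch (not the actual Kibana code; the interface and helper names are ours) of handling both the documented `{ "msg": ... }` shape and the nested Elasticsearch shape actually returned:

```typescript
// Shape the UI reportedly expects, per the internal docs.
interface DocumentedError {
  msg: string;
}

// Shape actually returned, as shown in the response above.
interface EsWrappedError {
  error: {
    type: string;
    reason: string;
    root_cause?: Array<{ type: string; reason: string }>;
  };
  status: number;
}

// Hypothetical helper: pull a displayable message out of either shape.
function extractErrorMessage(err: unknown): string | undefined {
  if (typeof err !== 'object' || err === null) return undefined;
  const e = err as Partial<DocumentedError> & Partial<EsWrappedError>;
  // Documented shape: { "msg": "error message here" }
  if (typeof e.msg === 'string') return e.msg;
  // Observed shape: { "error": { "reason": "..." }, "status": 400 }
  if (e.error && typeof e.error.reason === 'string') return e.error.reason;
  return undefined;
}
```

A parser like this would at least surface the `reason` string instead of showing nothing when the documented shape never arrives.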
Questions

- Which kind of message would we like to show? The raw one in the `reason` field? Do we use the `type` field to have user-friendly messages? In that case, what should it be?

2 - It shows an error even though everything has been created correctly
From what I see, all `jobs` come back as `success: true`, and the `datafeeds` as well, but one of them, the one with the error, has `started: false`.
It seems that the UI returns an error because it finds one in the response (with `started: false`), but apparently everything has been executed correctly. If you reopen the view or visit the list of jobs, they all appear to be created.
This is the response:
```json
{
  "jobs": [
    {
      "id": "kibana-metrics-ui-default-default-hosts_memory_usage",
      "success": true
    },
    {
      "id": "kibana-metrics-ui-default-default-hosts_network_in",
      "success": true
    },
    {
      "id": "kibana-metrics-ui-default-default-hosts_network_out",
      "success": true
    }
  ],
  "datafeeds": [
    {
      "id": "datafeed-kibana-metrics-ui-default-default-hosts_memory_usage",
      "success": true,
      "started": false,
      "error": {
        "error": {
          "root_cause": [
            {
              "type": "status_exception",
              "reason": "[datafeed-kibana-metrics-ui-default-default-hosts_memory_usage] cannot retrieve field [system.memory.actual.used.pct] because it has no mappings"
            }
          ],
          "type": "status_exception",
          "reason": "[datafeed-kibana-metrics-ui-default-default-hosts_memory_usage] cannot retrieve field [system.memory.actual.used.pct] because it has no mappings"
        },
        "status": 400
      }
    },
    {
      "id": "datafeed-kibana-metrics-ui-default-default-hosts_network_in",
      "success": true,
      "started": true,
      "awaitingMlNodeAllocation": false
    },
    {
      "id": "datafeed-kibana-metrics-ui-default-default-hosts_network_out",
      "success": true,
      "started": true,
      "awaitingMlNodeAllocation": false
    }
  ],
  "kibana": {}
}
```
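A response like the one above can be scanned for problem datafeeds even when `success` is `true`. A minimal sketch (the interface and helper names are ours, not Kibana's) might look like:

```typescript
// Shape of one datafeed entry in the setup response above (simplified).
interface SetupDatafeed {
  id: string;
  success: boolean;
  started: boolean;
  error?: { error?: { reason?: string }; status?: number };
}

// Hypothetical helper: collect a message for every datafeed that carries an
// error or failed to start, regardless of its success flag.
function collectDatafeedProblems(datafeeds: SetupDatafeed[]): string[] {
  return datafeeds
    .filter((d) => !d.success || !d.started || d.error !== undefined)
    .map((d) => `${d.id}: ${d.error?.error?.reason ?? 'not started'}`);
}
```

This mirrors the situation described: the memory datafeed would be reported (`started: false` plus an error payload) while the two network datafeeds would pass through silently.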
Questions

- Why does one `datafeed` appear as not started while the related `job` is marked as successful?
- Can we trust that all `jobs` are successful? (I don't know if that information is reliable, because the `datafeed` showing the error is also marked as successful.)
- We need to review `job` and `datafeed` statuses and errors and understand what is happening to provide meaningful information to our users.

@roshan-elastic I forgot to mention you in the previous comment.
I shared two situations I came across while trying to reproduce the bug and I've left a few questions that I think you might be able to answer to help us move forward with the issue.
@iblancof thanks for the detailed evidence.
> Which kind of message would we like to show? The raw one in the `reason` field? Do we use the `type` field to have user-friendly messages? In that case what should it be?
I think it would be a friendlier message but somehow present the error details so that users can understand what happened and try to solve the problem. wdyt @roshan-elastic?
> It shows an error even though everything has been created correctly
This UI is very simple. It could also tell users if one or more jobs had problems or are stopped. It looks like in your recording the memory datafeed is stopped. I don't know if users can investigate the root cause from the ML UI.
Here is the doc explaining ML jobs and datafeeds https://www.elastic.co/guide/en/machine-learning/8.14/ml-ad-run-jobs.html#ml-ad-datafeeds
@crespocarlos thanks for your answer!
For the first point let's wait for Roshan's answer ✅
For the second point, I realized it's exactly what happens in this issue, which was mentioned in a previous comment here. Users get an error even if everything is created as expected.
I think we can leave that out of the scope since there's already an issue created for that specific case.
Thanks @iblancof - great flag!
Is there any telemetry data emitted when the error shows? I'm trying to figure out how to prioritise based on how often this happens.
You can check the 'kibana-browser' XHR request for telemetry to see if there are any kind of markers we can look for in the data to help me get the numbers on this:
https://github.com/user-attachments/assets/048dd63b-3e34-4d4f-abf4-9aa1ae7a00cd.mp4
I've come across this before but it seemed to be an edge case.
It needs to be fixed though - I'm trying to figure out if this is a bug fix we can prioritise later or if this is affecting a significant number of users' ability to onboard.
FYI I'd imagine a fix for this would probably be two parts:
Oh wait @iblancof - looks like this work is being picked up by you already as part of the maintenance backlog?
In which case, don't worry about telemetry! We can think about a fix now. Gimme a few mins
OK - had time to think.
I don't think we're equipped to create custom error messages for every possible reason an ML job would fail - I think the ML team should own that. I'm not sure we even have any documentation we could point towards.
What do you both think about presenting the error messages returned by the ML job API and then asking the ML team to make these user friendly?
From a user experience, here's what I'd expect to see in the error message:
Error
{category of error}
{high-level explanation}
{granular error details}
{link to docs where they can learn more about what to do}
e.g.
Error
Could not create ML job due to lack of data
The job you are trying to create does not have the relevant data. The fields `fieldA` and `fieldB` that the job relies on are not present. Please ensure you are collecting this data.
error: field A does not exist in index
With the current error message, I feel like the best we can do is:
Error There was an error creating your ML job: {exact error message} learn more
(where 'learn more' would point to a documentation page owned by the ML team explaining the various potential error messages that users could have).
At least then the user would be able to search online for the error message or raise a support ticket.
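The minimal format proposed above could be sketched roughly like this (the helper name and copy are illustrative, not the actual Kibana strings):

```typescript
// Hypothetical helper: turn raw API error reasons into the proposed
// "There was an error creating your ML job: {exact error message}" format.
function formatMlSetupErrors(reasons: string[]): string[] {
  // The nested response often repeats the same reason in root_cause and at
  // the top level, so deduplicate before presenting a list to the user.
  const unique = Array.from(new Set(reasons));
  return unique.map(
    (reason) => `There was an error creating your ML job: ${reason}`
  );
}
```

Since several jobs and datafeeds are created in one request, this returns a list, matching the multi-error case discussed later in the thread.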
WDYT?
Hey @arisonl - see the video above.
Do you think your team has appetite to help present user friendly error messages in the ML job API response that we could send to users?
Thank you for taking the time to look into this @roshan-elastic!
I apologize for not being clearer about the priorities; I picked up the task to work on it as part of my onboarding, as Nathan suggested.
For my part, I'm fine with splitting the solution into two deliverables to mitigate the error in a more practical way and in a shorter time (does this work for you too, @crespocarlos ?).
I want to clarify that there could be more than one error since we're creating several jobs, and errors could occur in any of them. Therefore, we would display a list of errors.
Besides that, I'm wondering who I can ask about the link we want to put in "learn more." I understand it's a page that already exists. Sorry for the lack of context; I'm gradually gathering information.
hey @iblancof
> For my part, I'm fine with splitting the solution into two deliverables to mitigate the error in a more practical way and in a shorter time (does this work for you too, @crespocarlos ?).
Sure :).
IMO this will be easier to do
> Error There was an error creating your ML job: {exact error message} learn more
About the Learn more link, this is the best I could find https://www.elastic.co/guide/en/machine-learning/8.14/ml-ad-overview.html. Perhaps this is enough?
For clarity regarding the list of errors I mentioned, here's an example.
This is how it would look while keeping the same UI we currently have. Does it make sense to you?
Regarding the link, I need to investigate a bit how we are dealing with similar situations in other scenarios to understand where to place it.
Hey @iblancof, that looks much better! I honestly don't think there is a need for a Learn more link in this case. From what I've seen, the docs don't have anything related to troubleshooting, but I'll leave that to @roshan-elastic.
While reviewing the internal API docs for ML, I noticed that the example success response with errors in the Set up module endpoint doesn’t match the actual response. This mismatch is the root cause of the issue.
The code is set up to handle the data structure specified in the documentation, but that’s not what we’re receiving. That’s why we need to make adjustments on the UI side.
My point is that if we decide to fix it in the UI, we should also update the documentation to ensure it matches the actual data.
~~The only difference I see is that the docs reflect a `success: false` with the "real" structure.~~ Nevermind, the screenshot I see in the initial description of the issue has `success: false` with error and my example is `success: true` with error. Not sure if we could have a different error data structure depending on that.

(Side-by-side screenshots comparing the "Docs" response structure with the "Reality" response structure.)
@iblancof Nice findings. Could you please link the documentation you're referring to?
Hey @iblancof - no need to apologise, completely my fault!
Love this! For now, without working with the ML team we're best off at least transparently providing the user with messages so they could share the error with support/google it. I think this is a good minimum solution for now so we should do this if we don't hear back from the ML team quite quickly.
> I honestly don't think there is a need for a Learn more link in this case. From what I've seen, the docs don't have anything related to troubleshooting
Yes, unless the ML team provide us with a help doc that we can link to which will help the user - let's work without pointing towards docs.
Let me reach out to the ML team to see if there's something else we could do on this now or in the longer term.
FYI @arisonl is on PTO (PM for ML) but @peteharverson might be able to point us to a contact to discuss with on the ML team.
@peteharverson - Do you know of anyone in your team who might be able to advise us on this issue?
Quick recap

- We currently don't pass the error message returned by the internal APIs to the user (see screenshot): this is confusing for users, as they have no idea how to troubleshoot.
- We're proposing transparently passing the error messages from the API to the end user (see screenshot): this is so that users can at least try to search for a solution or ask support for help with the specific errors.

What are we after? We'd like:
> @iblancof Nice findings. Could you please link the documentation you're referring to?
Sure @crespocarlos!
I hadn't shared it before because it's internal. Here it is: https://docs.elastic.dev/ml-team/docs/ui/rest-api/ml-api#-set-up-module (the anchor does not work, you'll need to scroll a bit).
Hey @roshan-elastic!
Since we haven't gotten a response about the link, how about we move forward with displaying the errors as shown in the screenshot? With that solution, we will be providing the user with more information than they are currently getting.
We can create separate issues for the link and for clearer user messages if the ML team decides to make any changes. I'm not sure who should be responsible for creating those issues if needed.
Wdyt?
Hey @iblancof - sounds good to me!
The PM we need is off so sounds sensible to proceed with the solution as proposed (i.e. forward the errors).
FYI @peteharverson @arisonl - just to make you aware we're going to present the error messages to the user instead of showing them 'unknown'.
In the future, perhaps the API could return back user-friendly messages they can take action on?
At least for now, they can ask support or try to Google/GPT to find the problem.
> While reviewing the internal API docs for ML, I noticed that the example success response with errors in the Set up module endpoint doesn't match the actual response. This mismatch is the root cause of the issue.
> The code is set up to handle the data structure specified in the documentation, but that's not what we're receiving. That's why we need to make adjustments on the UI side.
> My point is that if we decide to fix it in the UI, we should also update the documentation to ensure it matches the actual data.
Yesterday, it was confirmed that the ML team is actively working on updating their documentation to integrate with OpenAPI. This will eliminate the need for manual updates, ensuring everything stays up to date 🚀
Description
When there is an error creating a host anomaly detection job, the error messaging presents no information to the user to help them resolve the issue.
For example, the following error is presented when there aren't enough shards to create the job:
Example generic error message
API response
Example video of workflow
https://github.com/elastic/kibana/assets/117740680/26e1daa1-a7c1-4d30-a181-4ff80d883989
Expectation
It is expected that a user-friendly error state is returned with directions on how to resolve the issue.