elastic / kibana

[Obs AI Assistant] Improve LLM re-rank relevance and nuance #180118

Open. miltonhultgren opened this issue 3 months ago

miltonhultgren commented 3 months ago

Summary

Given a certain knowledge base, we've seen the LLM fail to give high scores to documents we believe are highly relevant to the prompt. We suspect this is partly because the LLM is given no examples of what counts as relevant and what doesn't (a hypothetical sketch of how such examples could be added follows the current system prompt below).

We also see that, although we ask for a score within a range, the LLM tends to score either very high (7) or very low (0). We suspect this is because the prompt only describes the two ends of the scale and never anchors the values in between (versus adding "3 being somewhat relevant, 5 being very relevant"), so the model has little sense of what the intermediate scores mean.
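For illustration, the scale description could anchor the intermediate values explicitly. The wording below is only a hypothetical sketch (written as a TypeScript template literal to match the test data), not a settled proposal:

const scaleDescription = `
  Score each document on a scale from 0 to 7:
  - 0: completely irrelevant to the question
  - 3: somewhat relevant; touches on the topic but adds little to the answer
  - 5: very relevant; directly helps answer the question
  - 7: extremely relevant; contains information essential to the answer
`;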

AC

Given the below knowledge base and prompts, we expect the LLM to:

Open questions

Test data

Current system prompt:

Given the following question, score the documents that are relevant to the question. on a scale from 0 to 7,
    0 being completely irrelevant, and 7 being extremely relevant. Information is relevant to the question if it helps in
    answering the question. Judge it according to the following criteria:

    - The document is relevant to the question, and the rest of the conversation
    - The document has information relevant to the question that is not mentioned,
      or more detailed than what is available in the conversation
    - The document has a high amount of information relevant to the question compared to other documents
    - The document contains new information not mentioned before in the conversation
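To probe the first suspicion in the summary, a couple of worked examples could also be appended to this prompt. A minimal sketch, assuming a currentPrompt variable holding the text above; the example question, document snippets, and labels are invented for illustration:

const fewShotExamples = `
  Example question: "Is there a runbook for the paymentservice?"
  Document: "Runbook for paymentservice errors: restart the pod, then page the on-call engineer." -> score 7 (directly answers the question)
  Document: "Dashboard tracking CPU usage across all hosts." -> score 0 (unrelated to the question)
`;

const systemPrompt = `${currentPrompt}\n\n${fewShotExamples}`;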

Knowledge base:

[
  {
    id: 'github_elastic',
    text: `{"text":"Issue URL: https://github.com/elastic/demos/issues/3688\\n\\nThe cartservice occasionally encounters storage errors due to an unreliable network connection. \\n\\nThe errors typically indicate a failure to connect to Redis, as seen in the error message:\\n\\n'Status(StatusCode=\\"FailedPrecondition\\", Detail=\\"Can't access cart storage. \\nSystem.ApplicationException: Wasn't able to connect to redis \\n  at cartservice.cartstore.RedisCartStore.EnsureRedisConnected() in /usr/src/app/src/cartstore/RedisCartStore.cs:line 104 \\n  at cartservice.cartstore.RedisCartStore.EmptyCartAsync(String userId) in /usr/src/app/src/cartstore/RedisCartStore.cs:line 168')'.\\n\\nI just talked to the SRE team in Slack, this is a known issue and they have plans to implement retries as a quick fix and address the network issue later."}`,
    score: 10
  },
  {
    id: 'github',
    text: `{"text":"Issue URL: https://github.com/elastic/demos/issues/3688\\n\\nThe cartservice occasionally encounters storage errors due to an unreliable network connection. \\n\\nThe errors typically indicate a failure to connect to Redis, as seen in the error message:\\n\\n'Status(StatusCode=\\"FailedPrecondition\\", Detail=\\"Can't access cart storage. \\nSystem.ApplicationException: Wasn't able to connect to redis \\n  at cartservice.cartstore.RedisCartStore.EnsureRedisConnected() in /usr/src/app/src/cartstore/RedisCartStore.cs:line 104 \\n  at cartservice.cartstore.RedisCartStore.EmptyCartAsync(String userId) in /usr/src/app/src/cartstore/RedisCartStore.cs:line 168')'.\\n\\nI just talked to the SRE team in Slack, this is a known issue and they have plans to implement retries as a quick fix and address the network issue later."}`,
    score: 10
  },
  {
    id: 'elastic/observability-aiops/ai_assistant/runbooks/slos/slo-apm-latency.md',
    text: `{"mode":"100644","path":"ai_assistant/runbooks/slos/slo-apm-latency.md","extension":".md","size":4798,"name":"slo-apm-latency.md","id":"elastic/observability-aiops/ai_assistant/runbooks/slos/slo-apm-latency.md","type":"blob","body":"# Handling SLO Burn Rate for High APM Latency in Checkout for in AI App **Description:** This runbook provides instructions for diagnosing and resolving alerts related to the Service Level Objective (SLO) burn rate for high APM latency on the Checkout in AI App service in the Elastic Stack. **Prerequisites:** - Access to the server where the Checkout in AI App service is hosted. - Basic knowledge of of APM, the Checkout in AI App service's architecture and the Elastic Stack. - Access to Kibana where APM data is indexed. - Access to SLO and APM data for Checkout in AI App in Kibana. **Steps:** **1. Check SLO Dashboard in Kibana:** - Go to the [SLO Dashboard for High APM Latency in Checkout](https://fd0d609de698443cbf2e151f920714ad.europe-west1.gcp.cloud.es.io:9243/app/observability/slos/d7605640-8f98-11ee-b80a-893cbe41d028) to verify the burn rate alert. The dashboard should provide a visual representation of the SLI of high APM latency for the \\"Checkout in AI App\\" service. Identify the time period when the APM latency started to increase. **2. Check APM in Kibana:** - Navigate to [Checkout in AI App service](https://fd0d609de698443cbf2e151f920714ad.europe-west1.gcp.cloud.es.io:9243/app/apm/services/checkoutService/overview?comparisonEnabled=true&environment=ENVIRONMENT_ALL&kuery=&latencyAggregationType=avg&offset=1d&rangeFrom=now-3h&rangeTo=now&serviceGroup=&transactionType=request) in APM in Kibana. - Check the latency graph for any noticeable spikes or high latency periods. - If there is a high latency issue, it will most likely be visible here. **3. Identify the Cause:** - Look for clues in the APM data that may indicate the root cause of the high latency. - Common causes include: - Slow Transactions: The APM app can help you identify specific transactions that may be causing high latency in your service - Application Errors: The APM app can help you identify errors in your service. You can view the error rate for your service and see the details of individual errors. This can help you identify if errors in your service are causing high latency. - Heavy Load: The APM app can help you identify if your service is under heavy load. You can view the transaction rate for your service to see if there's a spike in traffic. You can also view the response times for your transactions to see if they're increasing, which could indicate that your service is struggling to handle the load. - Dependencies on Downstream Services: The Service Map in the APM app can help you identify dependencies on downstream services. If a downstream service is slow or experiencing errors, it could affect the performance of your service. You can view the performance of downstream services in the Service Map to identify any issues. - Resource Exhaustion: The instances where the service is running might be running out of resources like CPU, memory, or disk space. Explore the related Instances to find - Network Issues: There could be network issues between the different layers of your application. For example, if your service needs to communicate with a database or another service to process a request, network issues could slow down this communication and increase latency. **4. 
Verify APM Configuration:** - Review your APM configuration to ensure it's correctly configured to monitor the Checkout in Checkout in AI App service. - Verify that the APM server is correctly receiving data from the Checkout in Checkout in AI App service. **5. Check Network Connectivity:** - Verify that there are no network connectivity issues between the Checkout in Checkout in AI App service and the APM server. - Ensure that firewalls or security groups are not blocking traffic. **6. Restart Application:** - If no specific issues are found, try restarting the Checkout in Checkout in AI App service to see if it resolves the high latency issue. **7. Document Findings:** - Record the details of the issue, actions taken, and the resolution in a centralized documentation system for future reference in to our incident management system. **8. Preventive Measures:** - Consider implementing preventive measures such as load balancing, redundancy, and automated failover to minimize the impact of future high latency issues. **9. Escalation:** - If the issue persists and cannot be resolved through the steps in this runbook, escalate it to relevant team members or system administrators with expertise in APM and server infrastructure. This runbook provides a structured approach to diagnosing and resolving high APM latency issues for the Checkout in Checkout in AI App service. It helps ensure that troubleshooting is systematic and that relevant information is documented for future reference."}`,
    score: 10
  },
  {
    id: 'elastic/observability-aiops/ai_assistant/runbooks/slos/cartservice-runbook.md',
    text: '{"mode":"100644","path":"ai_assistant/runbooks/slos/cartservice-runbook.md","extension":".md","size":165,"name":"cartservice-runbook.md","id":"elastic/observability-aiops/ai_assistant/runbooks/slos/cartservice-runbook.md","type":"blob","body":"This is the runbook for the cartservice-otel error. If the cartservice is experiencing errors, do the following: - call Luca, he will fix it - grab a cup of coffee"}',
    score: 10
  },
  {
    id: 'github_kb',
    text: '{"text":"Hello Assistant, when explaining an error or exception and you find a relevant github issue, make sure to include the link to it in your response and summarize its contents.\\n\\ncartservice-otel error"}',
    score: 10
  },
  {
    id: 'apm',
    text: '{"text":"When asked:\\nDid this incident have any impact on my business on June 7 2022? Look at the checkoutService?\\n\\nUse the get_apm_services list function followed by the get_apm_timeseries function with the\\"service.environment\\":  \\"prod\\". Never use the \\"service.environment\\": \\"production\\" in either function"}',
    score: 10
  },
  {
    id: 'elastic/observability-aiops/ai_assistant/runbooks/slos/slo-apm-availability.md',
    text: `{"mode":"100644","path":"ai_assistant/runbooks/slos/slo-apm-availability.md","extension":".md","size":5481,"name":"slo-apm-availability.md","id":"elastic/observability-aiops/ai_assistant/runbooks/slos/slo-apm-availability.md","type":"blob","body":"# Handling SLO Burn Rate for Low APM Availability in Checkout for AI App **Description:** This runbook provides instructions for diagnosing and resolving alerts related to the Service Level Objective (SLO) burn rate for low APM availability on the AI App service in the Elastic Stack. In the context of Application Performance Monitoring (APM), availability is a measure of the proportion of successful requests to the total number of requests made. This measure gives an idea of how often the application is able to successfully process requests. A high availability rate indicates that the application is functioning as expected most of the time, while a low availability rate could indicate issues with the application. Unsuccessfully processed requests are typically represented by HTTP status codes in the 5xx range. These status codes indicate server errors, meaning the server failed to fulfill an apparently valid request. These status codes indicate that the server was unable to process the request, which is why they are considered as unsuccessful requests. **Prerequisites:** - Access to the server where the AI App service is hosted. - Basic knowledge of APM, the AI App service's architecture and the Elastic Stack. - Access to Kibana where APM data is indexed. - Access to SLO and APM data for AI App in Kibana. **Steps:** **1. Check SLO Dashboard in Kibana:** - Go to the [SLO Dashboard Low APM Availability in Checkout](https://fd0d609de698443cbf2e151f920714ad.europe-west1.gcp.cloud.es.io:9243/app/observability/slos/19bae0f0-8f99-11ee-b80a-893cbe41d028) to verify the burn rate alert. The dashboard should provide a visual representation of the SLI of low APM availability for the Checkout in AI App service. Identify the time period when the APM availability started to decrease. **2. Check APM in Kibana:** - Navigate to [Checkout AI App service](https://fd0d609de698443cbf2e151f920714ad.europe-west1.gcp.cloud.es.io:9243/app/apm/services/checkoutService/overview?comparisonEnabled=true&environment=ENVIRONMENT_ALL&kuery=&latencyAggregationType=avg&offset=1d&rangeFrom=now-3h&rangeTo=now&serviceGroup=&transactionType=request) in APM in Kibana. - Check the availability graph for any noticeable drops or low availability periods. - If there is a low availability issue, it will most likely be visible here. **3. Identify the Cause:** - Look for clues in the APM data that may indicate the root cause of the low availability. - Steps to follow: - Identify the Error: The first step is to identify the specific error that is causing the decrease in availability. This can be done by examining the error messages and status codes in the APM data. - Analyze the Error: Once the error has been identified, the next step is to analyze it. This involves understanding what the error means and how it might be affecting the application's performance. For example, a 503 Service Unavailable error might indicate that the server is overloaded or down for maintenance. - Trace the Error: After understanding the error, the next step is to trace it back to its source. This can be done by examining the application's transaction data in APM. This data can show which transactions were affected by the error and can help identify where in the application the error occurred. 
- Examine the Application's Dependencies: If the error is related to a service that the application depends on (like a database or an external API), it's important to examine the performance of these dependencies. APM's Service Map feature can be useful for this. It shows how different services are connected and can help identify if a downstream service is causing the error. - Check the Application's Resource Usage: If the application is running out of resources (like CPU, memory, or disk space), this could be causing the error. APM can provide information about the application's resource usage, which can help identify if this is the problem. **4. Verify APM Configuration:** - Review your APM configuration to ensure it's correctly configured to monitor the Checkout in AI App service. - Verify that the APM server is correctly receiving data from the Checkout in AI App service. **5. Check Network Connectivity:** - Verify that there are no network connectivity issues between the Checkout in AI App service and the APM server. - Ensure that firewalls or security groups are not blocking traffic. **6. Restart Application:** - If no specific issues are found, try restarting the Checkout in AI App service to see if it resolves the low availability issue. **7. Document Findings:** - Record the details of the issue, actions taken, and the resolution in a centralized documentation system for future reference in to our incident management system. **8. Preventive Measures:** - Consider implementing preventive measures such as load balancing, redundancy, and automated failover to minimize the impact of future low availability issues. **9. Escalation:** - If the issue persists and cannot be resolved through the steps in this runbook, escalate it to relevant team members or system administrators with expertise in APM and server infrastructure. This runbook provides a structured approach to diagnosing and resolving low APM availability issues for the Checkout in AI App service. It helps ensure that troubleshooting is systematic and that relevant information is documented for future reference."}`,
    score: 10
  },
  {
    id: 'elastic/observability-aiops/ai_assistant/runbooks/slos/slo-hosts-high-cpu-usage.md',
    text: '{"mode":"100644","path":"ai_assistant/runbooks/slos/slo-hosts-high-cpu-usage.md","extension":".md","size":3309,"name":"slo-hosts-high-cpu-usage.md","id":"elastic/observability-aiops/ai_assistant/runbooks/slos/slo-hosts-high-cpu-usage.md","type":"blob","body":"# Handling SLO Burn Rate for Disk Space Usage **Description:** This runbook provides instructions for diagnosing and resolving alerts related to the Service Level Objective (SLO) burn rate for the metric of remaining disk space reaching 90% (`system.filesystem.used.pct`) on the hosts infrastructure of the Elastic AI App. **Prerequisites:** - Access to the server where the Elastic AI App is hosted. - Basic knowledge of the Elastic AI App\'s architecture and the Elastic Stack. - Access to Kibana where application metrics and logs are indexed. - Access to SLO and the \\"Elastic AI App Metrics\\" dashboard in Kibana. **Steps:** **1. Check SLO Dashboard in Kibana:** - Go to the [SLO Disk Space Dashboard](https://fd0d609de698443cbf2e151f920714ad.europe-west1.gcp.cloud.es.io:9243/app/observability/slos/3029c9b0-8f98-11ee-b80a-893cbe41d028) to verify the burn rate alert. The dashboard should provide a visual representation of the SLI of `system.filesystem.used.pct < 80` for the hosts infrastructure of the app. Identify the time period when the disk space usage started to increase. **2. Check \\"Elastic AI App Metrics\\" Dashboard in Kibana:** - Go to the [Elastic AI App Metrics](https://fd0d609de698443cbf2e151f920714ad.europe-west1.gcp.cloud.es.io:9243/app/dashboards#/view/734b7920-8f91-11ee-b80a-893cbe41d028?_g=(filters:!(),refreshInterval:(pause:!f,value:60000),time:(from:now-7d,to:now))) dashboard which tracks the metrics `system.cpu.user.pct`, `system.load.1`, `system.memory.actual.used.pct`, `system.filesystem.used.pct`, `host.network.ingress.bytes` and `host.network.egress.bytes`. - Use the Anomaly Detection embeddable that shows anomalies to identify possible anomalies in the rest of the metrics. **3. Identify the Cause:** - Look for clues in the metrics that may indicate the root cause of the increased disk space usage. Common causes include: - High CPU usage (`system.cpu.user.pct`). - High system load (`system.load.1`). - High memory usage (`system.memory.actual.used.pct`). - High network ingress or egress (`host.network.ingress.bytes` and `host.network.egress.bytes`). **4. Check Resource Usage and Network Connectivity:** - Monitor the resource usage (CPU, memory, disk) on the server to ensure there are no resource bottlenecks causing the high disk space usage. - Verify that there are no network connectivity issues that might be causing high network ingress or egress. **5. Document Findings:** - Record the details of the issue, actions taken, and the resolution in a centralized documentation system for future reference in our incident management system. **6. Preventive Measures:** - Consider implementing preventive measures such as load balancing, redundancy, and automated failover to minimize the impact of future SLO burn rate alerts. **7. Escalation:** - If the issue persists and cannot be resolved through the steps in this runbook, escalate it to relevant team members or system administrators with expertise in the Elastic AI App and the Elastic Stack. This runbook provides a structured approach to diagnosing and resolving SLO burn rate alerts for disk space usage in your system. It helps ensure that troubleshooting is systematic and that relevant information is documented for future reference."}',
    score: 10
  },
  {
    id: 'elastic/observability-aiops/ai_assistant/runbooks/slos/slo-productCatalogService.md',
    text: `{"mode":"100644","path":"ai_assistant/runbooks/slos/slo-productCatalogService.md","extension":".md","size":5030,"name":"slo-productCatalogService.md","id":"elastic/observability-aiops/ai_assistant/runbooks/slos/slo-productCatalogService.md","type":"blob","body":"# Runbook for High Latency Alert in APM productCatalogService **Description:** This runbook provides instructions for diagnosing and resolving alerts related to high latency in the productCatalogService in the Hipster Shop application using APM. **Prerequisites:** - Access to the server where the productCatalogService is hosted. - Basic knowledge of APM, the Hipster Shop application's architecture, and the Elastic Stack. - Access to Kibana where APM data is indexed. - Access to APM data for productCatalogService in Kibana. - When asked about productCatalogService, APM functions should be used directly without calling get_dataset_info **Steps:** 1. **Check SLO Dashboard in Kibana:** - Go to the [SLO Dashboard for APM latency in productCatalogService](https://fd0d609de698443cbf2e151f920714ad.europe-west1.gcp.cloud.es.io:9243/app/observability/slos/1b010960-9045-11ee-b80a-893cbe41d028) to verify the burn rate alert. The dashboard should provide a visual representation of the SLI of the latency for the productCatalogService service. Identify the time period when the latency started to increase. 2. **Check Latency:** - Ask the AI Assistant to get a time series of the latency for the productCatalogService. Check the latency graph for any noticeable spikes or high latency periods, rf there is a high latency issue, it will most likely be visible here. - Navigate to [productCatalogService](https://fd0d609de698443cbf2e151f920714ad.europe-west1.gcp.cloud.es.io:9243/app/apm/services/productCatalogService/overview?comparisonEnabled=true&environment=ENVIRONMENT_ALL&kuery=&latencyAggregationType=avg&offset=1d&rangeFrom=now-15m&rangeTo=now&serviceGroup=&transactionType=request) in APM in Kibana to see more details of the service. 3. **Identify the Cause:** - Look for clues in the APM data that may indicate the root cause of the high latency. - Common causes include: - Slow Transactions: The APM app can help you identify specific transactions that may be causing high latency in your service. - Use the AI Assistant to find specific transactions that may be contributing to a higher latency - Application Errors: The APM app can help you identify errors in your service. You can view the error rate for your service and see the details of individual errors. This can help you identify if errors in your service are causing high latency. - Use the AI Assistant to get detailed information about any errors that have occurred in the service. - Heavy Load: The APM app can help you identify if your service is under heavy load. You can view the transaction rate for your service to see if there's a spike in traffic. You can also view the response times for your transactions to see if they're increasing, which could indicate that your service is struggling to handle the load. - Use the AI Assistant to get a summary of the service, including transaction and error rates. - Dependencies on Downstream Services: The Service Map in the APM app can help you identify dependencies on downstream services. If a downstream service is slow or experiencing errors, it could affect the performance of your service. You can view the performance of downstream services in the Service Map to identify any issues. 
- Use the AI Assistant to identify any downstream services that the productCatalogService depends on, and check their performance. - Resource Exhaustion: The instances where the service is running might be running out of resources like CPU, memory, or disk space. Explore the related Instances to find out. 4. **Verify APM Configuration:** - Review your APM configuration to ensure it's correctly configured to monitor the \\"productCatalogService\\". - Verify that the APM server is correctly receiving data from the \\"productCatalogService\\". 5. **Check Network Connectivity:** - Verify that there are no network connectivity issues between the \\"productCatalogService\\" and the APM server. - Ensure that firewalls or security groups are not blocking traffic. 6. **Restart Application:** - If no specific issues are found, try restarting the \\"productCatalogService\\" to see if it resolves the high latency issue. 7. **Document Findings:** - Record the details of the issue, actions taken, and the resolution in a centralized documentation system for future reference. 8. **Preventive Measures:** - Consider implementing preventive measures such as load balancing, redundancy, and automated failover to minimize the impact of future high latency issues. 9. **Escalation:** - If the issue persists and cannot be resolved through the steps in this runbook, escalate it to relevant team members or system administrators with expertise in APM and server infrastructure. This runbook provides a structured approach to diagnosing and resolving high latency issues for the \\"productCatalogService\\". It helps ensure that troubleshooting is systematic and that relevant information is documented for future reference."}`,
    score: 10
  }
]

Prompt:

I'm an SRE. I am looking at an exception and trying to understand what it means. Your task is to describe what the error means and what it could be caused by. The error occurred on a service called cartservice-otel, which is a service written in dotnet. The runtime version is . The request it occurred for is called oteldemo.CartService/EmptyCart.

Scores:

{
  github_elastic: 7,
  github: 7,
  'elastic/observability-aiops/ai_assistant/runbooks/slos/slo-apm-latency.md': 0,
  'elastic/observability-aiops/ai_assistant/runbooks/slos/cartservice-runbook.md': 1,
  github_kb: 0,
  apm: 0,
  'elastic/observability-aiops/ai_assistant/runbooks/slos/slo-apm-availability.md': 0,
  'elastic/observability-aiops/ai_assistant/runbooks/slos/slo-hosts-high-cpu-usage.md': 0,
  'elastic/observability-aiops/ai_assistant/runbooks/slos/slo-productCatalogService.md': 0
}

Prompt:

Is there a runbook for the cartservice-otel?

Scores:

{
  github_elastic: 1,
  github: 1,
  'elastic/observability-aiops/ai_assistant/runbooks/slos/slo-apm-latency.md': 0,
  'elastic/observability-aiops/ai_assistant/runbooks/slos/cartservice-runbook.md': 7,
  github_kb: 0,
  apm: 0,
  'elastic/observability-aiops/ai_assistant/runbooks/slos/slo-apm-availability.md': 0,
  'elastic/observability-aiops/ai_assistant/runbooks/slos/slo-hosts-high-cpu-usage.md': 0,
  'elastic/observability-aiops/ai_assistant/runbooks/slos/slo-productCatalogService.md': 0
}
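To make it easier to tell whether prompt changes actually improve these numbers, the expected scores for this test data could be checked automatically. A minimal sketch in TypeScript; scoreDocuments is a placeholder for whatever function asks the LLM to score the knowledge base against a prompt, not an existing Kibana API:

// Hypothetical helper: compares LLM scores against expected scores and measures
// how strongly the results cluster at the extremes of the 0-7 scale.
async function evaluateScoring(
  scoreDocuments: (prompt: string) => Promise<Record<string, number>>,
  prompt: string,
  expected: Record<string, number>
): Promise<{ extremeRatio: number; meanAbsoluteError: number }> {
  const actual = await scoreDocuments(prompt);
  const ids = Object.keys(expected);
  // Share of scores sitting at 0 or 7; a high value reproduces the
  // "very high or very low" behaviour described in the summary.
  const scores = ids.map((id) => actual[id] ?? 0);
  const extremeRatio = scores.filter((s) => s === 0 || s === 7).length / ids.length;
  // Average distance from the scores we would expect a human rater to give.
  const meanAbsoluteError =
    ids.reduce((sum, id) => sum + Math.abs((actual[id] ?? 0) - expected[id]), 0) / ids.length;
  return { extremeRatio, meanAbsoluteError };
}

Lower values for both metrics across the two prompts above would indicate better nuance and relevance.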
elasticmachine commented 3 months ago

Pinging @elastic/obs-knowledge-team (Team:obs-knowledge)

miltonhultgren commented 3 months ago

@marcogavaz @almudenasanz Would it be possible for you to help us answer the open question about the sample knowledge base?