elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.62k stars 8.22k forks source link

[Bug] JavaScript Errors on the UX page has inconsistent fields #78926

Closed vigneshshanmugam closed 4 years ago

vigneshshanmugam commented 4 years ago

Kibana version: Master branch

Browser version: All Browsers

Original install method (e.g. download page, yum, from source, etc.): from source

Describe the bug: User Experience page JavaScript Errors tab has inconsistent error fields/group compared to the APM ones.

Steps to reproduce:

  1. Click on the User Experience tab under Observability
  2. Scroll down to the JavaScript Errors section
  3. Compare the values with the existing APM errors page for same service

Expected behaviour: Both of the pages must have same consistent errors and filtering on the URL in the UX App should show correct percentage of errors.

Screenshots (if relevant):

Screenshot 2020-09-30 at 12 17 19
elasticmachine commented 4 years ago

Pinging @elastic/uptime (Team:uptime)

shahzad31 commented 4 years ago

Did you apply any filter for it or it appeared as you landed on the page?

vigneshshanmugam commented 4 years ago

The Error panel doesn't seem to be fixed still. Reopening again.

Screenshot 2020-10-15 at 16 28 09
shahzad31 commented 4 years ago

Can you please post the full steps to reproduce the issue? Full screenshot will help. Which filter/service etc is selected.

On Thu, Oct 15, 2020, 5:40 PM Vignesh Shanmugam notifications@github.com wrote:

Reopened #78926 https://github.com/elastic/kibana/issues/78926.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/elastic/kibana/issues/78926#event-3882284191, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2X3QNAEZYS5BBFHB2DWSTSK4JWFANCNFSM4R645GXQ .

vigneshshanmugam commented 4 years ago

Sure

  1. Go to User Experience App in Kibana. Check the error rate -> all looks good

    Screenshot 2020-10-15 at 18 24 05
  2. Type /blog/* in the UL filter and hit Enter.

    Screenshot 2020-10-15 at 18 24 36
Screenshot 2020-10-15 at 18 24 23
  1. You will see the Error rate going above 100%.
Kerry350 commented 4 years ago

I've been looking into this, it would seem that totalErrorPages or totalPageViews is calculated incorrectly, but it's not immediately obvious to me which, or why. This is my first time working with APM data in any real fashion, so something might be super obvious to someone else.

The UI side calculation for the error rate seems okay: ((data?.totalErrorPages ?? 0) / totalPageViews) * 100, which would suggest one (or both) of these values is incorrect. I've detailed the various ES queries below (actually these are a step before hitting ES, but apm.events sets the appropriate index and and sets the correct processor.event field).

Params passed to `apmEventClient.search` for `getErrorGroups` (errors tab of APM), for comparison. ``` json { "apm":{ "events":[ "error" ] }, "body":{ "query":{ "bool":{ "filter":[ { "term":{ "service.name":"elastic-co-frontend" } }, { "range":{ "@timestamp":{ "gte":1603204300333, "lte":1603204600333, "format":"epoch_millis" } } } ] } }, "aggs":{ "error_groups":{ "terms":{ "field":"error.grouping_key", "size":500, "order":{ "_count":"desc" } }, "aggs":{ "sample":{ "top_hits":{ "_source":[ "error.log.message", "error.exception.message", "error.exception.handled", "error.exception.type", "error.culprit", "error.grouping_key", "@timestamp" ], "sort":[ { "@timestamp":"desc" } ], "size":1 } } } } }, "size":0 } } ```
Params passed to `apmEventClient.search` for `getJsErrors` (JS errors on UX tab), without a filter. ``` json { "apm":{ "events":[ "error" ] }, "body":{ "query":{ "bool":{ "filter":[ { "range":{ "@timestamp":{ "gte":1603204443543, "lte":1603204743543, "format":"epoch_millis" } } }, { "term":{ "agent.name":"rum-js" } }, { "term":{ "transaction.type":"page-load" } }, { "term":{ "service.language.name":"javascript" } }, { "terms":{ "service.name":[ "elastic-co-frontend" ] } } ] } }, "size":0, "track_total_hits":true, "aggs":{ "totalErrorGroups":{ "cardinality":{ "field":"error.grouping_key" } }, "totalErrorPages":{ "cardinality":{ "field":"transaction.id" } }, "errors":{ "terms":{ "field":"error.grouping_key", "size":500 }, "aggs":{ "bucket_truncate":{ "bucket_sort":{ "size":5, "from":0 } }, "sample":{ "top_hits":{ "_source":[ "error.exception.message", "error.exception.type", "error.grouping_key", "@timestamp" ], "sort":[ { "@timestamp":"desc" } ], "size":1 } } } } } } } ```
With `URL` filter: ``` json { "apm":{ "events":[ "error" ] }, "body":{ "query":{ "bool":{ "filter":[ { "range":{ "@timestamp":{ "gte":1603276703829, "lte":1603277603830, "format":"epoch_millis" } } }, { "term":{ "agent.name":"rum-js" } }, { "term":{ "transaction.type":"page-load" } }, { "term":{ "service.language.name":"javascript" } }, { "terms":{ "service.name":[ "elastic-co-frontend" ] } }, { "wildcard":{ "url.full":{ "value":"*/blog/**" } } } ] } }, "size":0, "track_total_hits":true, "aggs":{ "totalErrorGroups":{ "cardinality":{ "field":"error.grouping_key" } }, "totalErrorPages":{ "cardinality":{ "field":"transaction.id" } }, "errors":{ "terms":{ "field":"error.grouping_key", "size":500 }, "aggs":{ "bucket_truncate":{ "bucket_sort":{ "size":5, "from":0 } }, "sample":{ "top_hits":{ "_source":[ "error.exception.message", "error.exception.type", "error.grouping_key", "@timestamp" ], "sort":[ { "@timestamp":"desc" } ], "size":1 } } } } } } } ```
Params passed to `apmEventClient.search` for `getClientMetrics` (Page views on UX tab), without filter ``` json { "apm":{ "events":[ "transaction" ] }, "body":{ "query":{ "bool":{ "filter":[ { "range":{ "@timestamp":{ "gte":1603209976807, "lte":1603210036808, "format":"epoch_millis" } } }, { "term":{ "transaction.type":"page-load" } }, { "exists":{ "field":"transaction.marks.navigationTiming.fetchStart" } }, { "terms":{ "service.name":[ "elastic-co-frontend" ] } } ] } }, "size":0, "aggs":{ "pageViews":{ "value_count":{ "field":"transaction.duration.us" } }, "backEnd":{ "percentiles":{ "field":"transaction.marks.agent.timeToFirstByte", "percents":[ 50 ], "hdr":{ "number_of_significant_value_digits":3 } } }, "domInteractive":{ "percentiles":{ "field":"transaction.marks.agent.domInteractive", "percents":[ 50 ], "hdr":{ "number_of_significant_value_digits":3 } } } } } } ```
With `URL` filter: ``` json { "apm":{ "events":[ "transaction" ] }, "body":{ "query":{ "bool":{ "filter":[ { "range":{ "@timestamp":{ "gte":1603276703829, "lte":1603277603830, "format":"epoch_millis" } } }, { "term":{ "transaction.type":"page-load" } }, { "exists":{ "field":"transaction.marks.navigationTiming.fetchStart" } }, { "wildcard":{ "url.full":{ "value":"*/blog/**" } } }, { "terms":{ "service.name":[ "elastic-co-frontend" ] } } ] } }, "size":0, "aggs":{ "pageViews":{ "value_count":{ "field":"transaction.duration.us" } }, "backEnd":{ "percentiles":{ "field":"transaction.marks.agent.timeToFirstByte", "percents":[ 50 ], "hdr":{ "number_of_significant_value_digits":3 } } }, "domInteractive":{ "percentiles":{ "field":"transaction.marks.agent.domInteractive", "percents":[ 50 ], "hdr":{ "number_of_significant_value_digits":3 } } } } } } ```

totalErrorPages is derived from a cardinality aggregation on transaction.id, and totalPageViews is derived from a value_count aggregation on transaction.duration.us. These derivations seem okay (although I am curious why the page views don't use distinct transaction.id values?).

Total errors gets based off the hit count for the outer query for getJsErrors and this value looks to be accurate, it correlates with what the APM > Errors tab has for sum(occurences) in the table.

There are possibly some problems that could arise from these queries, but they don't seem to actually be causing any problems in this case. 1) Wildcard patterns start with a *, from the docs: Avoid beginning patterns with * or ?. This can increase the iterations needed to find matching terms and slow search performance. and 2) Cardinality aggregation counts are approximate.

There is some thinking that the URL filter is potentially not working properly, but as far as I can see it's applied correctly in both queries and url.full looks to be the field we should be using. I think maybe the URL filtering is a red herring (and was correctly fixed in the PR originally set to address this). */blog/* in this instance may well just be surfacing incorrect values easily.

Thinking more on it I do wonder if ((data?.totalErrorPages ?? 0) / totalPageViews) * 100 is actually the correct calculation. The totalErrorPages query pulls from the errors index, and filters on processor.event set to error. And the totalPageViews query pulls from the transactions index, and filters on processor.event set to transaction. So we could say this is error events / transaction events, for a true error percentage rate, should this not be error events / total events (total events I guess being the sum of errors and transactions)?

My knowledge of exactly what the RUM agent sends and when in the course of a page load type is very murky. It would be helpful to know exactly which events are sent - transaction vs error - during the page load. I see that for APM > Transaction error rate the outcome field of the transaction events is used for determination, although this is obviously different to "Page views that errored".

hmdhk commented 4 years ago

@Kerry350 , thanks for the great write up.

The calculation ((data?.totalErrorPages ?? 0) / totalPageViews) * 100 seems correct to me. We shouldn't use the total events (sum of error and transaction events) for totalPageViews.

Furthermore, the queries seems correct as well. I agree using cardinality of transaction.id might be better compared to transaction.duration.us (I'm not sure why we didn't use that though), although I assume it would yield the same result (might be worth checking).

If all the above is correct, then there might be a problem in the data, but I'm not sure what could cause this issue.

hmdhk commented 4 years ago

Having another look at the queries I've found these two issues:

  1. checking for existence of transaction.marks.navigationTiming.fetchStart would reduce the number of total pages considered in the error rate. This condition should be removed for calculating error rate but I think it's necessary for other aggregations :
                {
                   "exists":{
                      "field":"transaction.marks.navigationTiming.fetchStart"
                   }
                }
  1. Using cardinality instead of value_count actually results in different values. This might be due to using estimation for value_count.

Applying the above two fixes to the queries posted above. I get totalErrorPages: 284 and pageViews: 284, which would result in error rate of 100% while before I was getting 138% . These two issues should be fixed but there might be other issues as well.

Kerry350 commented 4 years ago

@jahtalab Thank you for looking at this.

checking for existence of transaction.marks.navigationTiming.fetchStart would reduce the number of total pages considered in the error rate. This condition should be removed for calculating error rate but I think it's necessary for other aggregations

I must admit I skirted past this as there was a comment in the code above it saying Adding this filter to cater for some inconsistent rum data not available on aggregated transactions, so I figured it was important / needed for everything. I'll remove this from the top level filter, so that it doesn't affect pageViews, and instead add an extra filter aggregation, the percentile aggs can then be nested under there.

Using cardinality instead of value_count actually results in different values. This might be due to using estimation for value_count.

Oh, okay. I assume to use cardinality we'd want to switch transaction.duration.us for transaction.id as we'd now be dealing with distinct values, rather than number of extracted values?

shahzad31 commented 4 years ago

To get page views count, I think a better solution I have been keeping in mind to refactor this, is to use track_total_hits on query and just use hits total value. That will be more accurate here.

On Wed, Oct 21, 2020 at 6:20 PM Kerry Gallagher notifications@github.com wrote:

@jahtalab https://github.com/jahtalab Thank you for looking at this.

checking for existence of transaction.marks.navigationTiming.fetchStart would reduce the number of total pages considered in the error rate. This condition should be removed for calculating error rate but I think it's necessary for other aggregations

I must admit I skirted past this as there was a comment in the code above it saying Adding this filter to cater for some inconsistent rum data not available on aggregated transactions, so I figured it was important / needed for everything. I'll remove this from the top level filter, so that it doesn't affect pageViews, and instead add an extra filter aggregation, the percentile aggs can then be nested under there.

Using cardinality instead of value_count actually results in different values. This might be due to using estimation for value_count.

Oh, okay. I assume to use cardinality we'd want to switch transaction.duration.us for transaction.id as we'd now be dealing with distinct values, rather than number of extracted values?

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/elastic/kibana/issues/78926#issuecomment-713694056, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2X3QK53SJ2567NMSEMDTDSL4C5DANCNFSM4R645GXQ .

hmdhk commented 4 years ago

@Kerry350 , that's correct using cardinality on transaction.id is the option I tried (forgot to mention that).

Kerry350 commented 4 years ago

Great, thanks both 👍

Kerry350 commented 4 years ago

I've opened this draft PR #81512 which adds the fixes discussed here (making sure transaction.marks.navigationTiming.fetchStart doesn't exclude documents from the page views value, and using the hits value of the outer query to determine the page views rather than an aggregation).

Whilst this does improve the error rate, I'm still seeing a higher than 100% error rate. In the specific range I'm looking at the rate decreased from 108% to 106%, which was a discrepancy of around 50 documents (that would make it 100%, so still probably inaccurate even then).

After this the query that handles page views looks like so (the errors one remains unchanged), this is the query that actually hits ES after apmEventClient.search is done:

{
   "index":[
      "apm-*"
   ],
   "body":{
      "query":{
         "bool":{
            "filter":[
               {
                  "range":{
                     "@timestamp":{
                        "gte":1603376807878,
                        "lte":1603379689350,
                        "format":"epoch_millis"
                     }
                  }
               },
               {
                  "term":{
                     "transaction.type":"page-load"
                  }
               },
               {
                  "wildcard":{
                     "url.full":{
                        "value":"*/blog/**"
                     }
                  }
               },
               {
                  "terms":{
                     "service.name":[
                        "elastic-co-frontend"
                     ]
                  }
               },
               {
                  "terms":{
                     "processor.event":[
                        "transaction"
                     ]
                  }
               },
               {
                  "range":{
                     "observer.version_major":{
                        "gte":7
                     }
                  }
               }
            ]
         }
      },
      "size":0,
      "track_total_hits":true,
      "aggs":{
         "backEnd":{
            "percentiles":{
               "field":"transaction.marks.agent.timeToFirstByte",
               "percents":[
                  50
               ],
               "hdr":{
                  "number_of_significant_value_digits":3
               }
            }
         },
         "domInteractive":{
            "percentiles":{
               "field":"transaction.marks.agent.domInteractive",
               "percents":[
                  50
               ],
               "hdr":{
                  "number_of_significant_value_digits":3
               }
            }
         }
      }
   },
   "ignore_throttled":true
}

At this point, bar basic time and type filters, the queries are fairly lean (insofar as what actually derives page views, ignoring the percentiles stuff).

Two things:

@jahtalab I'm at a bit of a loss at this point as to where the inaccuracy stems from 🤔

shahzad31 commented 4 years ago

@jahtalab @vigneshshanmugam there is definitely data inconsistency.

There are errors events with transaction.type page-load , which doesn't have corresponding page load transaction events,

for example on https://dev-next-oblt.elastic.dev/

if you search with transaction.id : "430bfbfb4d8d30fb" , it has an error event but there is no transaction event.

This is why we are getting more error events compared to transaction events.

Is this an expected behavior? if not, definitely a bug in apm-server or rum agent.

paulb-elastic commented 4 years ago

As discussed on the call the fix for this that we will get in to the 7.10 release will be: