Closed vigneshshanmugam closed 4 years ago
Pinging @elastic/uptime (Team:uptime)
Did you apply any filter for it or it appeared as you landed on the page?
The Error panel doesn't seem to be fixed still. Reopening again.
Can you please post the full steps to reproduce the issue? Full screenshot will help. Which filter/service etc is selected.
On Thu, Oct 15, 2020, 5:40 PM Vignesh Shanmugam notifications@github.com wrote:
Reopened #78926 https://github.com/elastic/kibana/issues/78926.
— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/elastic/kibana/issues/78926#event-3882284191, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2X3QNAEZYS5BBFHB2DWSTSK4JWFANCNFSM4R645GXQ .
Sure
Go to User Experience App
in Kibana. Check the error rate -> all looks good
Type /blog/*
in the UL filter and hit Enter
.
I've been looking into this, it would seem that totalErrorPages
or totalPageViews
is calculated incorrectly, but it's not immediately obvious to me which, or why. This is my first time working with APM data in any real fashion, so something might be super obvious to someone else.
The UI side calculation for the error rate seems okay: ((data?.totalErrorPages ?? 0) / totalPageViews) * 100
, which would suggest one (or both) of these values is incorrect. I've detailed the various ES queries below (actually these are a step before hitting ES, but apm.events
sets the appropriate index and and sets the correct processor.event
field).
totalErrorPages
is derived from a cardinality aggregation on transaction.id
, and totalPageViews
is derived from a value_count
aggregation on transaction.duration.us
. These derivations seem okay (although I am curious why the page views don't use distinct transaction.id
values?).
Total errors
gets based off the hit count for the outer query for getJsErrors
and this value looks to be accurate, it correlates with what the APM > Errors tab has for sum(occurences) in the table.
There are possibly some problems that could arise from these queries, but they don't seem to actually be causing any problems in this case. 1) Wildcard patterns start with a *
, from the docs: Avoid beginning patterns with * or ?. This can increase the iterations needed to find matching terms and slow search performance.
and 2) Cardinality aggregation counts are approximate.
There is some thinking that the URL filter is potentially not working properly, but as far as I can see it's applied correctly in both queries and url.full
looks to be the field we should be using. I think maybe the URL filtering is a red herring (and was correctly fixed in the PR originally set to address this). */blog/*
in this instance may well just be surfacing incorrect values easily.
Thinking more on it I do wonder if ((data?.totalErrorPages ?? 0) / totalPageViews) * 100
is actually the correct calculation. The totalErrorPages
query pulls from the errors index, and filters on processor.event
set to error
. And the totalPageViews
query pulls from the transactions index, and filters on processor.event
set to transaction
. So we could say this is error events
/ transaction events
, for a true error percentage rate, should this not be error events
/ total events
(total events I guess being the sum of errors and transactions)?
My knowledge of exactly what the RUM agent sends and when in the course of a page load type is very murky. It would be helpful to know exactly which events are sent - transaction vs error - during the page load. I see that for APM > Transaction error rate the outcome
field of the transaction events is used for determination, although this is obviously different to "Page views that errored".
@Kerry350 , thanks for the great write up.
The calculation ((data?.totalErrorPages ?? 0) / totalPageViews) * 100
seems correct to me. We shouldn't use the total events (sum of error and transaction events) for totalPageViews
.
Furthermore, the queries seems correct as well. I agree using cardinality of transaction.id
might be better compared to transaction.duration.us
(I'm not sure why we didn't use that though), although I assume it would yield the same result (might be worth checking).
If all the above is correct, then there might be a problem in the data, but I'm not sure what could cause this issue.
Having another look at the queries I've found these two issues:
transaction.marks.navigationTiming.fetchStart
would reduce the number of total pages considered in the error rate. This condition should be removed for calculating error rate but I think it's necessary for other aggregations : {
"exists":{
"field":"transaction.marks.navigationTiming.fetchStart"
}
}
cardinality
instead of value_count
actually results in different values. This might be due to using estimation for value_count. Applying the above two fixes to the queries posted above. I get totalErrorPages: 284
and pageViews: 284
, which would result in error rate of 100% while before I was getting 138% . These two issues should be fixed but there might be other issues as well.
@jahtalab Thank you for looking at this.
checking for existence of transaction.marks.navigationTiming.fetchStart would reduce the number of total pages considered in the error rate. This condition should be removed for calculating error rate but I think it's necessary for other aggregations
I must admit I skirted past this as there was a comment in the code above it saying Adding this filter to cater for some inconsistent rum data not available on aggregated transactions
, so I figured it was important / needed for everything. I'll remove this from the top level filter, so that it doesn't affect pageViews
, and instead add an extra filter aggregation, the percentile aggs can then be nested under there.
Using cardinality instead of value_count actually results in different values. This might be due to using estimation for value_count.
Oh, okay. I assume to use cardinality
we'd want to switch transaction.duration.us
for transaction.id
as we'd now be dealing with distinct values, rather than number of extracted values?
To get page views count, I think a better solution I have been keeping in mind to refactor this, is to use track_total_hits on query and just use hits total value. That will be more accurate here.
On Wed, Oct 21, 2020 at 6:20 PM Kerry Gallagher notifications@github.com wrote:
@jahtalab https://github.com/jahtalab Thank you for looking at this.
checking for existence of transaction.marks.navigationTiming.fetchStart would reduce the number of total pages considered in the error rate. This condition should be removed for calculating error rate but I think it's necessary for other aggregations
I must admit I skirted past this as there was a comment in the code above it saying Adding this filter to cater for some inconsistent rum data not available on aggregated transactions, so I figured it was important / needed for everything. I'll remove this from the top level filter, so that it doesn't affect pageViews, and instead add an extra filter aggregation, the percentile aggs can then be nested under there.
Using cardinality instead of value_count actually results in different values. This might be due to using estimation for value_count.
Oh, okay. I assume to use cardinality we'd want to switch transaction.duration.us for transaction.id as we'd now be dealing with distinct values, rather than number of extracted values?
— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/elastic/kibana/issues/78926#issuecomment-713694056, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2X3QK53SJ2567NMSEMDTDSL4C5DANCNFSM4R645GXQ .
@Kerry350 , that's correct using cardinality
on transaction.id
is the option I tried (forgot to mention that).
Great, thanks both 👍
I've opened this draft PR #81512 which adds the fixes discussed here (making sure transaction.marks.navigationTiming.fetchStart
doesn't exclude documents from the page views value, and using the hits value of the outer query to determine the page views rather than an aggregation).
Whilst this does improve the error rate, I'm still seeing a higher than 100% error rate. In the specific range I'm looking at the rate decreased from 108% to 106%, which was a discrepancy of around 50 documents (that would make it 100%, so still probably inaccurate even then).
After this the query that handles page views looks like so (the errors one remains unchanged), this is the query that actually hits ES after apmEventClient.search
is done:
{
"index":[
"apm-*"
],
"body":{
"query":{
"bool":{
"filter":[
{
"range":{
"@timestamp":{
"gte":1603376807878,
"lte":1603379689350,
"format":"epoch_millis"
}
}
},
{
"term":{
"transaction.type":"page-load"
}
},
{
"wildcard":{
"url.full":{
"value":"*/blog/**"
}
}
},
{
"terms":{
"service.name":[
"elastic-co-frontend"
]
}
},
{
"terms":{
"processor.event":[
"transaction"
]
}
},
{
"range":{
"observer.version_major":{
"gte":7
}
}
}
]
}
},
"size":0,
"track_total_hits":true,
"aggs":{
"backEnd":{
"percentiles":{
"field":"transaction.marks.agent.timeToFirstByte",
"percents":[
50
],
"hdr":{
"number_of_significant_value_digits":3
}
}
},
"domInteractive":{
"percentiles":{
"field":"transaction.marks.agent.domInteractive",
"percents":[
50
],
"hdr":{
"number_of_significant_value_digits":3
}
}
}
}
},
"ignore_throttled":true
}
At this point, bar basic time and type filters, the queries are fairly lean (insofar as what actually derives page views, ignoring the percentiles stuff).
Two things:
I double checked just tracking the hit count was the same as cardinality of transaction.id
, there was at most a drift of 1 here for me, I would assume this is due to cardinality being approximate.
I switched totalErrorPages
to use a terms
aggregation with a ridiculously high size
just to see if it matched what cardinality
was producing, these were the same.
@jahtalab I'm at a bit of a loss at this point as to where the inaccuracy stems from 🤔
@jahtalab @vigneshshanmugam there is definitely data inconsistency.
There are errors events with transaction.type
page-load
, which doesn't have corresponding page load transaction events,
for example on https://dev-next-oblt.elastic.dev/
if you search with transaction.id : "430bfbfb4d8d30fb" , it has an error event but there is no transaction event.
This is why we are getting more error events compared to transaction events.
Is this an expected behavior? if not, definitely a bug in apm-server or rum agent.
As discussed on the call the fix for this that we will get in to the 7.10 release will be:
Error rate
(label and value) from the JavaScript errors
sectionpage-load
so that all errors are shown
Kibana version: Master branch
Browser version: All Browsers
Original install method (e.g. download page, yum, from source, etc.): from source
Describe the bug: User Experience page
JavaScript Errors
tab has inconsistent error fields/group compared to the APM ones.Steps to reproduce:
Observability
JavaScript Errors
sectionExpected behaviour: Both of the pages must have same consistent errors and filtering on the URL in the
UX App
should show correct percentage of errors.Screenshots (if relevant):