elastic / apm-server

https://www.elastic.co/guide/en/apm/guide/current/index.html
Other
1.22k stars 519 forks source link

[RUM] Extend context.page for allowing new metrics as part of transaction #3657

Closed vigneshshanmugam closed 3 years ago

vigneshshanmugam commented 4 years ago

Currently RUM agent sends page specific information for all managed transactions as part of the transaction context. eg :

transaction.context.page =  {
    referer: "blah"
    url: "http://example.com"
}

some of the metrics like Memory, Network information(4g, 3g), Hardware concurrency(whether the user is on low/high end device) can only be added to context.custom which feels not so intuitive. I would like to propose we find a common schema for adding these navigation level metrics as part of the transaction. It would help with some ideas like

These are some of the metrics I thought we could send as part of the page-context.

{
  page: {
      referer: document.referrer,
      url: window.location.href,
      networkInfo: {
        effectiveType: '4g',
        rtt: 50,
        downlink: 10,
        saveData: false
      },
      hardwareConcurrency: 4, // cores
      deviceMemory: 4 (float), // A full list of possible values is as follows: 0.25, 0.5, 1, 2, 4, 8
      servedViaServiceWorker: "yes"  // ["yes", "no", "unsupported/not applicable"] 
    }
} 

Does allowing * in the page context solve the use case for RUM since its already used only by the RUM agent ?

axw commented 4 years ago

Initial thoughts:

The networkInfo, hardwareConcurrency, and deviceMemory fields seem orthogonal to "page" context. They might only be included when you send a "page load" transaction, but I don't think we should embed them there.

servedViaServiceWorker: '' // Page served from service worker cache

Why is this a string type - should it be a boolean? If not, what does the value represent?

vigneshshanmugam commented 4 years ago

They might only be included when you send a "page load" transaction, but I don't think we should embed them there.

Not really, We need that info for all transactions including page-load, route-change, click etc and for errors as well. As it would help with creating dashboards and filter for these information to show relevant events.

networkInfo sounds like fields that could extend the ECS network fieldset. It might be worth creating a proposal about this in https://github.com/elastic/ecs, as it seems like it would have relevance beyond APM/RUM.

yeah some fields can be added to the Network fieldset in ECS, but save_data and effective_type seems not so useful to the ECS Context as its exposed only by the browser.

servedViaServiceWorker

Still exploring whether we want a boolean or something like a service worker status whether the page is controlled, uncontrolled (Supported but not served by SW) or notsupported.

vigneshshanmugam commented 4 years ago

@axw @simitt Can i get some inputs on this? Would like to get this in 7.8 if possible.

Also the above is Chrome only API and naming might get modified in future. I am totally not sure if we should create a proposal in ECS to proceed with this change.

axw commented 4 years ago

They might only be included when you send a "page load" transaction, but I don't think we should embed them there.

Not really, We need that info for all transactions including page-load, route-change, click etc and for errors as well. As it would help with creating dashboards and filter for these information to show relevant events.

Right, that just reinforces the point I was trying to make: those metrics are properties of the browser, not the transaction/error/whatever, so perhaps we should send them independently.

Anyway, I'd personally like to see more information about the motivation and desired outcomes for doing this. Having a mockup of how it would be used would help understand the requirements.

vigneshshanmugam commented 4 years ago

Right, that just reinforces the point I was trying to make: those metrics are properties of the browser, not the transaction/error/whatever, so perhaps we should send them independently.

Interesting take. I would be interested to move these page related date to metadata as well and extend it there instead of tying it to transaction, errors etc and potentially this duplicated data would help reduce the payload in general. Do you already have an approach in mind?

Anyway, I'd personally like to see more information about the motivation and desired outcomes for doing this. Having a mockup of how it would be used would help understand the requirements.

For sure, @drewpost Do we have some mockups on using the Network information for filtering the page loads?

jalvz commented 4 years ago

Any updates here? What are next steps? Do we have a clear idea of what should be aligned with ECS, and what not? Is it enough to spec these in V3, or they are needed in V2 as well? If V3 only: would make sense to spec these as metricsets, (maybe not under samples, but possibly introduce eg. a top level browser field)? ...

jalvz commented 4 years ago

@elastic/apm-agent-rum ping

hmdhk commented 4 years ago

@jalvz

@drewpost , Would you please provide some context around where this data might be used (e.g. dashboard mockups)? This affects our decision about putting these on transaction vs metrics!

vigneshshanmugam commented 4 years ago

I think the metricsets idea is worth considering (instead of using metadata)

Not sure if it would be good to put inside metricsets. The data on its own is not quite so useful unless we tie it to the page-load or other transaction events. But in any case, @drewpost We are waiting for your mockups to proceed here.

drewpost commented 4 years ago

Hey everyone - sorry I'm late to the party. So - the user context here is that these particular data points are essential for dissecting the performance data we receive in RUM. If we imagine this scenario: We get a page load duration of 10 seconds. On it's own, this is slow but I need more information to understand what I can do to bring that down. Is it that I have a high proportion of visitors on low powered devices? or on high latency mobile connections? etc.

Primary use would be to filter the beacons used to calculate a series. ie show me the first contentful paint duration for all users on low powered devices running Android (and maybe compare it to another series with the same data but for high powered devices)

Compare Time Series FIltered

Secondary use is as a breakdown ie I want to breakdown a load distribution by network connection

Compare Distribution FIltered

We would also feed this into a commonality analysis when trying to understand the root cause of a detected anomaly. Scenario - your page load time spikes suddenly vs normal and it triggers an anomaly. What's causing it? It could be something technically wrong or it could be something as simple as the marketing team putting out a big promotional campaign during the evening commute home and all of a sudden you had an influx of traffic from Instagram on mobile devices on high latency mobile connections which resulted in a slower page load time owing to the device and connection profile of the traffic on your site at that time.

axw commented 4 years ago

Thanks @jahtalab, @vigneshshanmugam, and @drewpost. It's clear to me now that they should be part of the transaction docs. I think we'll want to do something similar for serverless eventually.

We might still want to align some of the field names with the transaction-independent metrics that backend agents collect - @jalvz will follow up on this.

jalvz commented 4 years ago

related: https://github.com/elastic/apm-server/issues/3912

vigneshshanmugam commented 4 years ago

@jalvz @axw Can we assume that these fields will be supported in the 7.10 and can add it to the milestone?

axw commented 4 years ago

@vigneshshanmugam I would say that is highly unlikely.

First https://github.com/elastic/apm/issues/288 needs to be completed. Then the new, non-RUM specific network.* fields will be proposed as additions in ECS. That will take some time.

We could perhaps get the two highly RUM-specific fields, page.saveData and page.servedViaServiceWorker. The system.cpu.cores and system.memory.total fields are preexisting so we can just add them into the intake schema for RUM and serverless.

vigneshshanmugam commented 4 years ago

Thanks for the details @axw, Network info is the main thing we need from RUM perspective and other fields can wait and we can do it together if it's going to take longer.

axw commented 3 years ago

@elastic/apm-agent-rum if this is still important, please drive the ECS/data model discussion and reopen this when there's some resolution. Closing for now - otherwise this will likely just linger forever.

vigneshshanmugam commented 3 years ago

We had a discussion internally with the Synthetics team. As its already pointed out in different comments https://github.com/elastic/apm-server/issues/3657#issuecomment-641950527. Most of the fields does not really fit in the ECS field definition for the networkInfo except for the rtt time and would work only for the RUM/Synthetics events as they are browser specific (chrome only).

With that in mind, we would like to experiment first by adding these fields in the APM server schema and move it to the ECS format if we see more products using it rather than starting it from the ECS side . /cc @elastic/apm-agent-rum @andrewvc @drewpost @paulb-elastic

axw commented 3 years ago

@vigneshshanmugam OK. More than anything I want to ensure the data model is sensible, not specifically that it's under ECS. Even if it's experimental, there's a cost we need to carry in the APM Server code base. If the data model is incoherent (e.g. a page having concurrency) then it immediately becomes tech debt -- and one we need to carry for a long time for backwards compatibility.

Can you please address some open questions I have?

vigneshshanmugam commented 3 years ago

@axw Totally! and ++ to the all the points you mentioned about the tech debt. RUM is a different use-case and i am not 100% sure if we could always find a common ground but will definitely improve things along with Synthetics and other mobile APM agents.

Does servedViaServiceWorker need to be a string? Should it be an bool instead? (false, true, unset = n/a)

Let's remove it for this iteration. I am going to wait a bit longer on this one. The reason being

There might be another usecase, but will open a new issue when we see it fit.

Can we use the existing field name system.cpu.cores instead of page.hardwareConcurrency? Can we use the existing field name system.memory.total instead of page.deviceMemory

I am with you there. We can use the system fields for that. Do you know how we can incorporate that in our RUM schema?

Can we nest networkInfo under a new browser object field, rather than page?

I like browser. Another idea is to nest them under navigator which would align with where data is extracted from (https://developer.mozilla.org/en-US/docs/Web/API/Navigator). Works for service worker as well.

I will update the description once we finalize on the schema here.

axw commented 3 years ago

Can we use the existing field name system.cpu.cores instead of page.hardwareConcurrency? Can we use the existing field name system.memory.total instead of page.deviceMemory

I am with you there. We can use the system fields for that. Do you know how we can incorporate that in our RUM schema?

Cool. I'll put up a draft PR soon where we can discuss how this would look in the intake API. My current thinking is that they would go in the system metadata.

I like browser. Another idea is to nest them under navigator which would align with where data is extracted from (https://developer.mozilla.org/en-US/docs/Web/API/Navigator). Works for service worker as well.

I'm not familiar enough with the nomenclature here to make a call. "Navigator" sounds a bit like technical jargon to me. Is "browser" incorrect?

vigneshshanmugam commented 3 years ago

Cool. I'll put up a draft PR soon where we can discuss how this would look in the intake API. My current thinking is that they would go in the system metadata.

Sounds good.

I'm not familiar enough with the nomenclature here to make a call. "Navigator" sounds a bit like technical jargon to me. Is "browser" incorrect?

Browser feels a bit generic which is why i suggested navigator, cant think of any better names at the moment. Let me come back to you on this with other suggestions. We can keep it browser for now.