Closed rviscomi closed 2 years ago
I'm still wondering if there is value in quantity over quality. Even lightweight, is it useful to know which properties are used most often?
For some of the sensitive ones, like window.navigator, I could see there being more value in a more-expensive hook that gets the call stack and counts occurrences by script origin. (new Error()).stack
Should do it from what I can tell but you wouldn't want to do that for every method. In the case of navigator, it could identify the analytics scripts, etc instead of the origins and see if there is a smaller set of companies that would be impacted by anonymization efforts.
I reduced the amount of noise in my local tests by renaming the WPT metric name to 00_reset
rather than just reset
. But there's still some stuff getting picked up, presumably caused by WPT itself:
{
"call_stacks": {
"navigator.userAgent": []
},
"performance.clearMarks": 1,
"performance.clearMeasures": 1,
"performance.getEntriesByType": 5,
"performance.timing": 9,
"performance.timing.domInteractive": 1,
"performance.timing.domContentLoadedEventStart": 1,
"performance.timing.domContentLoadedEventEnd": 1,
"performance.timing.domComplete": 1,
"performance.timing.loadEventStart": 1,
"performance.timing.loadEventEnd": 1,
"Array.prototype.indexOf": 182,
"String.prototype.match": 50,
"String.prototype.substring": 20
}
@pmeenan I assume there's not much we can do about those besides subtracting after the fact.
@rviscomi we may be able to move the WPT metrics collection to be after running the custom metrics but that will open it up to a little risk of custom metrics breaking the WPT metrics (mostly marks/measures if a custom metric does anything destructive)
For our private instance, how feasible would it be for WPT code to execute between toggling httparchive_enable_observations
? Not a big deal as long as we have a consistent way to mask out the WPT observations, but not sure about "consistent".
I moved things around a bit to minimize the amount of agent logic between the scripts being injected and the custom metrics.
{
"call_stacks": {
"navigator.userAgent": []
},
"Array.prototype.indexOf": 182,
"String.prototype.match": 50,
"String.prototype.substring": 20
}
Not sure where those uses are coming from but it's possible some of them are from the injected script itself.
Added them to the props to trace 😄
const PROPERTIES_TO_TRACE = new Set([
'navigator.userAgent',
'String.prototype.substring',
'String.prototype.match',
'Array.prototype.indexOf'
]);
{
"call_stacks": {
"navigator.userAgent": [],
"Array.prototype.indexOf": {
"Error\n at Array.get (<anonymous>:99:19)\n at WptAgentFlatten (<anonymous>:32:50)\n at WptAgentReportPerformanceTiming (<anonymous>:50:90)\n at PerformanceObserver.<anonymous> (<anonymous>:55:5)": 5,
"Error\n at Array.get (<anonymous>:99:19)\n at WptAgentFlatten (<anonymous>:16:75)\n at WptAgentReportPerformanceTiming (<anonymous>:50:90)\n at PerformanceObserver.<anonymous> (<anonymous>:55:5)": 1,
"Error\n at Array.get (<anonymous>:99:19)\n at WptAgentFlatten (<anonymous>:32:50)\n at WptAgentFlatten (<anonymous>:17:33)\n at WptAgentReportPerformanceTiming (<anonymous>:50:90)\n at PerformanceObserver.<anonymous> (<anonymous>:55:5)": 11,
"Error\n at Array.get (<anonymous>:99:19)\n at WptAgentFlatten (<anonymous>:16:75)\n at WptAgentFlatten (<anonymous>:17:33)\n at WptAgentReportPerformanceTiming (<anonymous>:50:90)\n at PerformanceObserver.<anonymous> (<anonymous>:55:5)": 165
},
"String.prototype.match": {
"Error\n at String.get (<anonymous>:99:19)\n at WptAgentFlatten (<anonymous>:40:22)\n at WptAgentReportPerformanceTiming (<anonymous>:50:90)\n at PerformanceObserver.<anonymous> (<anonymous>:55:5)": 9,
"Error\n at String.get (<anonymous>:99:19)\n at WptAgentFlatten (<anonymous>:40:22)\n at WptAgentFlatten (<anonymous>:17:33)\n at WptAgentReportPerformanceTiming (<anonymous>:50:90)\n at PerformanceObserver.<anonymous> (<anonymous>:55:5)": 41
},
"String.prototype.substring": {
"Error\n at String.get (<anonymous>:99:19)\n at WptAgentFlatten (<anonymous>:33:33)\n at WptAgentReportPerformanceTiming (<anonymous>:50:90)\n at PerformanceObserver.<anonymous> (<anonymous>:55:5)": 5,
"Error\n at String.get (<anonymous>:99:19)\n at WptAgentFlatten (<anonymous>:36:49)\n at WptAgentReportPerformanceTiming (<anonymous>:50:90)\n at PerformanceObserver.<anonymous> (<anonymous>:55:5)": 5,
"Error\n at String.get (<anonymous>:99:19)\n at WptAgentFlatten (<anonymous>:33:33)\n at WptAgentFlatten (<anonymous>:17:33)\n at WptAgentReportPerformanceTiming (<anonymous>:50:90)\n at PerformanceObserver.<anonymous> (<anonymous>:55:5)": 5,
"Error\n at String.get (<anonymous>:99:19)\n at WptAgentFlatten (<anonymous>:36:49)\n at WptAgentFlatten (<anonymous>:17:33)\n at WptAgentReportPerformanceTiming (<anonymous>:50:90)\n at PerformanceObserver.<anonymous> (<anonymous>:55:5)": 5
}
},
"Array.prototype.indexOf": 182,
"String.prototype.match": 50,
"String.prototype.substring": 20
}
Ahh, looks like it is all from the injected performance observer that gets the LCP and CLS element details at the time of the event.
We can probably update the injected script to either disable/re-enable measurement when it is handling an event or bypass the overrides to call the native methods directly.
Could that execute after custom metrics? 🤔
No, it has to execute while the page is loading so it can get the state of the LCP element at the time the event fires.
Tweaking the inject script to remove as many instrumented methods as possible (switching to comparisons insted of indexof, etc). I'm down to one substring call now.
I wouldn't worry too much about that since this list of observers may grow in the future. I think we'll need a way to detect when a feature is used directly by WPT and ignore it. It'd add a lot of overhead but we can get call stacks for every feature and look for WPT keywords. WDYT?
For now we're in good shape. Updated test
{"call_stacks":{"navigator.userAgent":[]}}
I prefer that we be careful about the methods that we instrument to make sure there is enough of a ROI/use case to measure them before we talk about adding more overhead to each event because adding call stacks to every string and array method access is going to add SIGNIFICANT overhead to some sites.
I like what we did with navigator.userAgent because we have an explicit interest from the privacy analytics for seeing what specific scripts are fingerprinting.
On the call-stacks side of things, I think we want to do some agent-side processing on those as well to pull out just the top-level origin of the URL of the calling script (maybe in addition to the full stack) so it is easy to query.
I prefer that we be careful about the methods that we instrument to make sure there is enough of a ROI/use case to measure them before we talk about adding more overhead to each event because adding call stacks to every string and array method access is going to add SIGNIFICANT overhead to some sites.
I like what we did with navigator.userAgent because we have an explicit interest from the privacy analytics for seeing what specific scripts are fingerprinting.
On the call-stacks side of things, I think we want to do some agent-side processing on those as well to pull out just the top-level origin of the URL of the calling script (maybe in addition to the full stack) so it is easy to query.
@pmeenan if you're still ok with the full stack in the results, I think we can add some shared functions on the SQL side to make processing easier. With that, does this PR LGTY?
Good suggestion, thanks. Added.
Progress on:
https://github.com/HTTPArchive/almanac.httparchive.org/issues/2881 https://github.com/HTTPArchive/almanac.httparchive.org/issues/2890 https://github.com/HTTPArchive/almanac.httparchive.org/issues/2891
Changes the observer behavior to be able to handle wildcards in the property pathname. For example,
navigator.*
will observe all properties on thenavigator
object. Special handling was added for__proto__
properties. Also adds the ability to capture the call stack for any particular properties.Tests
example.com
Results
It seems like the reset isn't being applied early enough. This might fix itself after the change is merged and the WPT agents consistently execute them in alphabetical order.
As expected,
navigator.userAgent
is not detected and the call stack is empty.https://codepen.io/rviscomi/pen/rNJMmgy
Results
Omitted the custom metric observations from these results for clarity, but they're the same as above.
nytimes.com
Results
Show custom metric results
```json { "call_stacks": { "navigator.userAgent": { "Error\n at Navigator.get (