HTTPArchive / legacy.httparchive.org

<<THIS REPOSITORY IS DEPRECATED>> The HTTP Archive provides information about website performance such as # of HTTP requests, use of gzip, and amount of JavaScript. This information is recorded over time revealing trends in how the Internet is performing. Built using Open Source software, the code and data are available to everyone allowing researchers large and small to work from a common base.
https://legacy.httparchive.org
Other
328 stars 84 forks source link

New script to detect importScripts(), service worker events and service worker properties #204

Closed demianrenzulli closed 3 years ago

demianrenzulli commented 3 years ago

Hi folks,

As discussed in #203 I decided to create a PR for this script (even when is not optimal), since, after analyzing some tests with @tunetheweb, having it merged for Monday's crawl could help us get an idea of the number of false negatives in serviceWorkerInitiatedURLs.

We have added three new fields to the response: importScriptsInfo, swEventListenersInfo, wPropertiesInfo, which use Rick's script for Workbox detection and combines it with the regular expressions used last year to detect SW events and properties (sw_events.sql), plus a new one to detect calls to importScripts().

Tests

It would be interesting to see how many cases like Twitter we have, to see if we are under-counting many sites that use service workers.

From what we could check with @tunetheweb this number represent a very small percentage.

It would be great if @rviscomi or @jeffposnick can take a look at some point.

cc // @obto

demianrenzulli commented 3 years ago

Hi @rviscomi! Thank you for taking a look at this PR.

I'll be running some tests, to decide what's the best course of action for both suggestions, according to what the data returns and how we visualize it.

One important question:

I've noticed that the original script takes into account all the URLs (response_bodies), not only the service worker ones (serviceWorkerInitiated).

For Workbox-related things or importScripts() this is fine, since these strings won't appear outside service workers. In other cases, like the onmessage property or message event, there can be a lot of instances of them outside service worker files. This year I was planning to also include other service worker related objects, like Cache, that can also be accessed client-side, leading to stats that might not necessarily be related to service workers usage or PWAs.

Correct me if I'm wrong, but I believe one way of mitigating this, would be to use serviceWorkerInitiated instead of response_bodies, but as you've mentioned here, there might be many false positives, like the case of the Twitter test I shared before.

Now, if based on the data we receive from the June crawl, we are able to determine that these false positives are relatively low (for example, by comparing how many tests have an empty serviceWorkers field, and a non-empty: workboxInfo), or once this PR is merged, by checking tests with an empty serviceWorkers field and a non-empty importScriptsInfo / swEventListenersInfo / swPropertiesInfo, would it be more advisable to run the scripts on serviceWorkerInitiated instead?

I believe 2019 and 2020 were taking the whole response_bodies into account, so I don't want to introduce a change that might produce an intended impact on the data. Also, since we couldn't introduce these new metrics, we might not have a lot of time to make that decision.

Thoughts are welcome!

cc // @tunetheweb

demianrenzulli commented 3 years ago

Also, this comment from @tunetheweb is very interesting:

WPT also detected they were service worker calls as they are in blue. So could also ask @pmeenan how it does that and use that potentially?

For the Twitter test, that we couldn't detect the service worker with our script, WPT shows the service worker related requests in blue in the waterfall.

image

Maybe we can reuse that logic WPT uses to detect service worker activity inside our script to be sure of which tests things from a SW?

Cheers.

pmeenan commented 3 years ago

WPT highlights any request that belongs to a different document from the main page in blue. It will show SW but also iFrames. "documentURL" in the request data.

demianrenzulli commented 3 years ago

Thanks for confirming @pmeenan! In that case, we won't be able to use that to isolate SW-only traffic.

rviscomi commented 3 years ago

Correct me if I'm wrong, but I believe one way of mitigating this, would be to use serviceWorkerInitiated instead of response_bodies, but as you've mentioned here, there might be many false positives, like the case of the Twitter test I shared before.

That's right. It would provide fewer false positives but we'd lose out on the Twitter edge case.

Now, if based on the data we receive from the June crawl, we are able to determine that these false positives are relatively low (for example, by comparing how many tests have an empty serviceWorkers field, and a non-empty: workboxInfo), or once this PR is merged, by checking tests with an empty serviceWorkers field and a non-empty importScriptsInfo / swEventListenersInfo / swPropertiesInfo, would it be more advisable to run the scripts on serviceWorkerInitiated instead?

Reevaluating after the June crawl sounds like a good plan. We can also compare to the ServiceWorkerControlledPage use counter.

tunetheweb commented 3 years ago

We can also compare to the ServiceWorkerControlledPage use counter.

@demianrenzulli and I looked at this for 2020 stats and it was pretty damn close actually.

Web Almanac counted 53,366 sites for mobile and 42,521 sites for desktop.

Blink Usage for Aug 2020 had 55,019 and 49,305. And they presumably will be counting more than the home page and all sites not just CrUX ones?

So that’s VERY close in the grand scheme of things and gives real confidence the 2020 methodology (which is what we are currently proposing for 2021 but just moving the counting to be done during the crawl) is representative.

tunetheweb commented 3 years ago

Actually where does the httparchive.blink_usage data come from? Is that from our HTTPArchive crawl (so will be just home pages after all) or from Chrome’s stats?

pmeenan commented 3 years ago

The feature usage comes from the crawl.

tunetheweb commented 3 years ago

The feature usage comes from the crawl.

Ah then the closeness is more understandable 😔

I guess the difference would be sites like Twitter where we can’t see it registering a service worker even though it does use one.

rviscomi commented 3 years ago

One other consideration for minimizing false positives (eg onmessage outside of a SW) is to test for patterns that would disqualify the script from being a service worker. For example, if we find document.querySelector in the script with the onmessage event listener, then it's likely that the script isn't a SW. WDYT would that kind of approach work, and if so what kinds of signals can you think of that would strongly indicate that we should discount the script?

demianrenzulli commented 3 years ago

Thanks for sharing your thoughts again, @rviscomi! It's really helping us understand what can be done to make our analysis more precise.

So far, I have found a potentially strong usage of features outside service worker in the following cases:

All of them can be accessed both from pages and service workers, and their usage might be quite frequent in both contexts. So, we either (1) exclude them from our analysis to avoid false positives, or (2) find a way of mitigating this issue by focusing only in serviceWorkerInitiated or something similar.

For (1): For the three features, I thought it might be interesting to know how many service workers use the message event listener, since that could let us know how many sites are using "Page to Service worker" communication techniques. Excluding fetch() or caches, might not be so bad, in my opinion. I assume that most service workers use both of these features, so I don't think that knowing that will give us very interesting insights.

Regarding (2): If we decide to implement a technique to filter service worker only activity, this comment:

For example, if we find document.querySelector in the script with the onmessage event listener, then it's likely that the script isn't a SW. WDYT would that kind of approach work, and if so what kinds of signals can you think of that would strongly indicate that we should discount the script?

Sounds like a very good idea. In general, workers don't have access to the Window and therefore the Document object either and have limited access to browser APIs.

Since Window is implicit, I believe that any call to document.* (like the example you've shared document.querySelector), should be forbidden inside worker contexts.

I'm wondering how this could be implemented? Is it possible to create an exception in pwa.js, or is it something we can do after getting the data?

demianrenzulli commented 3 years ago

BTW: cc @jeffposnick to validate this:

Sounds like a very good idea. In general, workers don't have access to the Window and therefore the Document object either and have limited access to browser APIs.

Since Window is implicit, I believe that any call to document.* (like the example you've shared document.querySelector), should be forbidden inside worker contexts.

jeffposnick commented 3 years ago

That's correct. document isn't part of the ServiceWorkerGlobalScope.

demianrenzulli commented 3 years ago

Hi folks,

I just updated this PR, by adding two additional fields: swMethodsInfo and swObjectsInfo. Tests:

Output: $.data.runs[1].firstView.pwa.*

Before starting to work on @rviscomi code to filter SW only files, I wanted to go back to on of his previous comments:

Do you need to distinguish between addEventListener and on usage in your analysis? If not, consider combining the regexes or merging the results into a single output property.

I'm not familiar on how this information will be consumed and if it's better to combine some of the fields that this script returns into one.

I'd like to hear @tunetheweb thoughts on this, before changing the current structure. Maybe we can discuss in tomorrow's sync.

demianrenzulli commented 3 years ago

Hi team,

As discussed in our sync with @tunetheweb and @obto earlier this week, I've been working in some updates to the script to get it closer to the final version, so we can hopefully use it for July's crawl. @rviscomi (and others) I would appreciate your thoughts:

New fields

You'll notice that I have included more PWA features to test, like navigationPreload and events like appinstalled.

Output format

I added an option to getInfoForPattern() to return only the matching group for some patterns. For example: importScripts(...) only returns the contained URL. Same for addEventListener(), etc.

Granularity level

As discussed, we would prefer to return the fields in the more granular way, instead of combining them, so I tried to group them according to the feature they belong to.

With that say: swEventListenersInfo and swPropertiesInfo continue being separate fields, even when they might test similar things.

Filtering "service worker only" activity

I took @rviscomi's suggestion and tried the suggested technique:

const disqualifyingSWPattern = /\bdocument\.\w+/g;

While I was initially happy with the preliminary results, I found some cases like Spotify's service worker, which code contains calls to document. As discussed this is illegal in this context, but might be the result of using some build tools or polyfills to generate the service worker. We discussed this with @jeffposnick offline, and thought that it might be risky to use this pattern as a way to detect SW-only activity.

For that reason, I decided to exclude two features that are not SW-specific and that are very commonly used in pages: fetch() and message. After that the results are much more clean.

Something else that we can do after analyzing June's crawl would be creating a condition like:

"If it doesn't have a serviceWorkers field, but it has SW-specific methods, we can consider that it uses a service worker".

Test cases

I tested 20 sites that have service workers and the output seems to be working well. I was hoping we can use that as test cases every time we introduce changes.

@tunetheweb: you can take a look at any test, for example, this one, to see how the JSON looks like.

Sorry for the length of this comment! Too many decisions have been made that I wanted to share.

tunetheweb commented 3 years ago

Hey @demianrenzulli is this expected in your results:

image

Given that you are searching for very specific importScripts text how can the e get in there?

tunetheweb commented 3 years ago

Oh and btw I just found out yesterday that if you use URLs like this: https://www.webpagetest.org/jsonResult.php?test=210610_AiDcSS_378a25e000712f616bd14f59f47d1142&pretty=1&rv=0&requests=0&standard=0&median=0&runs=1&console=0 with the query params to strips out a lot of the JSON before sending back, it is a lot quicker to load when testing your scripts and doesn't kill your Chrome instance! 😀

demianrenzulli commented 3 years ago

Hi @tunetheweb, thank you for taking a look at the tests and share your feedback!

The output you found belongs to an importScripts() call inside one of Workbox files:

https://storage.googleapis.com/workbox-cdn/releases/4.3.1/workbox-sw.js

This is the relevant portion of the code:

loadModule(t) {
   const e  = this.i(t);
      try {
         importScripts(e), this.o = !0
      } catch (s) {
         throw console.error(`Unable to import module '${t}' from '${e}'.`), s
      }
}

As you can see, the regular expression has detected the pattern well, but since the call to importScripts() receives a variable as a parameter, it's returning that value as its content.

I didn't want to force the the parameter sent to importScripts()to start with ' or ", to avoid excluding potentially valid cases, and also in case we want to come up with an estimation like: "X% of sites use importScripts()".

Maybe at the time of creating the raking of most used libraries we could filter out these results that don't add much? It seems like this is this is the query used last year, so any recommendations you might have are highly welcome.

tunetheweb commented 3 years ago

Ah I loaded the wrong script so couldn't see it! My bad.

Is it useful to know what they import? Or just that they use importScript? Do we need the parameter at all or just the list of scripts that call it?

i.e. instead of this:

"importScriptsInfo": {
    "https: //www.tiktok.com/sw.js": [
        "'https://storage.googleapis.com/workbox-cdn/releases/4.3.1/workbox-sw.js'"
    ],
    "https: //storage.googleapis.com/workbox-cdn/releases/4.3.1/workbox-sw.js": [
        "e"
    ]
}

Should we have this?:

"importScriptsInfo": [
    "https: //www.tiktok.com/sw.js",
    "https: //storage.googleapis.com/workbox-cdn/releases/4.3.1/workbox-sw.js"
]

Or maybe each just this?:

"importScriptsInfo": true
demianrenzulli commented 3 years ago

Thanks again @tunetheweb!

If we do this:

"importScriptsInfo": [
    "https: //www.tiktok.com/sw.js",
    "https: //storage.googleapis.com/workbox-cdn/releases/4.3.1/workbox-sw.js"
]

We know that there are two files that contain importScripts(), but we don't know which are the libraries they are importing, so we won't be able to create the ranking of most used libraries. Am I right?

Same goes for "importScriptsInfo": true. It would help us know how many sites have importScripts(), but not which libraries they are importing.

Please, let me know if I understood this correctly.

tunetheweb commented 3 years ago

Ah you're correct yes. We want to track the most used libraries, not just the use of import scripts.

Then maybe check it includes at least one .? Obviously this will exclude all the variable type calls to importScripts but they're of limited use in it's current form anyway.

Or we just leave as is and filter out in the results?

WDYT?

demianrenzulli commented 3 years ago

I really like this idea:

"Then maybe check it includes at least one .?", but what if a site has a service worker with only an importScripts() that's being loaded by sending it a variable?

While the variable wouldn't be useful for the rankings, not knowing that site has an importScripts() call won't let us come up with a realistic number for the % of sites that uses importScripts().

That's the first row in last year's table.

Am I right?

tunetheweb commented 3 years ago

I say leave as is then and can clean it up in Sheets.

I also had a look at your test cases and looks pretty thorough! My only concern is the number of "Empty serviceWorkers field with non empty SW methods" cases you have. Is that because you specifically looked for them? Or is our logic so limited that it's picking up so few of them? Anything we can do to improve it that you've seen from these sites? Though guess we do have Blink Usage and Lighthouse to fall back on for those tests.

demianrenzulli commented 3 years ago

I have to admit that I am a bit concerned too about the potentially large number of false negatives that we have seen in the tests for the serviceWorkers field. One hypothesis isthat, since I tested mostly well known sites that might be more sophisticated and use advanced minification techniques, this number is relatively large in this group, but maybe won't be when analyzing all the URLs from the dataset.

If we find that the number of false positives is still considerable, something that we can do is to create a condition like:

"If a site has an empty serviceWorkers field but either importScriptsInfo or swEventListenersInfo or swMethodsInfo are not empty, we can assume that it has a service worker".

A priori, none of these fields should be populated for sites that don't have it, since the methods they check for shouldn't appear outside of service workers. But maybe it's good to have the data to be 100% sure.

Please, let me know what you think!

demianrenzulli commented 3 years ago

Here is a test for web.dev, which doesn't have a service worker, and therefore, has all the SW-related fields mentioned before, empty.

demianrenzulli commented 3 years ago

Hi folks, a final update on my side:

I just committed an experimental field serviceWorkerHeuristic that will be true if either serviceWorkers, workboxInfo, swEventListenersInfo or swMethodsInfo are non-empty.

If you take a look at the SW Heuristics column of the test case sheet, you'll see that this value is true for all the PWA sites tested there.

I added a new tab in the sheet, Non-PWA sites, with URLs for sites that don't register a service worker, and you'll see that in all cases SW Heuristics is false.

This might help us get a more accurate idea of how many sites actually have service workers, but since the number of tests I performed is relatively small (30 approximately), I can't be sure what the result will be in the order of thousands when the actual crawl takes place.

I hope this helps.

demianrenzulli commented 3 years ago

@rviscomi as discussed with @tunetheweb I’ve resolved all the comments and don’t plan on making any further changes here if you can review again when you have a moment?

I did my best to cover as many test cases as possible. @obto as we discussed, this seems to be the only way to test, but if you have any thoughts on how to test custom metrics that would be great.