HTTPArchive / legacy.httparchive.org

<<THIS REPOSITORY IS DEPRECATED>> The HTTP Archive provides information about website performance such as # of HTTP requests, use of gzip, and amount of JavaScript. This information is recorded over time, revealing trends in how the Internet is performing. Built using open source software, the code and data are available to everyone, allowing researchers large and small to work from a common base.
https://legacy.httparchive.org

Privacy 2021 custom metrics #211

Closed: max-ostapenko closed this 3 years ago

max-ostapenko commented 3 years ago

Progress on https://github.com/HTTPArchive/almanac.httparchive.org/issues/2149

max-ostapenko commented 3 years ago

@VictorLeP could you add to-do items to the description of this PR? I've added some that you've completed, but how much is left?

VictorLeP commented 3 years ago

@VictorLeP could you add to-do items to the description of this PR? I've added some that you've completed, but how much is left?

I don't seem to be allowed to edit your initial comment (which I presume you mean with 'description')?

I'm guessing this is due to the same reason as me not being able to mark the pull request as ready for review: "Only those with write access to this repository can mark a draft pull request as ready for review."

tunetheweb commented 3 years ago

You should have full access now @VictorLeP . Can you try again.

VictorLeP commented 3 years ago

Thanks, I don't see an edit option for https://github.com/HTTPArchive/legacy.httparchive.org/pull/211#issue-674074013 yet though. (Just to check I'm looking in the right place: I would expect an 'Edit' item in the menu under the three dots top right of that comment.)

tunetheweb commented 3 years ago

Ah, looks like this repo is set up differently from the main Web Almanac one, so you can't edit each other's comments, unfortunately.

VictorLeP commented 3 years ago

OK, thanks for checking!

VictorLeP commented 3 years ago

@max-ostapenko These are the changes to the to-do list I think should be made:

VictorLeP commented 3 years ago

Depending on whether we can use Wappalyzer or not, we might also need custom metrics to detect certain fingerprinting libraries more reliably (e.g., ClientJS).

max-ostapenko commented 3 years ago

Depending on whether we can use Wappalyzer or not, we might also need custom metrics to detect certain fingerprinting libraries more reliably (e.g., ClientJS).

I thought it would be enough to verify the presence of ClientJS object?

VictorLeP commented 3 years ago

Depending on whether we can use Wappalyzer or not, we might also need custom metrics to detect certain fingerprinting libraries more reliably (e.g., ClientJS).

I thought it would be enough to verify the presence of ClientJS object?

Sure, but a custom metric is still needed for that if it's not added to Wappalyzer :)
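A minimal sketch of such a custom metric check, assuming ClientJS's usual global constructor name. In a WPT custom metric this would run against `window`; here the global object is a parameter so the logic is testable anywhere:

```javascript
// Hypothetical sketch: detect the ClientJS fingerprinting library by the
// presence of its global constructor. `globalObj` stands in for `window`.
function detectsClientJS(globalObj) {
  // ClientJS exposes a constructor named `ClientJS` on the global object.
  return typeof globalObj.ClientJS === 'function';
}
```

In an actual custom metric the return value would be folded into the metric's JSON output, e.g. `{ clientjs: detectsClientJS(window) }`.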

tunetheweb commented 3 years ago

You can open a PR with Wappalyzer if you know how to detect it. And that way everyone gets to benefit from it and not just the Web Almanac.

Here's the PR for FingerprintJs for example: https://github.com/AliasIO/wappalyzer/pull/1790 and a follow up one to improve on that: https://github.com/AliasIO/wappalyzer/pull/3503

rviscomi commented 3 years ago

Be aware that if you're relying on the detection being implemented in Wappalyzer, your PR would need to be merged with them before July 1. If the PR is delayed for any reason, the metric may need to be implemented here as a custom metric anyway.

max-ostapenko commented 3 years ago

The next release of Wappalyzer is in about 2 weeks. @tunetheweb do you know how long it takes before a release is used on our crawling servers?

tunetheweb commented 3 years ago

We normally sync it up just before the web almanac run to get latest version.

pmeenan commented 3 years ago

FWIW, we sync the code to master and don't need to wait for an official release.

VictorLeP commented 3 years ago

FYI: PRs https://github.com/AliasIO/wappalyzer/pull/4048 (geolocation) and https://github.com/AliasIO/wappalyzer/pull/4050 (fingerprinting) are pending for adding libraries to Wappalyzer.

VictorLeP commented 3 years ago

I've added detection of tags related to referrer policies (because these can also be set in HTML) and of accesses to the featurePolicy property. (These could be considered two additional, already completed tasks for the TODO list up top.)
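The referrer-policy part of that detection could be sketched as below (names hypothetical). A real custom metric would build `metas` from `document.querySelectorAll('meta[name="referrer"]')`; the plain-array form keeps the logic testable outside a browser:

```javascript
// Hedged sketch: gather referrer-policy values declared via <meta> tags.
// Each entry in `metas` mimics a meta element's name/content attributes.
function referrerPolicyFromMetas(metas) {
  return metas
    .filter(m => (m.name || '').toLowerCase() === 'referrer')
    .map(m => m.content);
}
```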

max-ostapenko commented 3 years ago

@pmeenan There may be multiple 'Origin-Trial' headers with tokens. How can I parse multiple header values from $WPT_REQUESTS[0].response_headers, which is an object (and must therefore have unique keys)? Is this supported in any way?

pmeenan commented 3 years ago

@max-ostapenko @rviscomi may know better if any processing has been done, but looking at the raw HARs there are headers in a few places but none are objects indexed by the header name:

response.headers is an array of name/value objects:

[screenshot: response.headers as an array of name/value objects]

_headers.response is an array of strings, each containing the full `name: value` header line:

[screenshot: _headers.response as an array of raw header strings]

I think the summary tables have headers that were pulled together by the HA processing code. In that case it combines the values of multiple headers with the same key by comma-delimiting them, which is what the RFC requires but won't match exactly what came over the wire.
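Given that comma-delimiting behavior, the individual values can be recovered with a split, sketched below (function name hypothetical). Origin-Trial tokens are base64 and never contain commas, so a plain split is safe for that particular header:

```javascript
// Hedged sketch: recover individual values from a headers object in which
// repeated headers with the same name were comma-joined into one string.
// Header-name lookup is case-insensitive, as header names are.
function splitJoinedHeader(headersObj, name) {
  const key = Object.keys(headersObj)
    .find(k => k.toLowerCase() === name.toLowerCase());
  if (key === undefined) return [];
  return headersObj[key]
    .split(',')
    .map(v => v.trim())
    .filter(v => v.length > 0);
}
```

Note this is only safe for headers whose values cannot themselves contain commas; for headers like `Set-Cookie` the comma-join is lossy no matter how it is split.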

max-ostapenko commented 3 years ago

@VictorLeP could you please check the origin-trials parsing and data structure? I decided to include both the meta tag and header options in the custom metric.

max-ostapenko commented 3 years ago

@pmeenan did you consider changing the structure of $WPT_REQUESTS.response_headers similar to HAR's response.headers? (object to array)

pmeenan commented 3 years ago

Might be better to see if Rick can create a better summary table from the HARs. The current requests summary data is from the legacy SQL dumps and doesn't have a lot of flexibility.


VictorLeP commented 3 years ago

@VictorLeP please could you check the origin-trials parsing and data structure? I decided to include both meta tag and header options into the custom metric.

Looks good to me! A small test suggests that in the case of multiple headers with the same key, the values are collapsed into a comma-separated string in the object, so we don't lose any data there.

VictorLeP commented 3 years ago

Depending on https://github.com/HTTPArchive/almanac.httparchive.org/issues/2211, we might still need one more custom metric for .well-known/gpc.json, but that shouldn't be too hard. Otherwise, I think everything is done :)

rviscomi commented 3 years ago

Might be better to see if Rick can create a better summary table from the HARs. The current requests summary data is from the legacy SQL dumps and don't have a lot of flexibility.

Yeah this raises a good question about how much work we should be putting into the custom metrics versus post-processed summarization and where to draw the line for the best place for specific types of metrics.

But to @max-ostapenko's point, it seems like the structure of the headers made available to custom metrics is different from the HAR structure. There's a question of whether we should even do this work in the custom metric (maybe?) but I think parity with the HAR would be good and it'd help with processing repeatable headers, if needed.

@max-ostapenko @VictorLeP is this PR ready for review? Please also add a few test cases to show it working in WPT.

pmeenan commented 3 years ago

Argh, sorry - I didn't realize this was in the custom metrics - I thought it was pulling from the summary data.

The standard headers from the dev tools events are in key/value dictionary form: https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-Response

There is headersText which will have the raw headers but that only works for http 1.x responses so it doesn't seem to be worth trying to fall back to.

The custom metrics don't have access to the netlog-processed headers which is what we use for the most accurate data for the HARs so I don't know if we're going to be able to do better for custom metrics.

GJFR commented 3 years ago

Depending on HTTPArchive/almanac.httparchive.org#2211, we might still need one more custom metric for .well-known/gpc.json, but that shouldn't be too hard. Otherwise, I think everything is done :)

I've made a well-known.js as discussed in https://github.com/HTTPArchive/almanac.httparchive.org/issues/2211. You should be able to extend it to check off .well-known/gpc.json.
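The `.well-known/gpc.json` extension could validate the fetched body along these lines (function name hypothetical; the fetch itself happens in the custom metric). The GPC proposal expects a JSON body with a boolean `gpc` field:

```javascript
// Hedged sketch: validate the body fetched from /.well-known/gpc.json.
// Returns whether a parseable resource was found and whether it opts in.
function parseGpc(bodyText) {
  try {
    const data = JSON.parse(bodyText);
    return { found: true, gpc: data.gpc === true };
  } catch (e) {
    // Not JSON (e.g. an HTML 404 page served with status 200).
    return { found: false, gpc: false };
  }
}
```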

max-ostapenko commented 3 years ago

Please also add a few test cases to show it working in WPT.

@rviscomi We've added test websites in the comments, is that what you mean?

@pmeenan ok, got it. My take was to write one custom metric that checks all the optional feature signals, whether in meta tags or in headers. I've now learned that for the most accurate analysis we have to combine it with SQL over the headers in the requests table.

VictorLeP commented 3 years ago

Please also add a few test cases to show it working in WPT.

@rviscomi We've added test websites in the comments, is that what you mean?

I think he means provide links to actual tests on https://www.webpagetest.org/, I'll try to get to those today.

@pmeenan ok, got it. My take was to write one custom metric that checks all the optional feature signals, whether in meta tags or in headers. I've now learned that for the most accurate analysis we have to combine it with SQL over the headers in the requests table.

It's indeed a bit annoying that we cannot do the whole extraction upfront to have clean data already in the requests table, but must merge the meta tag and header values in the SQL query. (Given that the literal definition of <meta http-equiv="..."> is to simulate HTTP headers, I could imagine that there are more cases where the same issue arises.) @max-ostapenko I'm thinking we should then revert to just storing the raw base64 value for the origin value meta tags in the custom metric, and then doing the extraction of all the attributes in the SQL query? We can use a JS UDF there, so it would just be a matter of moving over the function statement.
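The "extract in SQL" idea could be sketched as a token decoder like the one below, which could sit in either the custom metric or a BigQuery JS UDF. The layout assumed here (1-byte version, 64-byte signature, 4-byte big-endian payload length, then a JSON payload) follows Chrome's documented origin-trial token format; signature verification is deliberately skipped:

```javascript
// Hedged sketch: decode the JSON payload of a base64 Origin-Trial token.
// Assumes Chrome's token layout: version (1 byte), Ed25519 signature
// (64 bytes), payload length (4 bytes, big-endian), JSON payload.
function decodeOriginTrialToken(b64) {
  const bin = atob(b64); // binary string, one char per byte
  const payloadLength =
    (bin.charCodeAt(65) << 24) | (bin.charCodeAt(66) << 16) |
    (bin.charCodeAt(67) << 8) | bin.charCodeAt(68);
  const payload = bin.slice(69, 69 + payloadLength);
  return JSON.parse(payload); // { origin, feature, expiry, ... }
}
```

Because the body is plain JavaScript with no DOM access, moving it into a BigQuery UDF would indeed just be a matter of moving the function statement, as suggested above.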

max-ostapenko commented 3 years ago

@max-ostapenko I'm thinking we should then revert to just storing the raw base64 value for the origin value meta tags in the custom metric, and then doing the extraction of all the attributes in the SQL query? We can use a JS UDF there, so it would just be a matter of moving over the function statement.

@VictorLeP let's keep it the way it is now for crawling. But I'll add a base64 string to the custom metric. Then we'll be able to verify whether there is any value in going the hard way (compared to HAR data) and will adjust it to a UDF. What do you think?

VictorLeP commented 3 years ago

@max-ostapenko I'm thinking we should then revert to just storing the raw base64 value for the origin value meta tags in the custom metric, and then doing the extraction of all the attributes in the SQL query? We can use a JS UDF there, so it would just be a matter of moving over the function statement.

@VictorLeP let's keep it the way it is now for crawling. But I'll add a base64 string to the custom metric. Then we'll be able to verify whether there is any value in going the hard way (compared to HAR data) and will adjust it to a UDF. What do you think?

Sounds excellent, it'll be interesting to compare the two sources and see if there is any meaningful difference.

VictorLeP commented 3 years ago

WebPageTest runs:

max-ostapenko commented 3 years ago

@VictorLeP @rviscomi it seems we are ready to merge.