max-ostapenko closed this PR 3 years ago.
@VictorLeP could you add to-do items to the description of this PR? I've added some that you've completed, but how much is left?
I don't seem to be allowed to edit your initial comment (which I presume you mean with 'description')?
I'm guessing this is due to the same reason as me not being able to mark the pull request as ready for review: "Only those with write access to this repository can mark a draft pull request as ready for review."
You should have full access now @VictorLeP. Can you try again?
Thanks, I don't see an edit option for https://github.com/HTTPArchive/legacy.httparchive.org/pull/211#issue-674074013 yet though. (Just to check I'm looking in the right place: I would expect an 'Edit' item in the menu under the three dots top right of that comment.)
Ah looks like this repo is set up differently to the main Web Almanac one so you can’t edit each other’s comments unfortunately.
OK, thanks for checking!
@max-ostapenko These are the changes to the to-do list I think should be made:
Depending on whether we can use Wappalyzer or not, we might also need custom metrics to detect certain fingerprinting libraries more reliably (e.g., `ClientJS`).
I thought it would be enough to verify the presence of a `ClientJS` object?
Sure, but a custom metric is still needed for that if it's not added to Wappalyzer :)
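A custom metric for this could be a very small check. A hypothetical sketch (the function name is made up; in a WPT custom metric the `win` parameter would simply be the page's `window`):

```javascript
// Hedged sketch of the custom-metric check discussed above: report whether
// the ClientJS library appears to be present. The `win` parameter stands in
// for the page's `window` so the check can be exercised outside a browser.
function detectsClientJS(win) {
  // ClientJS exposes a global constructor of the same name.
  return typeof win.ClientJS === 'function';
}

// In a WPT custom metric this would be: return detectsClientJS(window);
```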
You can open a PR with Wappalyzer if you know how to detect it. And that way everyone gets to benefit from it and not just the Web Almanac.
Here's the PR for FingerprintJs for example: https://github.com/AliasIO/wappalyzer/pull/1790 and a follow up one to improve on that: https://github.com/AliasIO/wappalyzer/pull/3503
Be aware that if you're relying on the detection being implemented in Wappalyzer, your PR would need to be merged upstream before July 1. If the PR is delayed for any reason, the metric may need to be implemented here as a custom metric anyway.
Next release of Wappalyzer is in about 2 weeks. @tunetheweb do you know how much time it takes before it's used on our crawling servers?
We normally sync it up just before the Web Almanac run to get the latest version.
FWIW, we sync the code to master and don't need to wait for an official release.
FYI: PRs https://github.com/AliasIO/wappalyzer/pull/4048 (geolocation) and https://github.com/AliasIO/wappalyzer/pull/4050 (fingerprinting) are pending for adding libraries to Wappalyzer.
I've added detection of tags related to referrer policies (because these can also be set in HTML) and of accesses to the `featurePolicy` property. (These could be considered two additional tasks for the TODO list up top that have been completed.)
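The referrer-policy part of that detection could look like the following hypothetical sketch (function name and shape are made up; the DOM access is factored out so the extraction logic itself is testable):

```javascript
// Hedged sketch of the referrer-policy-in-HTML detection mentioned above:
// a policy can be set via <meta name="referrer" content="..."> as well as
// via the Referrer-Policy header.
function extractMetaReferrerPolicies(metaTags) {
  // metaTags: array of {name, content} objects, e.g. built from
  // [...document.querySelectorAll('meta')] inside a custom metric.
  return metaTags
    .filter(m => (m.name || '').toLowerCase() === 'referrer')
    .map(m => m.content);
}
```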
@pmeenan There may be multiple `Origin-Trial` headers with tokens. How can I parse multiple header values from `$WPT_REQUESTS[0].response_headers`, which is an object (and so must have unique keys)? Is this supported in any way?
@max-ostapenko @rviscomi may know better if any processing has been done, but looking at the raw HARs there are headers in a few places but none are objects indexed by the header name:
`response.headers` is an array of name/value objects. `_headers.response` is an array of strings, each containing the full `name: value` header line.
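For illustration, the two shapes described above look roughly like this (header names and values here are made up), and the raw-string form can be normalized into the name/value form with a small helper:

```javascript
// Hedged illustration: `response.headers` follows the HAR format, where
// duplicate header names are simply repeated entries.
const harHeaders = [
  { name: 'Content-Type', value: 'text/html' },
  { name: 'Origin-Trial', value: 'tokenA' },
  { name: 'Origin-Trial', value: 'tokenB' }  // duplicates allowed
];

// `_headers.response` is raw "name: value" strings; a normalizer can
// bring them into the same name/value shape.
function parseRawHeaders(lines) {
  return lines.map(line => {
    const i = line.indexOf(':');
    return { name: line.slice(0, i).trim(), value: line.slice(i + 1).trim() };
  });
}
```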
I think the summary tables have headers that were pulled together by the HA processing code. In that case it combines the values of multiple headers with the same key by comma-delimiting them, which is what the RFC requires, but won't match exactly what came over the wire.
@VictorLeP please could you check the origin-trials parsing and data structure? I decided to include both meta tag and header options into the custom metric.
@pmeenan did you consider changing the structure of $WPT_REQUESTS.response_headers similar to HAR's response.headers? (object to array)
Might be better to see if Rick can create a better summary table from the HARs. The current requests summary data is from the legacy SQL dumps and doesn't have a lot of flexibility.
Looks good to me! A small test suggests that in the case of multiple headers with the same key, the values are collapsed into a comma-separated string in the object, so we don't lose any data there.
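Building on that observation, recovering the individual tokens could be as simple as splitting. A sketch (assumption: duplicate `Origin-Trial` headers arrive comma-joined in `$WPT_REQUESTS[0].response_headers`, as the test above suggests; base64 tokens contain no commas, so splitting is safe):

```javascript
// Recover individual origin-trial tokens from a comma-joined header value.
function splitOriginTrialTokens(headerValue) {
  return (headerValue || '')
    .split(',')
    .map(t => t.trim())
    .filter(t => t.length > 0);
}
```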
Depending on https://github.com/HTTPArchive/almanac.httparchive.org/issues/2211, we might still need one more custom metric for `.well-known/gpc.json`, but that shouldn't be too hard. Otherwise, I think everything is done :)
Yeah this raises a good question about how much work we should be putting into the custom metrics versus post-processed summarization and where to draw the line for the best place for specific types of metrics.
But to @max-ostapenko's point, it seems like the structure of the headers made available to custom metrics is different from the HAR structure. There's a question of whether we should even do this work in the custom metric (maybe?) but I think parity with the HAR would be good and it'd help with processing repeatable headers, if needed.
@max-ostapenko @VictorLeP is this PR ready for review? Please also add a few test cases to show it working in WPT.
Argh, sorry - I didn't realize this was in the custom metrics - I thought it was pulling from the summary data.
The standard headers from the dev tools events are in key/value dictionary form: https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-Response
There is `headersText`, which will have the raw headers, but that only works for HTTP/1.x responses, so it doesn't seem worth trying to fall back to.
The custom metrics don't have access to the netlog-processed headers which is what we use for the most accurate data for the HARs so I don't know if we're going to be able to do better for custom metrics.
I've made a `well-known.js` as discussed in https://github.com/HTTPArchive/almanac.httparchive.org/issues/2211. You should be able to extend it to check off `.well-known/gpc.json`.
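A hedged sketch of what that `gpc.json` check could look like (per the draft GPC spec the resource body looks like `{"gpc": true, "lastUpdate": "..."}`; the parsing is split out from the fetch so it can be tested in isolation, and the function name is made up):

```javascript
// Decide whether a .well-known/gpc.json response body signals GPC support.
function parseGpcSupport(body) {
  try {
    const data = JSON.parse(body);
    return data != null && data.gpc === true;
  } catch (e) {
    return false;  // missing resource or non-JSON body counts as "no GPC"
  }
}

// In the custom metric (assuming async custom metrics are available):
// return fetch('/.well-known/gpc.json')
//   .then(r => r.text()).then(parseGpcSupport).catch(() => false);
```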
Please also add a few test cases to show it working in WPT.
@rviscomi We've added test websites in the comments, is that what you mean?
@pmeenan ok, got it. My take was to write one custom metric that checks for all optional feature signals, in meta tags or in headers. I now learned that for the most accurate analysis we have to combine it with SQL over the headers in the requests table.
I think he means provide links to actual tests on https://www.webpagetest.org/, I'll try to get to those today.
It's indeed a bit annoying that we cannot do the whole extraction upfront to have clean data already in the requests table, but must merge the meta tag and header values in the SQL query. (Given that the literal definition of `<meta http-equiv="...">` is to simulate HTTP headers, I could imagine that there are more cases where the same issue arises.)
@max-ostapenko I'm thinking we should then revert to just storing the raw base64 value for the origin trial meta tags in the custom metric, and then do the extraction of all the attributes in the SQL query? We can use a JS UDF there, so it would just be a matter of moving the function statement over.
@VictorLeP let's keep it the way it is now for crawling. But I'll add the base64 string to the custom metric. Then we'll be able to verify whether there is any value in going the hard way (compared to HAR data) and will adjust it to a UDF. What do you think?
Sounds excellent, it'll be interesting to compare the two sources and see if there is any meaningful difference.
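If the extraction does move into a JS UDF later, its core would be a token decoder. A hypothetical sketch, based on the documented Chrome origin-trial token layout (1 version byte, a 64-byte signature, a 4-byte big-endian payload length, then a UTF-8 JSON payload); it does not verify the signature, and the function name is made up:

```javascript
// Decode the JSON payload of a Chrome origin-trial token from its base64
// form. Signature verification is intentionally skipped in this sketch.
function decodeOriginTrialToken(base64Token) {
  const raw = atob(base64Token.trim());
  const version = raw.charCodeAt(0);
  // 32-bit big-endian payload length at offset 65 (after the 1 version
  // byte and the 64-byte signature).
  const len = (raw.charCodeAt(65) << 24) | (raw.charCodeAt(66) << 16) |
              (raw.charCodeAt(67) << 8) | raw.charCodeAt(68);
  return { version, payload: JSON.parse(raw.substr(69, len)) };
}
```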
@VictorLeP @rviscomi seems we are ready to merge.
Progress on https://github.com/HTTPArchive/almanac.httparchive.org/issues/2149