HTTPArchive / bigquery

BigQuery import and processing pipelines

Integrate Wappalyzer platform detection #19

Closed rviscomi closed 6 years ago

rviscomi commented 7 years ago

Implementation of https://github.com/HTTPArchive/httparchive/issues/90

cc @AliasIO

rviscomi commented 7 years ago

Now that we've managed to update our dataflow pipeline to version 1.9, I can begin implementing this feature. Some requirements:

Here's a rough outline of the expected approach:

The manual unzipping code changes were added into the beginning of the split-har DoFn. I think this should be walked back to be more like how the read-har step behaved, so that the HAR unzipping/decoding happens before (and separate from) the split-har step. This should enable the platform detection to work on the HAR data in parallel with the LH/pages/requests/bodies table generation. Arbitrary additional parallel processes could be added at this step. See below for proposed changes:

[Image: diagram of the proposed Dataflow pipeline changes]

We can limit the detection to the CMS category -- or just WordPress -- for an MVP. The results table might look something like this:

| url | category | name | isImplied |
| --- | --- | --- | --- |
| example.com | CMS | WordPress | FALSE |
| example.com | Programming Languages | PHP | TRUE |
| example.com | JavaScript Frameworks | jQuery | TRUE |

igrigorik commented 7 years ago

Why not annotate the entries or page-level object directly? Generally speaking, joins are slower and more expensive, and life is much simpler if we surface this data directly on the relevant JSON record. I'm imagining a future where we have a dozen such annotations: if each of them lives in a different table, the joins get annoying and expensive pretty quickly.

An alternative structure here would be unzip > [ inject annotators ] > split > write-*? As a bonus, we don't have to worry about per-annotator schemas, and the split > write-* steps can remain unchanged, as the annotation results would be JSON-encoded in payload fields.

WDYT?

rviscomi commented 7 years ago

Ahh ok yeah I like your idea. We don't quite get the parallelization but you're right that we should be optimizing for the querying experience not necessarily pipeline performance.

In the pages table, the payload JSON object would look like this:

{
  //...
  annotators: {
    wappalyzer: [
      {"category": "CMS", "name": "Wordpress", "isImplied": false} //...
    ]
  }
  //...
}

isImplied could also more richly describe how it was implied, e.g. {..., "name": "PHP", "impliedFrom": ["WordPress"]}

This is a relatively small table so we shouldn't run into any row size limitations.

igrigorik commented 7 years ago

What do we give up on the parallelization? We can still ParDo over all the records for each annotator, and run all the annotators in parallel as well, right? I guess the only gotcha is that advancing to the write step will be blocked on all the annotators finishing (we'll fan out, and will then need to converge on the split step before write). Does that sound about right?

Re, format: yep, something like that.

rviscomi commented 7 years ago

Originally I was picturing the annotation happening during write-pages, but I see what you mean instead. Before splitting the HAR, we run it through an annotator DoFn that injects the new annotators object. Then we pipe the modified HAR file to split-har, which is none the wiser about the foreign object in the pages output.

This also means we get to revive the old HarDecoder since we'll be passing the HAR file around between DoFns. *~gone but not forgotten~*

rviscomi commented 7 years ago

I've made the dataflow changes described above in https://github.com/HTTPArchive/bigquery/pull/20. Still need to do the Wappalyzer specific stuff.

One thing I'm not sure of is how the Wappalyzer JSON config will be made available to the AnnotateFn. I think I can use TextIO.Read to pull the config into the pipeline and do some kind of dataflow join to pass both the config and the HAR file into the annotator.

Maybe since both inputs are JSON, I can stuff one inside the other. Similar to adding the annotations to the page object, I can add the Wappalyzer config to the HAR itself. Or just wrap both of them in a JSON object: {wappalyzerConfig: {...}, har: {...}}. Getting kind of kludgy at this point but it simplifies things.

rviscomi commented 7 years ago

Ilya had mentioned off-thread that there's another approach, which is to use side inputs. AFAICT this will end up loading the config file into memory for each worker, as opposed to loading once and passing into each worker as main input. This is probably fine since the config file is ~200KB and worker memory is abundant at 15GB. This also seems to be the more semantic way to inject helper data dynamically.

I've also completed part 2 of the PR, which is the script to sync the config file from GitHub to GCS.

rviscomi commented 6 years ago

Update on this.

I'm exploring a solution outside of Dataflow, where the Wappalyzer signals are serialized in a BigQuery table and we do a special kind of query that joins the Wappalyzer signals with the HTTP Archive data. For example, we can query for all the WordPress signals:

[Image: screenshot of a query over the serialized WordPress signals]

Then for each app that is or implies WordPress (there are no indirect/transitive implications in Wappalyzer) we can query for all URLs that match any of these signals. Finally, the output is written to a table.

This read/write process could be automated to occur at the end of each crawl using a PubSub subscription that:

Here is a script I wrote to massage the raw Wappalyzer data into a BigQuery-friendly format with only the fields we care about. We can have a separate process for syncing this with newer upstream versions of Wappalyzer.

https://gist.github.com/rviscomi/d91bbb665e4592811e051c4e180bb62b#file-wappalyzerbigquery-js-L1-L50
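
Once loaded, the serialized table should be queryable along these lines — a rough sketch, with column names assumed from the fields the script extracts (name, categories, html, script, headers, meta, implies):

#standardSQL
# Peek at the serialized signals for a single app.
SELECT
  name,        -- app name, e.g. 'WordPress'
  categories,  -- repeated: category names
  html,        -- repeated: regexes tested against response bodies
  script,      -- repeated: regexes tested against script URLs
  headers,     -- repeated: header name/value patterns
  implies      -- repeated: other apps implied by this one
FROM
  `httparchive.wappalyzer.apps`
WHERE
  name = 'WordPress'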

igrigorik commented 6 years ago

> Then for each app that is or implies WordPress (there are no indirect/transitive implications in Wappalyzer) we can query for all URLs that match any of these signals. Finally, the output is written to a table.

If we want to expose the full Wappalyzer classification for each run — i.e. not limited to a select subset we defined ahead of time — how does this scale? We would need to run each individual rule as BQ query? How long would that take and what does this setup win us over doing this as an extra batch step in dataflow?

rviscomi commented 6 years ago

If it's not feasible (due to time or resource constraints) to classify all apps in one query -- writing that query alone would be an accomplishment -- then we can run one query per app. Granted, it's not a distributed/concurrent solution like Dataflow, but given that this runs asynchronously at the end of the crawl, I don't see any scalability concerns. For example, here's a query that classifies all URLs directly or indirectly as WordPress:

SELECT
  url,
  name
FROM
  `summary_pages.2018_02_15_desktop`
JOIN
  `response_bodies.2018_02_15_desktop`
USING
  (url), (
  SELECT
    name,
    html
  FROM
    `httparchive.wappalyzer.apps`,
    UNNEST(implies) AS implies,
    UNNEST(html) AS html
  WHERE
    name = 'WordPress'
    OR implies = 'WordPress')
WHERE
  REGEXP_CONTAINS(body, html)
GROUP BY
  url,
  name

It processes 913 GB in 15.6 seconds. Assuming it takes this long for all 1061 apps, it will take about 4.5 hours in total if run serially. BigQuery also supports async queries, so if 4.5 hours is too long we can investigate that to run the queries in parallel.

> what does this setup win us over doing this as an extra batch step in dataflow?

Having the app rules accessible in BigQuery allows developers to see exactly why a site was classified the way it was and reproduce the results themselves. The Dataflow job also does a lot of other post-processing and this compartmentalizes the work so we can run it interactively. It's also nice that the classification queries themselves are good demonstrations of how to examine HTTP Archive data, but now I'm reaching ;)

pmeenan commented 6 years ago

Why don't we just integrate the full Wappalyzer script into the WebPageTest agent directly? That won't give us the ability to retroactively detect things for previous runs but presumably neither will doing it in the dataflow.

It wouldn't plug in all that well as a custom metric because of the asyncs and awaits, but it may not be too painful to integrate the code as-is into the WebPageTest agent directly if I also teach the agent how to run and wait for the results.

Maintenance just becomes a matter of merging the latest version of the code as-is, and I'm guessing the data would be useful for a lot of WPT use cases outside of just the HTTP Archive.

igrigorik commented 6 years ago

> For example, here's a query that classifies all URLs directly or indirectly as WordPress...

That query is incomplete, correct? Because there are also meta + script and other rules we should be checking for.

> It processes 913 GB in 15.6 seconds. Assuming it takes this long for all 1061 apps, it will take about 4.5 hours in total if run serially. BigQuery also supports async queries, so if 4.5 hours is too long we can investigate that to run the queries in parallel.

TBH, it seems a little odd to me to replicate a parallel fanout powered by BQ when DataFlow is optimized precisely for such tasks: it also requires setting up pubsub channels, creating a job manager workflow to spawn all the tasks, etc.

> Having the app rules accessible in BigQuery allows developers to see exactly why a site was classified the way it was and reproduce the results themselves.

That's fair. Realistically though, I wouldn't expect anyone to run this code themselves — it's too costly. The WP analysis alone would eat up your entire free tier?

> The Dataflow job also does a lot of other post-processing and this compartmentalizes the work so we can run it interactively.

I'm dubious of the interactivity benefits. One benefit of this approach is that we can defer the classification until someone needs it — this is a nice property — but practically speaking it (a) consumes a lot of resources (cost), and (b) does not enable ad-hoc analysis where we want to slice and dice by all the different categories and apps. Which is to say, I'd like to see the full analysis results precomputed and available; the entry barrier created by on-demand classification is something I'd like to avoid.

To be clear, I think what we have here is a really nice proof of concept, and I think we could make it work. I'm just not clear that what we're optimizing for in this approach is aligned with what we're after?

> Why don't we just integrate the full Wappalyzer script into the WebPageTest agent directly? That won't give us the ability to retroactively detect things for previous runs but presumably neither will doing it in the dataflow.

@pmeenan it does: with Dataflow we can reprocess any set of past HARs to regenerate the tables and include the Wappalyzer analysis — that's the main feature I'm after; it allows us to rerun past years of crawls and see how the stats have changed. Also, in the future, if we update or improve the logic, we can reprocess to improve accuracy.

rviscomi commented 6 years ago

> That query is incomplete, correct? Because there are also meta + script and other rules we should be checking for.

Correct, there are other signals that need to be checked. Timing estimates are a lower bound.

> TBH, it seems a little odd to me to replicate a parallel fanout powered by BQ when DataFlow is optimized precisely for such tasks: it also requires setting up pubsub channels, creating a job manager workflow to spawn all the tasks, etc.

> To be clear, I think what we have here is a really nice proof of concept, and I think we could make it work. I'm just not clear that what we're optimizing for in this approach is aligned with what we're after?

To me this problem is related to generating reports for the beta website: we have a set of ~50 queries that run when the crawl is done, the results of which are packaged up for later exploration. The planned solution for that is to use PubSub to kick off the generation after the crawl is available in BQ. In this case, the output is another BQ table rather than JSON files in Cloud Storage.

One thing I haven't thought too much about yet is the effect of increasing capacity on BQ. What kinds of queries would suddenly fail on a 6 TB table?

I'll concede that Dataflow may be the optimal solution, but having a BigQuery solution feels like a step in the right direction. It may not be optimal, but it gives us something to play with right now. In the discussion around capacity expansion, one principle we agreed on is that it's better to make incremental progress (rolling out CrUX origins in stages) rather than waiting for the optimal solution.

> That's fair. Realistically though, I wouldn't expect anyone to run this code themselves — it's too costly. The WP analysis alone would eat up your entire free tier?

Well the use case I had in mind is "hey this site shouldn't be classified as WordPress, how did that happen?" So querying for a single page's response_bodies will be much more affordable.
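
For example, a single-page spot check might look something like this — a sketch following the same pattern as the classification query earlier in this thread, with a placeholder URL:

#standardSQL
# Which WordPress html signals match this one page's HTML document?
SELECT
  url,
  name,
  html
FROM
  `httparchive.response_bodies.2018_02_15_desktop`, (
  SELECT
    name,
    html
  FROM
    `httparchive.wappalyzer.apps`,
    UNNEST(implies) AS implies,
    UNNEST(html) AS html
  WHERE
    name = 'WordPress'
    OR implies = 'WordPress')
WHERE
  page = 'https://www.example.com/'  -- placeholder for the page in question
  AND url = page
  AND REGEXP_CONTAINS(body, html)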

> with Dataflow we can reprocess any set of past HARs to regenerate the tables and include the Wappalyzer analysis — that's the main feature I'm after; it allows us to rerun past years of crawls and see how the stats have changed. Also, in the future, if we update or improve the logic, we can reprocess to improve accuracy.

I agree with all of this. If we do go with a Dataflow solution, I'd like to find a way to avoid the need to regenerate tables any time we're looking to rerun the platform detection. eg two pipelines that can be initiated independently, but are run serially by default. (the compartmentalization point earlier)

--

Another significant benefit of post-processing with BQ that I forgot to mention earlier is extensibility. I'd love it if people could contribute more jobs like this to enhance the usefulness of the BQ data. But honestly, Dataflow is a significantly higher barrier to entry than BQ. I maintain it and I'm still averse to touching it. It's written in a language used nowhere else in the project, so context switching into Java-mode takes time for me personally, and it's a core part of the data pipeline, so a bug there could be fatal. I can't imagine how someone with little HTTP Archive expertise would have the confidence to alter such a major part of the pipeline. On the other hand, I'm attracted to the idea that if you can write a query, you could create a new extension.

igrigorik commented 6 years ago

> To me this problem is related to generating reports for the beta website: we have a set of ~50 queries that run when the crawl is done, the results of which are packaged up for later exploration. The planned solution for that is to use PubSub to kick off the generation after the crawl is available in BQ. In this case, the output is another BQ table rather than JSON files in Cloud Storage.

I see, so your argument is that we already need to do a bunch of BQ powered post-processing jobs, and this analysis could be added to that set — that's fair. The things that worry me here are:

  1. Wappalyzer will add many hundreds of new queries that we'll need to execute, and I'm wondering if and how this will scale, plus the overhead compute costs of this model vs. batch-optimized pipeline.
  2. Outputs of BQ jobs would need to be separate tables, since we can't update the input in place. Of course, you can join against the output tables, but this adds a lot more query complexity and inflates costs. I'm trying to find older threads on this (my search-fu is failing me), but we had feedback that joining "alexa rank" data against HARs was both much slower and significantly more costly. Having this data as part of the table makes things much easier.

Neither of the above is a showstopper. If we already need to set up the pubsub channel and implement that workflow, then prototyping Wappalyzer as part of that pipeline is a reasonable first step. Worst case, we find that it doesn't scale, and at that point we can come back to other options. The second point, however, is trickier to evaluate in terms of downstream cost.

> One thing I haven't thought too much about yet is the effect of increasing capacity on BQ. What kinds of queries would suddenly fail on a 6 TB table?

Not sure I follow. Why would our tables be that large?

> I agree with all of this. If we do go with a Dataflow solution, I'd like to find a way to avoid the need to regenerate tables any time we're looking to rerun the platform detection. eg two pipelines that can be initiated independently, but are run serially by default. (the compartmentalization point earlier)

It's not obvious to me that spinning up parallel pipelines is actually significantly more efficient. A big chunk of pipeline time is spent bringing data in and out of storage, and DF does a great job of merging steps in larger flows to minimize these costs.


I think I understand your motivation better now. There are tradeoffs on both ends here, and I think we can make good cases for either one. We can take a run at the BQ approach you're suggesting. I think it'll get us where we want to be; worst case, we can backtrack and go with Dataflow.

Practical question: how far is the pubsub pipeline you're describing from reality? :)

rviscomi commented 6 years ago

> Not sure I follow. Why would our tables be that large?

The current response_bodies tables are ~913 GB. If/When we increase to CrUX-like capacity, we'd be going from 500k URLs to 3M URLs, or ~6x, so the response_bodies tables would increase proportionally.

> It's not obvious to me that spinning up parallel pipelines is actually significantly more efficient.

Sorry, to clarify I meant the ability to say "rerun Wappalyzer for these older crawls" without doing the other pipeline steps of creating the regular HAR tables. IOW, starting a new pipeline at "step 3" (HAR tables already exist) rather than "step 1" (generate HAR tables from HAR files).

> Practical question: how far is the pubsub pipeline you're describing from reality? :)

Hope to have it working this week!

igrigorik commented 6 years ago

> The current response_bodies tables are ~913 GB. If/When we increase to CrUX-like capacity, we'd be going from 500k URLs to 3M URLs, or ~6x, so the response_bodies tables would increase proportionally.

Gotcha. Well, I'm pretty sure BQ is set up to process much larger datasets than that. :)

> Hope to have it working this week!

Awesome. Full speed ahead, sir! 👏

rviscomi commented 6 years ago

Update on the pubsub pipeline: https://github.com/HTTPArchive/beta.httparchive.org/issues/24

Summary: I identified 3 possible architectural solutions: Compute Engine, Cloud Functions, and App Engine. GAE seems like the best approach so I described how I'd use it to build out the pipeline. LMK if you have any comments on that design, otherwise I'll get started on it. Was hoping to have it built already but felt that it needed more thoughtful consideration first.

igrigorik commented 6 years ago

Left a few questions on https://github.com/HTTPArchive/beta.httparchive.org/issues/24#issuecomment-372001463 :)

pmeenan commented 6 years ago

FWIW, I just integrated Wappalyzer into the agents so the HARs will have an entry in the page-level data starting with the April 1 crawl:

"_detected": {
    "Programming Languages": "PHP",
    "JavaScript Frameworks": "Lightbox,jQuery 2.2.14,jQuery Migrate 1.4.1",
    "Blogs": "WordPress",
    "Web Servers": "Nginx 1.10.3",
    "Font Scripts": "Font Awesome,Google Font API",
    "Photo Galleries": "NextGEN Gallery 2.2.14",
    "CMS": "WordPress",
    "Operating Systems": "Ubuntu"
},
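
Once the April crawl lands, a rough sketch of querying this — assuming the page-level _detected object ends up inside the pages table's payload JSON (the table name and JSON path below are assumptions):

#standardSQL
# Count pages whose detected CMS string includes WordPress.
SELECT
  COUNT(DISTINCT url) AS wordpress_pages
FROM
  `httparchive.pages.2018_04_01_desktop`
WHERE
  JSON_EXTRACT_SCALAR(payload, '$._detected.CMS') LIKE '%WordPress%'
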
igrigorik commented 6 years ago

@pmeenan awesomesauce.

rviscomi commented 6 years ago

@pmeenan thank you that's amazing and saves me a lot of work. I love that it's integrated upstream in WPT so there's no extra processing to make it available in the HAR. A couple of questions on your integration:

I suspect there's only so much parsing we can do from the raw output. But again, thank you!

pmeenan commented 6 years ago

The confidence factors are not included, but I could potentially expose the raw details. I have it formatted the way it is to make it easier to query in various places, so we could do a LIKE "%jQuery%" and extract everything that includes jQuery. I can also have it match the LH pattern; it just feels like that is harder to query.

I have access to all of the individual fields for each app detected so it's mostly just how best to format it for consumption.

For other use cases I can just extract detected.CMS and know that it will be a string containing all of the detected CMSs, and just drop it into a database.

pmeenan commented 6 years ago

btw, the confidence factors aren't weighed because I saw a fair number of the detections that would report 0 even though it was clearly detected (like the OS for example).

rviscomi commented 6 years ago

SGTM!

igrigorik commented 6 years ago

Hold the press :-) I realize we're duplicating work, but I'd still like to see Wappalyzer integration within HA pipeline itself, so we can run it against previous runs and rerun it in the future as detection improves.

pmeenan commented 6 years ago

Yep, sorry - didn't mean to confuse. I needed/wanted Wappalyzer in WebPageTest itself for non-HA testing. Let me know if there is anything else you need in the HAR's to be able to run it as part of the pipeline (the local detection uses access to the DOM to query some things for example so may have better coverage unless you also have a serialized DOM).

rviscomi commented 6 years ago

Yeah true we still need to backfill 😁. Pat, we could use existing page metadata (response body, headers) against which the Wappalyzer signals are tested. Adding new stuff to the HAR doesn't help the existing records, so nothing further needed there AFAIK.

But having the detection in WPT does mean that the backfilling should be a one-time job. For example, we can run the Wappalyzer detection across existing crawls and save the results to a new table. Then for future crawls we simply need to parse the HARs to get the detections.

rviscomi commented 6 years ago

Now that we have Wappalyzer data in the pages tables, I've been thinking about how to merge this with the historical detections we do manually in BQ.

Two approaches come to mind:

  1. Backfill and augment all tables between 2016_01_01 and 2018_04_01 to contain exactly the same $._detected and $._detected_apps HAR objects.
  2. Create a new wappalyzer dataset and maintain YYYY_MM_DD tables containing page URL, app, and category fields.

Pros and cons of each:

Backfill and augment

(+) wappalyzer data all in one place (pages dataset) in one consistent format (JSON object)
(+) one-time job to backfill historical data, new data included automatically by WPT
(-) requires JSON_EXTRACT functions to get at data, relatively expensive query cost

New dataset

For example, the table in CSV format might look like this:

page,app,category
example.com,WordPress,CMS

(+) super simple query interface (get all pages that are WordPress, or get all CMS apps, or get all CMS pages — see the sketch below), relatively inexpensive query cost (execution time, resources)
(+) could potentially also store JS library detector results
(-) requires pipeline modification for all existing and future crawls to put data into new dataset
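
As a sketch, those lookups reduce to trivially simple queries (the dataset/table name below is a placeholder for whatever we settle on):

#standardSQL
# All pages detected as WordPress in one crawl.
SELECT
  page
FROM
  `httparchive.wappalyzer.2018_04_15_desktop`
WHERE
  app = 'WordPress'

Getting all CMS apps or all CMS pages is just a different WHERE clause (category = 'CMS').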


I'm leaning towards the new dataset approach, mostly for the improved user experience of simplified querying. But I'm not sure if there's any dataset fatigue left over from the recent breakup of har/runs. Curious to hear any other thoughts or suggestions.

paulcalvano commented 6 years ago

I like the new dataset approach. If a new table is created, can we also include pageid and url columns as well so it's easier to join with the summary tables?

So for example:

pageid,page,app,category
1234,https://www.example.com,WordPress,CMS

rviscomi commented 6 years ago

Cool, thanks for your thoughts Paul.

Re: joining tables, wouldn't you be able to join on the page url?

paulcalvano commented 6 years ago

Yes, that should be fine.

If the URL strings don't match exactly, we can always use NET.HOST() on the URLs in both tables (which is what I do with CrUX to join against HA tables).
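
For example, a host-based join might look like this — a sketch where the wappalyzer table name is a placeholder for the proposed new dataset:

#standardSQL
# Join detections to the summary table by host to tolerate minor URL differences.
SELECT
  s.pageid,
  w.page,
  w.category,
  w.app
FROM
  `httparchive.summary_pages.2018_04_15_desktop` AS s
JOIN
  `httparchive.wappalyzer.2018_04_15_desktop` AS w
ON
  NET.HOST(s.url) = NET.HOST(w.page)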

igrigorik commented 6 years ago

Two cents from the peanut gallery: we should build a solution that allows us to retroactively reprocess previous runs at will and with minimum pain. Wappalyzer logic continues to improve and I'm certain that we'll run into edge cases where we find current logic to be insufficient and may want to improve it. Once such improvements land, we should make it easy for ourselves to reprocess previous data.

While a separate dataset is not without its tradeoffs (joins are not free either), I think it's the better long-term strategy.

rviscomi commented 6 years ago

I've been toiling over a query that could simulate what Wappalyzer does in the browser runtime context, except using crawl artifacts like response bodies and HAR headers. I've still got some adjustments to make it work for all apps -- as of now it runs for ~an hour, then runs out of resources and fails. So I added a small WHERE clause to only detect WordPress and other apps that imply WordPress. This query processes 83 TB (!!! $415 !!!):

[Image: BigQuery UI showing the query processing 83 TB]

#standardSQL
# WARNING: Running this query costs $400. You will regret this!
CREATE TEMPORARY FUNCTION getHeader(headers STRING, name STRING) AS (
  REGEXP_EXTRACT(
    headers,
    CONCAT('{"name":"', name, '[^}]*'))
);

CREATE TEMPORARY FUNCTION hasHeader(headers STRING, name STRING, value STRING) AS (
  getHeader(headers, name) IS NOT NULL AND (
    value != '' OR
    REGEXP_CONTAINS(JSON_EXTRACT(getHeader(headers, name), '$.value'), value))
);

SELECT
  table,
  url,
  category,
  app,
  IFNULL(REGEXP_EXTRACT(REGEXP_EXTRACT(body, CONCAT('(?i)<(meta[^>]+name=[\'"]?', meta.name, '[^>]*)')), meta.value), '') AS info
FROM
  (SELECT url, body, _TABLE_SUFFIX AS table FROM `httparchive.response_bodies.*` WHERE url = page)
JOIN
  (SELECT url, JSON_EXTRACT(payload, '$.response.headers') AS headers, _TABLE_SUFFIX AS table FROM `httparchive.requests.*` WHERE url = page)
USING
  (url, table), (
  SELECT
    category,
    apps.name AS app,
    html,
    script,
    header,
    meta
  FROM
    `httparchive.wappalyzer.apps` AS apps
  LEFT JOIN
    UNNEST(categories) AS category
  LEFT JOIN
    UNNEST(html) AS html
  LEFT JOIN
    UNNEST(script) AS script
  LEFT JOIN
    UNNEST(headers) AS header
  LEFT JOIN
    UNNEST(meta) AS meta
  LEFT JOIN
    UNNEST(implies) AS implies
  WHERE
    apps.name = 'WordPress' OR
    implies = 'WordPress')
WHERE
  (html NOT LIKE '%?!%' AND REGEXP_CONTAINS(body, html)) OR
  REGEXP_CONTAINS(body, script) OR
  REGEXP_CONTAINS(REGEXP_EXTRACT(body, CONCAT('(?i)<(meta[^>]+name=[\'"]?', meta.name, '[^>]*)')), meta.value) OR
  hasHeader(headers, header.name, header.value)
GROUP BY
  table,
  url,
  category,
  app,
  body,
  meta.name,
  meta.value

Don't run that query. It's hecka expensive.

I saved the results to a 2.8 GB table: https://bigquery.cloud.google.com/table/httparchive:wappalyzer.wordpress

I compared the number of distinct WordPress URLs for the 2018_04_15 crawl as detected by BigQuery to the same stat as detected by Wappalyzer in WPT.

#standardSQL
# This queries the processed (much smaller) table, so it's OK to run.
SELECT
  COUNT(DISTINCT url)
FROM
  `httparchive.wappalyzer.wordpress`
WHERE
  date = '2018_04_15' AND
  client = 'desktop'
  AND url NOT IN (
  SELECT
    DISTINCT url
  FROM
    `httparchive.apps.2018_04_15_desktop`
  WHERE
    app = 'WordPress')

BigQuery found 199,022 URLs using WordPress. Wappalyzer+WPT found 105,892 URLs. So there are ~95k URLs BigQuery detects as WordPress not detected by Wappalyzer+WPT.

Inverting the query to count the URLs detected by Wappalyzer+WPT not detected by BigQuery, there are much fewer: 1,842.
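
For reference, the inverse count is just the same query with the two tables swapped:

#standardSQL
# URLs detected as WordPress by Wappalyzer+WPT but not by the BigQuery pass.
SELECT
  COUNT(DISTINCT url)
FROM
  `httparchive.apps.2018_04_15_desktop`
WHERE
  app = 'WordPress'
  AND url NOT IN (
  SELECT
    DISTINCT url
  FROM
    `httparchive.wappalyzer.wordpress`
  WHERE
    date = '2018_04_15'
    AND client = 'desktop')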

BigQuery finds almost 2x as many detections as the runtime! That's weird. I sampled a few of the BQ-only URLs and they do seem to be valid WordPress. Because my query also includes URLs for detections that imply WordPress, there are some indirect positives in there. Excluding implications, the number of distinct URLs detected by BigQuery drops to 429. So Wappalyzer+WPT is likely not returning WordPress when its implications are detected. @pmeenan is this a bug or a feature?

Also, the 199,022 detections by BigQuery are very likely to contain some false positives. I don't take Wappalyzer's confidence weights into consideration, so if an app even weakly implies WordPress and that app is present, I assume WordPress is there too. We know global WordPress adoption is around 30%, but this puts the ~460k pages crawled by HA at more like 43% WordPress.

That said, I'm getting close to an all-purpose query for all apps. Hope to wrap that up after conferencepalooza this month.

pmeenan commented 6 years ago

Would be a bug if implications aren't caught. Can you send over a few urls that bigquery detected that WPT did not?



rviscomi commented 6 years ago

http://www.tonybuzan.com/
http://www.tsakirismallas.gr/
http://www.srmu.ac.in/
http://www.drforogh.com/
http://www.drmehrabian.com/
http://www.thearda.com/
http://www.marypaz.com/
http://www.xpock.com.br/
http://www.homedepotgardenclub.com/
http://www.bosanova.es/
#standardSQL
SELECT
  DISTINCT url
FROM
  `httparchive.wappalyzer.wordpress`
WHERE
  date = '2018_04_15' AND
  client = 'desktop'
  AND url NOT IN (
  SELECT
    DISTINCT url
  FROM
    `httparchive.apps.2018_04_15_desktop`
  WHERE
    app = 'WordPress')
LIMIT 10

pmeenan commented 6 years ago

Wappalyzer js was throwing an exception when the headers array was empty so detection was failing entirely for those pages. I forked the js and added some exception handling around the relevant parts and it is working now. Should be fixed in the next crawl.

The inferred apps get processed automatically so that's not an issue.

pmeenan commented 6 years ago

I stand (partially) corrected. That was the problem with the first URL. The others are running the detection fine but not detecting WordPress (even implied). Digging in deeper now.

pmeenan commented 6 years ago

Debugging https://www.srmu.ac.in/ specifically, it looks like BQ found it because revslider implies WordPress but the actual definition for revslider looks like:

    "Revslider": {
      "cats": [
        19
      ],
      "icon": "revslider.png",
      "html": [
        "<link[^>]* href=[\\'\"][^']+revslider[/\\w-]+\\.css\\?ver=([0-9.]+)[\\'\"]\\;version:\\1"
      ],
      "script": "/revslider/[/\\w-]+/js",
      "implies": "WordPress",
      "website": "https://revolution.themepunch.com/"
    },

in the actual page code there is a script block that contains inline code:

jsFileLocation: "//server.local/revslider/wp-content/plugins/revslider/public/assets/js/",

The "script" definition block in Wappalyzer itself only runs against "src" urls of all of the script tags, not the actual script body. In at least this case, Wappalyzer itself doesn't really detect it but running the regex across the whole body happens to catch it inadvertently.

I need to look through more examples but running anything except for the html detections against the bodies will result in more detections than what Wappalyzer itself detects (some false positives, some valid detections). In this case the existing Wappalyzer detection rule needs some improvement to detect revslider correctly on the page.

I'm trying to validate that it doesn't really detect it through the Wappalyzer web UI but it isn't completing checks right now (claims it is too busy and has been for the last few hours).

pmeenan commented 6 years ago

Making some progress.

These are Magento, not Wordpress (the BQ query caught them as false positives):

http://www.tsakirismallas.gr/
https://www.bosanova.es/
http://www.marypaz.com/

These use "Slider Revolution" which is available in both a WordPress and JQuery form. The BQ query caught some code inside of an inline js function that incorrectly flagged it as the WordPress version:

http://www.srmu.ac.in/
http://www.thearda.com/

This page is hacked and does a javascript redirect to a scam site. The BQ query sees the original HTML before the redirect (WordPress) and the Wappalyzer scan in the agent sees the final HTML of the scam page (not WordPress):

http://www.xpock.com.br/

These pages have libraries that alter "Array" globally on the page in ways that break the Wappalyzer scan, which is why WPT was not capturing results. Long-term it may make sense to port the Wappalyzer logic to Python and run it outside of the browser. In the meantime I'm hardening the Wappalyzer code to handle the issues as I find them. If you know of a way to say "use the native Array for all of this code" it would be very welcome :-)

http://www.tonybuzan.com/
http://drmehrabian.com/

After I get the last 2 pages patched I'll look at some more of the differences because it is helping catch edge cases.

pmeenan commented 6 years ago

PR created for Wappalyzer with hardening to deal with the 2 cases where Array was extended. The WPT agents are already running the hardened version.

pmeenan commented 6 years ago

I ran through the first 30 that came back from the query you shared as being different and WPT matches what the Wappalyzer extension shows. I think the BQ regex is tripping a lot of false positives.

There were 3 Magento sites in the list that BQ flagged as WP:

http://www.artik.com/
http://www.glisse-proshop.com/
http://www.brain-effect.com/

There was one Drupal site:

http://www.quirk.biz/

One of the sites was hacked and redirecting (probably WP for the origin), but the majority of the remainder were not detected as WP by the Wappalyzer extension either. It's possible some of them are, but catching them by accident is probably not a good thing since it's supposed to be using the Wappalyzer patterns.

I am pretty comfortable that WPT is being faithful to the Wappalyzer engine, modulo any more edge cases where in-page JS breaks the auditing. It would be an interesting exercise to compare the list against BuiltWith and see if there is room to improve the Wappalyzer WordPress detection (pretty sure there is, since it only looks for a couple of plugins).

igrigorik commented 6 years ago

> So I added a small WHERE clause to only detect WordPress and other apps that imply WordPress. This query processes 83 TB (!!! $415 !!!)... That said, I'm getting close to an all-purpose query for all apps. Hope to wrap that up after conferencepalooza this month.

Wowza, that's concerning both in terms of cost and processing time. Rick, are you hinting that you might still be able to run all detections from within a single query and within a similar cost footprint? If the outcome here is that we have to explicitly single out specific categories, then that makes me wonder if we ought to reconsider the Dataflow route.

rviscomi commented 6 years ago

Yeah I'm aiming for a single query that can do all detections going back to 2016. I think it's just a matter of optimizing the query and saving intermediate results.

rviscomi commented 6 years ago

Following up on the previous comment: a single query isn't possible on its own due to resource limitations. See https://discuss.httparchive.org/t/really-big-queries-on-bigquery/1397 for documentation on how I split the query into 175 smaller jobs. If we ever need to rerun the Wappalyzer rules, e.g. if a new technology is added or a rule is changed, we'd need to repeat this process.
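
One way to carve the work into smaller jobs is to scope the wildcard scans from the big query above down to a single crawl (and, where needed, a subset of apps); the exact splitting is documented in the linked post. An illustrative sketch of one such job, limited to html signals for one crawl:

#standardSQL
# One smaller job: html-signal detection against a single crawl's bodies.
SELECT
  url,
  category,
  app
FROM
  (SELECT url, body
   FROM `httparchive.response_bodies.2016_01_01_desktop`
   WHERE url = page), (
  SELECT
    category,
    apps.name AS app,
    html
  FROM
    `httparchive.wappalyzer.apps` AS apps
  LEFT JOIN
    UNNEST(categories) AS category
  LEFT JOIN
    UNNEST(html) AS html)
WHERE
  REGEXP_CONTAINS(body, html)
GROUP BY
  url,
  category,
  app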

rviscomi commented 6 years ago

Closing this as we now have a working pipeline with Wappalyzer detection built into WPT and a backfill of historical data generated by BigQuery. See the technologies dataset.
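
For anyone landing here later, a sketch of querying that dataset — assuming per-crawl tables with url, category, and app columns, as discussed above (the table name is illustrative):

#standardSQL
# Most common CMS technologies in one crawl.
SELECT
  app,
  COUNT(DISTINCT url) AS pages
FROM
  `httparchive.technologies.2018_04_15_desktop`
WHERE
  category = 'CMS'
GROUP BY
  app
ORDER BY
  pages DESC
LIMIT 10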