HTTPArchive / httparchive.org

The HTTP Archive website hosted on App Engine
https://httparchive.org
Apache License 2.0

Classification of the URLs in HA dataset #868

Open nrllh opened 5 months ago

nrllh commented 5 months ago

Hi all,

As previously announced in Slack, we set out to classify the URLs in the HA dataset, and this is now mostly done: we have classified over 110M distinct hostnames. In this issue, I want to give you an overview of the method we applied, and raise some discussion points.

How we classified and the results

We began by extracting distinct hostnames from all requests and extended the data with the domain name (eTLD+1) of these hostnames. This gave us a total of 110M distinct hostnames to classify. Using @yohhaan’s repository, which leverages the Google Topics API, we managed to classify 89% of the hostnames. The remaining 11% could not be classified because the Topics API excludes sensitive sites (e.g., adult content and gambling). The dataset is currently hosted in my private project and can be queried:

SELECT * FROM misc-researches.ha.ha_categories

Points to discuss

  1. Some hostnames should not be classified individually. For example, googlesyndication.com has 32 million different hostnames, most of which appear to be hashed subdomains.

The model's classification for these hostnames is very inconsistent, likely due to the hashed values in the subdomains. Therefore, we should assign the classification of googlesyndication.com to all of its hostnames. This issue also exists for other popular providers. I have created a sheet and manually marked which popular sites with hashed subdomains should use the classification of the domain instead of the subdomain: Classification of sites with many subdomains.

  2. We need to decide how we want to provide the data. Should we maintain a table containing all entries with their classification? Additionally, we could provide a UDF to query the taxonomy of a given hostname, which could also encapsulate the logic discussed above.
  3. @rviscomi wants to integrate the classification during the crawling process. @yohhaan can assist us with this integration.

Many thanks to @yohhaan for his contributions throughout the entire process! He adjusted his repository and ran the scripts for the classification.

yohhaan commented 5 months ago

Hello all,

Thanks @nrllh for this summary and for the resources provided to classify these hostnames. Some ideas:

  1. It could still be valuable to release the initial classification of all hostnames by the Topics API (i.e., vanilla results even if some are highly influenced by their "hashed subdomains" or if the classification returned Unknown). Also, the version of the Topics model used (latest: chrome5) should be encoded somehow (maybe just in the name of the table?).

I would then release a second table listing the corrections you have started to manually map (eTLD+1 with corresponding topics). That way, we can then do as you describe: write a joint SQL query/UDF that returns the (corrected) classification that the HTTP archive would like to use.

  2. I suppose the process described in point 1 would work for the Web Almanac 2024, but moving forward, if you want to classify during each crawl (as in point 3 of the list above), the structure may have to be a bit different.

  3. I made the setup and script I used for this classification public for future reference: https://github.com/yohhaan/httparchive-topics-classification. @rviscomi ping me if you need help/have questions.

rviscomi commented 5 months ago

Spoke with @pmeenan about this and for now we're hesitant to add this directly into the crawl for maintainability reasons. For unblocking the Web Almanac analysis, we could either host the classifications directly in HTTP Archive's BigQuery project, or if @yohhaan is willing to host it, he'd just need to open up public access to the existing table. My preference would be to host it on HTTP Archive so that we'd absorb any storage costs / maintenance burden. WDYT?

If we do store the unprocessed classifications, I like the idea of using a persistent UDF, maybe something like httparchive.fn.CLASSIFY_HOSTNAME. We could add reusable logic there to handle the googlesyndication edge case and so on. I'd imagine that this is mostly affected by subresources rather than document URLs, so if the UDF adds too much overhead we might want to consider storing the post-processed classifications or having a second UDF for documents.
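As a rough sketch of that idea (the function name is the one proposed above, but the backing table, its column names, and the domain list here are placeholders, not a final schema):

```sql
-- Sketch only: a persistent UDF that collapses known hashed-subdomain
-- providers to their eTLD+1 before looking up a classification.
-- `misc-researches.ha.ha_categories` and its `host`/`full_category`
-- columns are assumptions for illustration.
CREATE OR REPLACE FUNCTION `httparchive.fn.CLASSIFY_HOSTNAME`(hostname STRING)
RETURNS STRING AS ((
  SELECT full_category
  FROM `misc-researches.ha.ha_categories`
  WHERE host = CASE
      WHEN NET.REG_DOMAIN(hostname) IN ('googlesyndication.com' /* , ... */)
        THEN NET.REG_DOMAIN(hostname)
      ELSE hostname
    END
  LIMIT 1
));
```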

Longer term, we could re-explore classifying as part of the crawl. We should also re-explore classifying hosts using BigQuery ML, which would let us analyze older crawls.

nrllh commented 4 months ago

I processed the hosts from the latest crawl as well, so we now have everything we need. I've also spoken with @yohhaan, and we would like to store the data in HA's project. You can access the data via:

SELECT * FROM misc-researches.ha.ha_host_categories

The table consists of 5 columns:

| Column | Description |
| --- | --- |
| hostname | the hostname of the record |
| category_id | the taxonomy ID (based on v2: taxonomy_v2) |
| full_category | the full category path; the taxonomy has a parent-child structure |
| subcategory | the most specific category in the full category path |
| parent_category | the parent of the subcategory |
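As a quick sanity check, the table can be explored with a query like this (a sketch using only the columns documented above):

```sql
-- Sketch: distribution of parent categories across all classified hosts.
SELECT
  parent_category,
  COUNT(*) AS hosts
FROM
  `misc-researches.ha.ha_host_categories`
GROUP BY
  parent_category
ORDER BY
  hosts DESC
LIMIT 10
```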

I also created a UDF to retrieve the categories for a given host. The function has two parameters: input_host and row_results. The row_results parameter addresses our discussion about hashed subdomains: when it is set to false, the function checks whether the host's domain is among the 50 domains known for hashed subdomains, and if so, returns the category of the domain instead of the hostname. The function returns a JSON string containing an array of objects with the host, category ID, full category, subcategory, and parent category. Note that some hostnames have multiple categories, which is why the result is an array; we may also want a variant that returns only the full category of a hostname.

CREATE OR REPLACE FUNCTION `misc-researches.ha.get_host_category`(
  input_host STRING,
  row_results BOOL
) RETURNS STRING AS (
  (
    WITH
      host_to_query AS (
        SELECT
          CASE
            WHEN NOT row_results AND net.reg_domain(input_host) IN (
               'googlesyndication.com', 'gstatic.com', 'imrworldwide.com', 'cloudfront.net', 'upravel.com', 
                'adsco.re', 'fastly-insights.com', 'cedexis-radar.net', 'beeline.ru', 'quora.com', 
                'online-metrix.net', 'ampproject.net', 'bumlam.com', 'forter.com', 'googleusercontent.com', 
                'dropboxusercontent.com', 'yahoodns.net', 'hinet.net', 'alibaba.com', 'tumblr.com', 
                'amazonaws.com', 'googleadservices.com', 'akamaihd.net', 'filesusr.com', 'dotnxdomain.net', 
                'nitrocdn.com', 'doubleclick.net', 'disqus.com', 'business.site', 'softonic.com', 'sensic.net', 
                'zendesk.com', 'stbid.ru', 'uptodown.com', 'wpengine.com', 'dca0.com', 'onef.pro', 
                'netdna-ssl.com', 'secureserver.net', 'bandcamp.com', 'parsely.com', 'editmysite.com', 
                'footprintdns.com', 'ioam.de', 'ridge1.com', 'optimole.com', 'whiteboxdigital.ru', 
                'sentry.io', 'mysimplestore.com', 'wix-code.com', 'smushcdn.com'
            ) THEN net.reg_domain(input_host)
            ELSE input_host
          END AS query_host
      ),
      -- Retrieve the categories
      categories_data AS (
        SELECT
          hostname,
          category_id,
          full_category,
          subcategory,
          parent_category
        FROM
          `misc-researches.ha.ha_host_categories`
        WHERE
          hostname = (SELECT query_host FROM host_to_query)
      )
    SELECT
      TO_JSON_STRING(ARRAY_AGG(STRUCT(
        hostname AS host,
        category_id,
        full_category,
        subcategory,
        parent_category
      )))
    FROM
      categories_data
  )
);
rviscomi commented 4 months ago

This is great, thanks!

I've cloned the data into httparchive.urls.categories:

CREATE OR REPLACE TABLE `httparchive.urls.categories` AS
SELECT * FROM `misc-researches.ha.ha_host_categories`

I've also saved the persistent UDF to httparchive.fn.GET_HOST_CATEGORIES with a couple of changes: it returns a typed ARRAY<STRUCT<...>> directly instead of a JSON string.

Here's the query:

CREATE OR REPLACE FUNCTION `httparchive.fn.GET_HOST_CATEGORIES`(
  input_host STRING,
  row_results BOOL
) RETURNS ARRAY<STRUCT<
  host STRING,
  category_id INT64,
  full_category STRING,
  subcategory STRING,
  parent_category STRING
>> AS (
  (
    WITH
      host_to_query AS (
        SELECT
          CASE
            WHEN NOT row_results AND net.reg_domain(input_host) IN (
               'googlesyndication.com', 'gstatic.com', 'imrworldwide.com', 'cloudfront.net', 'upravel.com', 
                'adsco.re', 'fastly-insights.com', 'cedexis-radar.net', 'beeline.ru', 'quora.com', 
                'online-metrix.net', 'ampproject.net', 'bumlam.com', 'forter.com', 'googleusercontent.com', 
                'dropboxusercontent.com', 'yahoodns.net', 'hinet.net', 'alibaba.com', 'tumblr.com', 
                'amazonaws.com', 'googleadservices.com', 'akamaihd.net', 'filesusr.com', 'dotnxdomain.net', 
                'nitrocdn.com', 'doubleclick.net', 'disqus.com', 'business.site', 'softonic.com', 'sensic.net', 
                'zendesk.com', 'stbid.ru', 'uptodown.com', 'wpengine.com', 'dca0.com', 'onef.pro', 
                'netdna-ssl.com', 'secureserver.net', 'bandcamp.com', 'parsely.com', 'editmysite.com', 
                'footprintdns.com', 'ioam.de', 'ridge1.com', 'optimole.com', 'whiteboxdigital.ru', 
                'sentry.io', 'mysimplestore.com', 'wix-code.com', 'smushcdn.com'
            ) THEN net.reg_domain(input_host)
            ELSE input_host
          END AS query_host
      ),
      -- Retrieve the categories
      categories_data AS (
        SELECT
          hostname,
          category_id,
          full_category,
          subcategory,
          parent_category
        FROM
          `httparchive.urls.categories`
        WHERE
          hostname = (SELECT query_host FROM host_to_query)
      )
    SELECT
      ARRAY_AGG(STRUCT(
        hostname AS host,
        category_id,
        full_category,
        subcategory,
        parent_category
      ))
    FROM
      categories_data
  )
);

Example usage:

SELECT
  *
FROM
  UNNEST(httparchive.fn.GET_HOST_CATEGORIES('npr.org', TRUE))

Results:

[{
  "host": "npr.org",
  "category_id": "23",
  "full_category": "/Arts \u0026 Entertainment/Music \u0026 Audio",
  "subcategory": "Music \u0026 Audio",
  "parent_category": "/Arts \u0026 Entertainment/"
}, {
  "host": "npr.org",
  "category_id": "243",
  "full_category": "/News",
  "subcategory": "News",
  "parent_category": "/News"
}]

@nrllh I'm not so sure about the row_results param. Is it ever useful to look up hosts like 1234.googlesyndication.com? Why not unconditionally map these hosts to their domain, something like WHEN net.reg_domain(input_host) IN ...?

nrllh commented 4 months ago

Thank you. I agree, row_results was unnecessary. We can indeed map these hosts directly to their domains.

Changes made:

  1. Added formatted_input_host to ensure the input is correctly formatted, allowing the function to also handle full URLs.
  2. Removed row_results.
CREATE OR REPLACE FUNCTION `httparchive.fn.GET_HOST_CATEGORIES`(
  input_host STRING
) RETURNS ARRAY<STRUCT<
  host STRING,
  category_id INT64,
  full_category STRING,
  subcategory STRING,
  parent_category STRING
>> AS (
  (
    WITH
      formatted_host AS (
        SELECT
          net.host(input_host) AS formatted_input_host
      ),
      host_to_query AS (
        SELECT
          CASE
            WHEN net.reg_domain((SELECT formatted_input_host FROM formatted_host)) IN (
               'googlesyndication.com', 'gstatic.com', 'imrworldwide.com', 'cloudfront.net', 'upravel.com', 
                'adsco.re', 'fastly-insights.com', 'cedexis-radar.net', 'beeline.ru', 'quora.com', 
                'online-metrix.net', 'ampproject.net', 'bumlam.com', 'forter.com', 'googleusercontent.com', 
                'dropboxusercontent.com', 'yahoodns.net', 'hinet.net', 'alibaba.com', 'tumblr.com', 
                'amazonaws.com', 'googleadservices.com', 'akamaihd.net', 'filesusr.com', 'dotnxdomain.net', 
                'nitrocdn.com', 'doubleclick.net', 'disqus.com', 'business.site', 'softonic.com', 'sensic.net', 
                'zendesk.com', 'stbid.ru', 'uptodown.com', 'wpengine.com', 'dca0.com', 'onef.pro', 
                'netdna-ssl.com', 'secureserver.net', 'bandcamp.com', 'parsely.com', 'editmysite.com', 
                'footprintdns.com', 'ioam.de', 'ridge1.com', 'optimole.com', 'whiteboxdigital.ru', 
                'sentry.io', 'mysimplestore.com', 'wix-code.com', 'smushcdn.com'
            ) THEN net.reg_domain((SELECT formatted_input_host FROM formatted_host))
            ELSE (SELECT formatted_input_host FROM formatted_host)
          END AS query_host
      ),
      -- Retrieve the categories
      categories_data AS (
        SELECT
          hostname,
          category_id,
          full_category,
          subcategory,
          parent_category
        FROM
          `httparchive.urls.categories`
        WHERE
          hostname = (SELECT query_host FROM host_to_query)
      )
    SELECT
      ARRAY_AGG(STRUCT(
        hostname AS host,
        category_id,
        full_category,
        subcategory,
        parent_category
      ))
    FROM
      categories_data
  )
);

I'll prepare a document for har.fyi so we can announce it. Please let me know once you have updated the function.

rviscomi commented 4 months ago

Updated!

SELECT
  *
FROM
  UNNEST(httparchive.fn.GET_HOST_CATEGORIES('apple.com'))

Results:

[{
  "host": "apple.com",
  "category_id": "528",
  "full_category": "/Internet \u0026 Telecom/Mobile Phones",
  "subcategory": "Mobile Phones",
  "parent_category": "/Internet \u0026 Telecom/"
}, {
  "host": "apple.com",
  "category_id": "129",
  "full_category": "/Computers \u0026 Electronics/Consumer Electronics",
  "subcategory": "Consumer Electronics",
  "parent_category": "/Computers \u0026 Electronics/"
}]
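For context, here is one way the function could be combined with crawl data. The `httparchive.all.pages` table and its `date`/`client`/`page` columns are assumptions based on the public HTTP Archive dataset, not something confirmed in this thread, and the crawl date is a placeholder:

```sql
-- Sketch: count distinct pages per parent category for one crawl.
-- Table and column names outside the UDF are assumptions.
SELECT
  category.parent_category,
  COUNT(DISTINCT page) AS pages
FROM
  `httparchive.all.pages`,
  UNNEST(httparchive.fn.GET_HOST_CATEGORIES(NET.HOST(page))) AS category
WHERE
  date = '2024-06-01'  -- placeholder crawl date
  AND client = 'mobile'
GROUP BY
  category.parent_category
ORDER BY
  pages DESC
```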