nrllh opened this issue 5 months ago
Hello all,
Thanks @nrllh for this summary and for the resources provided to classify these hostnames. Some ideas:
Also, the version of the Topics model used (latest: chrome5) should be encoded somehow (maybe just in the name of the table?). I would then release a second table listing the corrections you have started to manually map (eTLD+1 with corresponding topics). That way, we can then do as you describe: write a joint SQL query/UDF that returns the (corrected) classification that the HTTP Archive would like to use.
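A minimal sketch of what such a joint query could look like, with hypothetical table and column names (a raw-classification table plus a corrections table keyed by eTLD+1):
-- Hypothetical sketch: prefer a manual correction when one exists,
-- otherwise fall back to the raw Topics (chrome5) classification.
SELECT
  raw.hostname,
  COALESCE(corr.topic, raw.topic) AS topic
FROM `project.dataset.topics_chrome5` AS raw
LEFT JOIN `project.dataset.topics_corrections` AS corr
  ON net.reg_domain(raw.hostname) = corr.etld_plus_one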
I suppose the process previously described in 1. would work for the 2024 Almanac, but moving forward, if you want to classify during each crawl (as per 3.), the structure may have to be a bit different.
I made public the setup and script I used for this classification: https://github.com/yohhaan/httparchive-topics-classification for future reference. @rviscomi ping me if you need help/have questions.
Spoke with @pmeenan about this and for now we're hesitant to add this directly into the crawl for maintainability reasons. For unblocking the Web Almanac analysis, we could either host the classifications directly in HTTP Archive's BigQuery project, or if @yohhaan is willing to host it, he'd just need to open up public access to the existing table. My preference would be to host it on HTTP Archive so that we'd absorb any storage costs / maintenance burden. WDYT?
If we do store the unprocessed classifications, I like the idea of using a persistent UDF, maybe something like httparchive.fn.CLASSIFY_HOSTNAME. We could add reusable logic there to handle the googlesyndication edge case and so on. I'd imagine that this is mostly affected by subresources rather than document URLs, so if the UDF adds too much overhead we might want to consider storing the post-processed classifications or having a second UDF for documents.
Longer term, we could re-explore classifying as part of the crawl. We should also re-explore classifying hosts using BigQuery ML, which would let us analyze older crawls.
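For future reference, a rough sketch of the BigQuery ML route, assuming the classifier could be exported as a TensorFlow SavedModel (the bucket path, model name, and input column name are all hypothetical):
-- Hypothetical: import the classifier as a TensorFlow model into BigQuery...
CREATE OR REPLACE MODEL `httparchive.models.topics_classifier`
OPTIONS (
  model_type = 'TENSORFLOW',
  model_path = 'gs://some-bucket/topics-model/*'
);

-- ...then classify hostnames from any crawl, including older ones.
-- The input column name must match the model's serving signature.
SELECT *
FROM ML.PREDICT(
  MODEL `httparchive.models.topics_classifier`,
  (SELECT DISTINCT net.host(url) AS input FROM `httparchive.all.requests`)
);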
I processed the hosts from the last crawl as well, so we now have everything we need. I've also spoken with @yohhaan, and we would like to store the data in HA's project. You can access the data via:
SELECT * FROM `misc-researches.ha.ha_host_categories`
The table consists of 5 columns:
Column | Description
---|---
hostname | hostname of the record
category_id | taxonomy ID (based on v2: taxonomy_v2)
full_category | the full category; the taxonomy has a parent–child structure, and this column provides the full category path
subcategory | the most specific category in the full category path
parent_category | the parent of the subcategory
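For example, to look up the raw rows for a single host (a sketch; npr.org chosen arbitrarily):
SELECT *
FROM `misc-researches.ha.ha_host_categories`
WHERE hostname = 'npr.org'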
I also created a UDF to retrieve the categories for a given host. The function has two parameters: input_host and row_results. The row_results parameter addresses our discussion about hashed subdomains: when row_results is set to false, the function checks whether the host's domain is among the top 50 domains known for hashed subdomains and, if so, returns the category of the domain instead of the hostname's. The function returns a JSON array with the host, category ID, full category, subcategory, and parent category. We may also want to retrieve only the full category of a hostname. Note that some hostnames have multiple categories, which is why the result is an array.
CREATE OR REPLACE FUNCTION `misc-researches.ha.get_host_category`(
input_host STRING,
row_results BOOL
) RETURNS STRING AS (
(
WITH
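-- Map hosts of providers known for hashed subdomains to their registrable domain (eTLD+1)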
host_to_query AS (
SELECT
CASE
WHEN NOT row_results AND net.reg_domain(input_host) IN (
'googlesyndication.com', 'gstatic.com', 'imrworldwide.com', 'cloudfront.net', 'upravel.com',
'adsco.re', 'fastly-insights.com', 'cedexis-radar.net', 'beeline.ru', 'quora.com',
'online-metrix.net', 'ampproject.net', 'bumlam.com', 'forter.com', 'googleusercontent.com',
'dropboxusercontent.com', 'yahoodns.net', 'hinet.net', 'alibaba.com', 'tumblr.com',
'amazonaws.com', 'googleadservices.com', 'akamaihd.net', 'filesusr.com', 'dotnxdomain.net',
'nitrocdn.com', 'doubleclick.net', 'disqus.com', 'business.site', 'softonic.com', 'sensic.net',
'zendesk.com', 'stbid.ru', 'uptodown.com', 'wpengine.com', 'dca0.com', 'onef.pro',
'netdna-ssl.com', 'secureserver.net', 'bandcamp.com', 'parsely.com', 'editmysite.com',
'footprintdns.com', 'ioam.de', 'ridge1.com', 'optimole.com', 'whiteboxdigital.ru',
'sentry.io', 'mysimplestore.com', 'wix-code.com', 'smushcdn.com'
) THEN net.reg_domain(input_host)
ELSE input_host
END AS query_host
),
-- Retrieve the categories
categories_data AS (
SELECT
hostname,
category_id,
full_category,
subcategory,
parent_category
FROM
`misc-researches.ha.ha_host_categories`
WHERE
hostname = (SELECT query_host FROM host_to_query)
)
SELECT
TO_JSON_STRING(ARRAY_AGG(STRUCT(
hostname AS host,
category_id,
full_category,
subcategory,
parent_category
)))
FROM
categories_data
)
);
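Example usage (a sketch; the function returns the matching categories as a JSON string):
SELECT `misc-researches.ha.get_host_category`('npr.org', FALSE)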
This is great, thanks!
I've cloned the data into httparchive.urls.categories:
CREATE OR REPLACE TABLE `httparchive.urls.categories` AS
SELECT * FROM `misc-researches.ha.ha_host_categories`
I've also saved the persistent UDF to httparchive.fn.GET_HOST_CATEGORIES with a couple of changes:
- it returns an ARRAY<STRUCT<...>> directly, rather than a JSON string
- it reads from the categories table in the HA project
Here's the query:
CREATE OR REPLACE FUNCTION `httparchive.fn.GET_HOST_CATEGORIES`(
input_host STRING,
row_results BOOL
) RETURNS ARRAY<STRUCT<
host STRING,
category_id INT64,
full_category STRING,
subcategory STRING,
parent_category STRING
>> AS (
(
WITH
host_to_query AS (
SELECT
CASE
WHEN NOT row_results AND net.reg_domain(input_host) IN (
'googlesyndication.com', 'gstatic.com', 'imrworldwide.com', 'cloudfront.net', 'upravel.com',
'adsco.re', 'fastly-insights.com', 'cedexis-radar.net', 'beeline.ru', 'quora.com',
'online-metrix.net', 'ampproject.net', 'bumlam.com', 'forter.com', 'googleusercontent.com',
'dropboxusercontent.com', 'yahoodns.net', 'hinet.net', 'alibaba.com', 'tumblr.com',
'amazonaws.com', 'googleadservices.com', 'akamaihd.net', 'filesusr.com', 'dotnxdomain.net',
'nitrocdn.com', 'doubleclick.net', 'disqus.com', 'business.site', 'softonic.com', 'sensic.net',
'zendesk.com', 'stbid.ru', 'uptodown.com', 'wpengine.com', 'dca0.com', 'onef.pro',
'netdna-ssl.com', 'secureserver.net', 'bandcamp.com', 'parsely.com', 'editmysite.com',
'footprintdns.com', 'ioam.de', 'ridge1.com', 'optimole.com', 'whiteboxdigital.ru',
'sentry.io', 'mysimplestore.com', 'wix-code.com', 'smushcdn.com'
) THEN net.reg_domain(input_host)
ELSE input_host
END AS query_host
),
-- Retrieve the categories
categories_data AS (
SELECT
hostname,
category_id,
full_category,
subcategory,
parent_category
FROM
`httparchive.urls.categories`
WHERE
hostname = (SELECT query_host FROM host_to_query)
)
SELECT
ARRAY_AGG(STRUCT(
hostname AS host,
category_id,
full_category,
subcategory,
parent_category
))
FROM
categories_data
)
);
Example usage:
SELECT
*
FROM
UNNEST(httparchive.fn.GET_HOST_CATEGORIES('npr.org', TRUE))
Results:
[{
"host": "npr.org",
"category_id": "23",
"full_category": "/Arts \u0026 Entertainment/Music \u0026 Audio",
"subcategory": "Music \u0026 Audio",
"parent_category": "/Arts \u0026 Entertainment/"
}, {
"host": "npr.org",
"category_id": "243",
"full_category": "/News",
"subcategory": "News",
"parent_category": "/News"
}]
@nrllh I'm not so sure about the row_results param. Is it ever useful to look up hosts like 1234.googlesyndication.com? Why not unconditionally map these hosts to their domain, something like WHEN net.reg_domain(input_host) IN ...?
Thank you.
I agree, row_results was unnecessary. We can indeed map these hosts directly to their domains.
Changes made:
- added formatted_input_host to ensure the input is correctly formatted, allowing the function to also handle full URLs
- removed row_results
CREATE OR REPLACE FUNCTION `httparchive.fn.GET_HOST_CATEGORIES`(
input_host STRING
) RETURNS ARRAY<STRUCT<
host STRING,
category_id INT64,
full_category STRING,
subcategory STRING,
parent_category STRING
>> AS (
(
WITH
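-- Normalize the input: net.host() extracts the hostname if a full URL is passed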
formatted_host AS (
SELECT
net.host(input_host) AS formatted_input_host
),
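-- Map hosts of providers known for hashed subdomains to their registrable domain (eTLD+1)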
host_to_query AS (
SELECT
CASE
WHEN net.reg_domain((SELECT formatted_input_host FROM formatted_host)) IN (
'googlesyndication.com', 'gstatic.com', 'imrworldwide.com', 'cloudfront.net', 'upravel.com',
'adsco.re', 'fastly-insights.com', 'cedexis-radar.net', 'beeline.ru', 'quora.com',
'online-metrix.net', 'ampproject.net', 'bumlam.com', 'forter.com', 'googleusercontent.com',
'dropboxusercontent.com', 'yahoodns.net', 'hinet.net', 'alibaba.com', 'tumblr.com',
'amazonaws.com', 'googleadservices.com', 'akamaihd.net', 'filesusr.com', 'dotnxdomain.net',
'nitrocdn.com', 'doubleclick.net', 'disqus.com', 'business.site', 'softonic.com', 'sensic.net',
'zendesk.com', 'stbid.ru', 'uptodown.com', 'wpengine.com', 'dca0.com', 'onef.pro',
'netdna-ssl.com', 'secureserver.net', 'bandcamp.com', 'parsely.com', 'editmysite.com',
'footprintdns.com', 'ioam.de', 'ridge1.com', 'optimole.com', 'whiteboxdigital.ru',
'sentry.io', 'mysimplestore.com', 'wix-code.com', 'smushcdn.com'
) THEN net.reg_domain((SELECT formatted_input_host FROM formatted_host))
ELSE (SELECT formatted_input_host FROM formatted_host)
END AS query_host
),
-- Retrieve the categories
categories_data AS (
SELECT
hostname,
category_id,
full_category,
subcategory,
parent_category
FROM
`httparchive.urls.categories`
WHERE
hostname = (SELECT query_host FROM host_to_query)
)
SELECT
ARRAY_AGG(STRUCT(
hostname AS host,
category_id,
full_category,
subcategory,
parent_category
))
FROM
categories_data
)
);
I'll prepare a document for har.fyi so we can announce it. Please let me know once you have updated the function.
Updated!
SELECT
*
FROM
UNNEST(httparchive.fn.GET_HOST_CATEGORIES('apple.com'))
[{
"host": "apple.com",
"category_id": "528",
"full_category": "/Internet \u0026 Telecom/Mobile Phones",
"subcategory": "Mobile Phones",
"parent_category": "/Internet \u0026 Telecom/"
}, {
"host": "apple.com",
"category_id": "129",
"full_category": "/Computers \u0026 Electronics/Consumer Electronics",
"subcategory": "Consumer Electronics",
"parent_category": "/Computers \u0026 Electronics/"
}]
Hi all,
As previously announced in Slack, we wanted to classify the URLs, and we hope to have this solved soon. We classified over 110M distinct hostnames. In this issue, I want to give you an overview of the method we applied and raise some discussion points.
How we classified and the results
We began by extracting distinct hostnames from all requests and extended the data with the domain name (eTLD+1) of these hostnames; a sketch of this step is shown below. This gave us a total of 110M distinct hostnames to classify. Using @yohhaan's repository, which leverages the Google Topics API, we managed to classify 89% of the hostnames; the remaining 11% could not be classified because the Topics API excludes adult sites (e.g., gambling). The dataset is currently hosted in my private project and can be queried (see the query shared above).
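A minimal sketch of that extraction step, assuming the httparchive.all.requests table and its url column (the exact query we ran may have differed):
-- Sketch: collect every distinct hostname seen in requests,
-- together with its registrable domain (eTLD+1).
SELECT DISTINCT
  net.host(url) AS hostname,
  net.reg_domain(url) AS domain
FROM `httparchive.all.requests`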
Points to discuss
The model's classification for hostnames with hashed subdomains (e.g., the many subdomains of googlesyndication.com) is very inconsistent, likely due to the hash values. Therefore, we should assign the classification of googlesyndication.com to all of its hostnames. This issue also exists for other popular providers. I have created a sheet and manually assigned which popular sites with hashed subdomains should use the classification of the domain instead of the subdomain: Classification of sites with many subdomains.
Many thanks to @yohhaan for his contributions throughout the entire process! He adjusted his repository and ran the scripts for the classification.