HTTPArchive / tech-report-apis

APIs for the HTTP Archive Technology Report
Apache License 2.0

Category API #6

Open rviscomi opened 11 months ago

rviscomi commented 11 months ago

For feature parity in v1 we'll also need an API to list all of the technologies for each category.

You can see how it works in the existing dashboard: enter a category name, and the Technology dropdown updates to display only the technologies of the filtered category.

The shape of the API should be an object where the keys are category names and the values are arrays of technologies sorted by popularity:

{
  "Most popular category by total number of origins": [
    "Most popular technology in the category",
    "Second most popular technology",
    "..."
  ],
  "Second most popular category": [
    "..."
  ]
}
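As a rough illustration of how a client might read this shape, here is a minimal Python sketch. The sample payload uses made-up placeholder names, and the helpers `category_names` and `technologies_in` are hypothetical. It relies on `json.loads` preserving key order, so the popularity ordering survives parsing.

```python
import json

# Placeholder payload in the shape described above (names are made up).
SAMPLE = """
{
  "Category A": ["Tech 1", "Tech 2"],
  "Category B": ["Tech 3"]
}
"""

def category_names(payload):
    """Return category names in payload order (most popular first)."""
    return list(payload)

def technologies_in(payload, category):
    """Return the technologies for one category, or [] if it is unknown."""
    return payload.get(category, [])

payload = json.loads(SAMPLE)
print(category_names(payload))
print(technologies_in(payload, "Category B"))
```

Because the ordering lives in the key order and the array order, a consumer doesn't need the origin counts to render the dropdowns.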

Here's an example query to extract the categories:

WITH categories AS (
  SELECT
    category,
    COUNT(DISTINCT root_page) AS origins
  FROM
    `httparchive.all.pages`,
    UNNEST(technologies) AS t,
    UNNEST(t.categories) AS category
  WHERE
    date = '2023-08-01' AND
    client = 'mobile'
  GROUP BY
    category
),

technologies AS (
  SELECT
    category,
    technology,
    COUNT(DISTINCT root_page) AS origins
  FROM
    `httparchive.all.pages`,
    UNNEST(technologies) AS t,
    UNNEST(t.categories) AS category
  WHERE
    date = '2023-08-01' AND
    client = 'mobile'
  GROUP BY
    category,
    technology
)

SELECT
  category,
  categories.origins,
  ARRAY_AGG(technology ORDER BY technologies.origins DESC) AS technologies
FROM
  categories
JOIN
  technologies
USING
  (category)
GROUP BY
  category,
  categories.origins
ORDER BY
  categories.origins DESC

I've formatted the output and saved the results to a static file: https://github.com/HTTPArchive/tech-report-apis/blob/main/static/categories.json

Also available via the CDN: https://cdn.httparchive.org/reports/cwvtech/categories.json

cc @sarahfossheim

maceto commented 10 months ago

@rviscomi, should we have any mandatory param for this endpoint?

rviscomi commented 10 months ago

I'd say only the category name should be a required parameter, but I'll defer to @sarahfossheim if it'd be useful to have any special behavior when it's omitted. For example, maybe it could list only the category names.

sarahfossheim commented 10 months ago

We do need to get the list of category names as well (for the category filter dropdown), so that'd be useful, yes.
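One possible reading of that behavior, sketched in Python: return only the category names when no category is given, and the filtered object otherwise. `select_categories` and the sample payload are hypothetical, and this is not necessarily how the endpoint is implemented.

```python
def select_categories(payload, requested=None):
    """Return category names if nothing is requested, else the matching entries."""
    if not requested:
        # No category given: list only the names, in popularity order.
        return list(payload)
    wanted = set(requested)
    # Keep the payload's own ordering and skip unknown names.
    return {name: techs for name, techs in payload.items() if name in wanted}

# Made-up payload, for illustration only.
payload = {
    "Category A": ["Tech 1", "Tech 2"],
    "Category B": ["Tech 3"],
}

print(select_categories(payload))
print(select_categories(payload, ["Category B"]))
```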

maceto commented 10 months ago

Here's an example of how to consume this endpoint.

For one category or multiple categories:

curl --request GET \
  --url 'https://dev-gw-2vzgiib6.ue.gateway.dev/v1/categories?category=["Blogs"]'
curl --request GET \
  --url 'https://dev-gw-2vzgiib6.ue.gateway.dev/v1/categories?category=["Blogs","Domain parking"]'

Or, for only the category names:

curl --request GET \
  --url 'https://dev-gw-2vzgiib6.ue.gateway.dev/v1/categories?onlyname=true'

@rviscomi @sarahfossheim let me know if this format works for you.

rviscomi commented 10 months ago

Per our chat, let's change the parameter to a comma-separated list (here and in the other APIs):

https://dev-gw-2vzgiib6.ue.gateway.dev/v1/categories?category=Blogs,Domain%20parking

On the frontend, we'll need to URL-encode each input param.
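A minimal sketch of that encoding, in Python for illustration (the real frontend would do the equivalent in its own language). Each value is percent-encoded on its own and the results are joined with literal commas, so a name that itself contains a comma becomes `%2C` and can't be mistaken for a separator. `build_categories_url` and `parse_category_param` are hypothetical helpers, and the parser assumes it receives the still-encoded query value.

```python
from urllib.parse import quote, unquote

# Base URL taken from the example in this comment.
BASE_URL = "https://dev-gw-2vzgiib6.ue.gateway.dev/v1/categories"

def build_categories_url(categories):
    """Percent-encode each category, then join with literal commas."""
    encoded = ",".join(quote(name, safe="") for name in categories)
    return f"{BASE_URL}?category={encoded}"

def parse_category_param(raw):
    """Split the still-encoded value on commas, then decode each part."""
    return [unquote(part) for part in raw.split(",") if part]

url = build_categories_url(["Blogs", "Domain parking"])
print(url)
print(parse_category_param("Blogs,Domain%20parking"))
```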

maceto commented 10 months ago

@rviscomi @sarahfossheim all the changes discussed are already deployed.

New URL https://dev-gw-2vzgiib6.uk.gateway.dev/v1/categories

Documentation: https://github.com/HTTPArchive/tech-report-apis#get-categories

maceto commented 7 months ago

Hi @rviscomi

Why does the query for categories contain WHERE ... client = 'mobile'? Are there no categories for desktop?

rviscomi commented 7 months ago

Every technology category that exists on desktop pages almost certainly exists on mobile, so this was a small query optimization to avoid processing half the dataset.