HTTPArchive / tech-report-apis

APIs for the HTTP Archive Technology Report
Apache License 2.0

Technologies report #2

Open maceto opened 11 months ago

maceto commented 11 months ago

Could you describe the origin/source of this data?

sarahfossheim commented 11 months ago

I'm not sure if all of that information is available already, or where it lives in that case. But the technology name and category names are already available and in use, e.g.: https://cdn.httparchive.org/reports/cwvtech/ALL/ALL/jQuery.json

{
    "app": "jQuery",
    "category": "JavaScript libraries, Miscellaneous, Static site generator"
}

I don't think the description exists anywhere yet, but if the first aim is feature parity, then the main thing we need right now is name + categories.

The similar technologies can probably be based on the category names, if there's no data on it yet?

rviscomi commented 10 months ago

SELECT
  client,
  app AS technology,
  # TODO
  NULL AS description,
  # CSV format
  category,
  # TODO: other technologies within category?
  NULL AS similar_technologies,
  origins
FROM
  `httparchive.core_web_vitals.technologies`
WHERE
  date = '2023-07-01' AND
  geo = 'ALL' AND
  rank = 'ALL'
ORDER BY
  origins DESC

rviscomi commented 10 months ago

@sarahfossheim how should we source the similar_technologies field, something like "top 3 technologies within same category"?

Also note that the description field isn't set in BigQuery so we'll leave it null for now.

maceto commented 10 months ago

@rviscomi, should we have any mandatory param for this endpoint?

rviscomi commented 10 months ago

I think just technology

cc @sarahfossheim

sarahfossheim commented 10 months ago

I think for the first version something like you said can make sense: technologies with at least one category in common, sorted by number of origins, and then pick the top 3 (or maybe top 5?).

Or maybe an alternative could be:

Then technologies that have many categories in common will come up, even if they're new or niche technologies with few origins, which I think makes more sense when it comes to pinning down similar technologies.
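
To make the two options concrete, here is a minimal Python sketch, not the deployed implementation. It assumes rows shaped like the query output above (`technology`, a comma-separated `category` string, `origins`) for a single client, geo and rank. The `rank_by="overlap"` mode is only one reading of the alternative (most shared categories first, origins as tie-breaker), since the details of that option are not spelled out here. The function and parameter names are hypothetical.

```python
def parse_categories(csv_value):
    """Split the comma-separated category string into a set of names."""
    return {name.strip() for name in csv_value.split(",") if name.strip()}


def similar_technologies(target, rows, limit=3, rank_by="origins"):
    """Return up to `limit` technology names sharing a category with `target`.

    rank_by="origins": candidates sorted by number of origins (first option).
    rank_by="overlap": candidates sorted by number of shared categories,
    then by origins (an assumed reading of the alternative).
    """
    by_name = {row["technology"]: row for row in rows}
    target_categories = parse_categories(by_name[target]["category"])

    candidates = []
    for row in rows:
        if row["technology"] == target:
            continue
        shared = len(target_categories & parse_categories(row["category"]))
        if shared == 0:
            continue  # no category in common, so not a candidate
        candidates.append((row["technology"], shared, row["origins"]))

    if rank_by == "overlap":
        candidates.sort(key=lambda c: (-c[1], -c[2]))
    else:
        candidates.sort(key=lambda c: -c[2])
    return [name for name, _, _ in candidates[:limit]]
```

With the first mode a widely used technology that shares a single category outranks a niche one that shares every category; the second mode reverses that, which is the trade-off described above.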

If any data gets returned along with the technology names (e.g. number of origins), then we also need to pass in the rank and geo, so that the data of the similar technologies is filtered by the same criteria as the data of the current technology.

maceto commented 10 months ago

Example of how to consume this endpoint

  curl --get \
  --url 'https://dev-gw-2vzgiib6.ue.gateway.dev/v1/technologies' \
  --data-urlencode 'category=["Blogs", "CMS", "Ecommerce"]' \
  --data-urlencode 'technology=["WordPress", "Chameleon system"]'

maceto commented 10 months ago

@rviscomi @sarahfossheim, all the changes discussed are already deployed.

New URL: https://dev-gw-2vzgiib6.uk.gateway.dev/v1/technologies

Documentation: https://github.com/HTTPArchive/tech-report-apis#get-technologies
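
For anyone calling the endpoint from code, a minimal Python sketch of building the request URL follows. It assumes the `technology` and `category` parameters take JSON-style arrays, as in the curl example earlier in this thread; the linked documentation is the authority on the actual parameter format. The helper name `build_technologies_url` is hypothetical.

```python
import json
from urllib.parse import urlencode

# URL taken from this comment; it may change as the gateway is redeployed.
BASE_URL = "https://dev-gw-2vzgiib6.uk.gateway.dev/v1/technologies"


def build_technologies_url(technologies, categories=None, base_url=BASE_URL):
    """Build a GET URL with JSON-array parameters, percent-encoded."""
    # Assumption: the API accepts JSON arrays as parameter values.
    params = {"technology": json.dumps(technologies)}
    if categories:
        params["category"] = json.dumps(categories)
    # urlencode escapes quotes, brackets, commas and spaces in the values.
    return f"{base_url}?{urlencode(params)}"


url = build_technologies_url(
    ["WordPress", "Chameleon system"],
    categories=["Blogs", "CMS", "Ecommerce"],
)
```

Encoding the values matters because the raw brackets, quotes and spaces are not valid in a URL, and some clients (curl included) give square brackets a special meaning.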

rviscomi commented 9 months ago

Updated query to pull in the descriptions:

SELECT
  client,
  app AS technology,
  description,
  # CSV format
  category,
  # TODO: other technologies within category?
  NULL AS similar_technologies,
  origins
FROM
  `httparchive.core_web_vitals.technologies`
JOIN
  `httparchive.core_web_vitals.technology_descriptions`
ON
  app = technology
WHERE
  date = '2023-07-01' AND
  geo = 'ALL' AND
  rank = 'ALL'
ORDER BY
  origins DESC

maceto commented 7 months ago

Hi @rviscomi,

Why is there a static date in the WHERE clause: 2023-07-01 for technologies and 2023-08-01 for categories? I think we said this should be the latest month instead.

rviscomi commented 7 months ago

Yeah it should probably track the latest month.
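
A minimal Python sketch of what "track the latest month" could look like on rows that have already been fetched, assuming each row carries an ISO-formatted `date` string as in the WHERE clauses above. In BigQuery itself the same effect could come from comparing `date` to a `MAX(date)` subquery on the same table. The function name is hypothetical.

```python
def latest_month_rows(rows):
    """Keep only the rows whose date equals the most recent date present."""
    if not rows:
        return []
    # ISO dates (YYYY-MM-DD) sort chronologically when compared as strings.
    latest = max(row["date"] for row in rows)
    return [row for row in rows if row["date"] == latest]
```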

Is httparchive.core_web_vitals.technology_descriptions manually or auto generated? If manual, we wouldn't pick up the descriptions for any new technologies, right?
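
One related point: the JOIN in the updated query is an inner join, so a technology with no row in the descriptions table would be dropped from the results entirely, not just returned with an empty description. A minimal Python sketch for spotting such gaps, assuming the technologies rows use `app` and the descriptions rows use `technology` as the key (as the ON clause suggests); the function name is hypothetical.

```python
def technologies_missing_descriptions(technology_rows, description_rows):
    """Return the sorted names of technologies that have no description row."""
    described = {row["technology"] for row in description_rows}
    seen = {row["app"] for row in technology_rows}
    return sorted(seen - described)
```

Running a check like this after each monthly crawl would show which new technologies still need a description.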