HTTPArchive / legacy.httparchive.org

<<THIS REPOSITORY IS DEPRECATED>> The HTTP Archive provides information about website performance such as # of HTTP requests, use of gzip, and amount of JavaScript. This information is recorded over time revealing trends in how the Internet is performing. Built using Open Source software, the code and data are available to everyone allowing researchers large and small to work from a common base.
https://legacy.httparchive.org

Structured Data 2021 #218

Closed GregBrimble closed 3 years ago

GregBrimble commented 3 years ago

https://github.com/HTTPArchive/almanac.httparchive.org/issues/2174
https://docs.google.com/document/d/19KDSv4olAXUHUV6hq4X4Cb-lNziqvVesgXXxVktrw4c/edit#

GregBrimble commented 3 years ago

The only non-JSON-LD type I haven't enhanced in this most recent commit is Dublin Core. Dublin Core can be represented in a number of different ways, so again, I think if we tried here, we'd likely fail in collecting some implementations. Perhaps more can be done later with the raw HTML.

jonoalderson commented 3 years ago

> The only non-JSON-LD type I haven't enhanced in this most recent commit is Dublin Core. Dublin Core can be represented in a number of different ways, so again, I think if we tried here, we'd likely fail in collecting some implementations. Perhaps more can be done later with the raw HTML.

I'd be happy limiting this to extracting any <meta> tag with a name attribute beginning with DC (case-insensitive). I think that gives us the majority of 'normal' usage.
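
A minimal sketch of that rule, for illustration only. It is written in Python with the standard-library html.parser rather than as the actual custom metric, and the names DublinCoreMetaParser and extract_dublin_core are hypothetical. It reads "beginning with DC" literally, so it also matches prefixes such as DCTERMS and may over-match unrelated names that happen to start with "dc".

```python
from html.parser import HTMLParser


class DublinCoreMetaParser(HTMLParser):
    """Collect <meta> tags whose name attribute starts with "dc" (any case).

    Hypothetical helper for illustration; not the HTTP Archive implementation.
    """

    def __init__(self):
        super().__init__()
        self.matches = []  # (name, content) tuples, in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # A valueless or missing name attribute is treated as an empty string.
        name = attr_map.get("name") or ""
        # Literal reading of the rule: case-insensitive "dc" prefix.
        if name.lower().startswith("dc"):
            self.matches.append((name, attr_map.get("content")))


def extract_dublin_core(html):
    """Return (name, content) pairs for every matching <meta> tag in html."""
    parser = DublinCoreMetaParser()
    parser.feed(html)
    parser.close()
    return parser.matches
```

Tags that carry Dublin Core data in other ways (for example <link rel="schema.DC">, or RDFa attributes) are deliberately not captured by this sketch, which is the limitation discussed above.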

GregBrimble commented 3 years ago

Cool. As long as you include that disclaimer in your writing, that's good with me :) I'll add that in now.

GregBrimble commented 3 years ago

The full meta/link tags were being saved before we added the more specific checks. Twitter, Facebook and OpenGraph should all be covered now. Because Dublin Core is so expressive, it's the only one the specific checks we've implemented might not fully capture.

If it's a storage space concern, I think we could nix it (@jono-alderson has already said they're fine with just capturing tags which begin with DC). But if we can keep it, we might find other Dublin Core tags, which we'd otherwise lose.

jonoalderson commented 3 years ago

Yeah, let's simplify and accept that we might miss some stuff. First year, let's keep it simple.

GregBrimble commented 3 years ago

Consider it gone!