commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

Add attribute `property` of HTML meta elements #3

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 8 years ago

(reported by Christian Lund on Common Crawl Google group)

For HTML meta elements, only the attributes name, rel, content and http-equiv are extracted. The attribute property is missing, which leads to unpaired, value-only items in the WAT file:

"HTML-Metadata":{"Head":{"Metas":[...,{"content":"website"},...]}}

This happens, for example, with Open Graph properties:

<meta property="og:type" content="website"/>

property is an RDFa attribute and is not part of the HTML standard, but it is widely used. The WAT specification describes the data contained in HTML-Metadata as "attributes and values of HTML head elements: title, base, style, link, meta and script". It does not explicitly restrict extraction to attributes covered by one of the HTML standards, so adding property would not conflict with it.
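To make the effect concrete, here is a minimal Python sketch of whitelist-based meta extraction. It is not the ia-web-commons extractor (which is Java). The names META_ATTRS_BEFORE, META_ATTRS_AFTER, MetaCollector and extract_metas are hypothetical and used only for illustration. It assumes the extractor keeps just the whitelisted attributes of each meta element.

```python
import json
from html.parser import HTMLParser

# Hypothetical whitelists: the attributes named in this issue, without and
# with the proposed "property" attribute. Not taken from ia-web-commons.
META_ATTRS_BEFORE = ("name", "rel", "content", "http-equiv")
META_ATTRS_AFTER = META_ATTRS_BEFORE + ("property",)


class MetaCollector(HTMLParser):
    """Collects the whitelisted attributes of every <meta> element."""

    def __init__(self, allowed):
        super().__init__()
        self.allowed = allowed
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta .../> also end up here, because
        # the default handle_startendtag delegates to handle_starttag.
        if tag != "meta":
            return
        item = {key: value for key, value in attrs if key in self.allowed}
        if item:
            self.metas.append(item)


def extract_metas(html, allowed):
    """Return one dict per <meta> element, restricted to `allowed` attributes."""
    parser = MetaCollector(allowed)
    parser.feed(html)
    parser.close()
    return parser.metas


html = '<head><meta property="og:type" content="website"/></head>'

# Without "property" the value loses its key: [{"content": "website"}]
print(json.dumps(extract_metas(html, META_ATTRS_BEFORE)))

# With "property" the pair stays together:
# [{"property": "og:type", "content": "website"}]
print(json.dumps(extract_metas(html, META_ATTRS_AFTER)))
```

In this sketch the only change needed is one more entry in the whitelist. Whether the actual fix is that small depends on how the library's extractor is structured.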