commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

WAT extractor: add attributes of the <html> element as metadata #35

Open sebastian-nagel opened 1 month ago

sebastian-nagel commented 1 month ago

The <html> element allows a number of attributes (the global attributes), some of which might be worth adding as metadata to the WAT JSON data.

Specifically, this is about the "lang" attribute. Here are a few examples:

<html lang="es-MX">
<html lang="zh-CN" xmlns="http://www.w3.org/1999/xhtml">
<html dir="ltr" lang="cs-cz">
<html xml:lang="en-gb" lang="en-gb" >
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="es-MX">
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr" style="overflow-x: hidden !important;">
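To illustrate the kind of extraction this would involve, here is a minimal sketch in Python using only the standard library's html.parser. It is not the ia-web-commons implementation (which is Java), and the names HtmlAttributeExtractor and extract_html_attributes are hypothetical. It simply collects the attributes of the first <html> start tag into a dictionary; how these would be named and placed in the WAT JSON is left open.

```python
from html.parser import HTMLParser


class HtmlAttributeExtractor(HTMLParser):
    """Collects the attributes of the first <html> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.html_attrs = None  # stays None if no <html> tag is seen

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "html" and self.html_attrs is None:
            self.html_attrs = {name: value for name, value in attrs}


def extract_html_attributes(markup):
    """Return the attributes of the <html> element as a dict (empty if absent)."""
    parser = HtmlAttributeExtractor()
    parser.feed(markup)
    return parser.html_attrs or {}


if __name__ == "__main__":
    attrs = extract_html_attributes('<html xml:lang="en-gb" lang="en-gb" >')
    # Both "lang" and "xml:lang" are kept, since the examples above use either or both.
    print(attrs.get("lang"), attrs.get("xml:lang"))
```

Keeping all attributes rather than only "lang" leaves it to the consumer to decide between "lang" and "xml:lang" when a page sets just one of them.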