commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

Complete HTML link extraction to cover all element attributes of type URI #9

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 7 years ago

The HTML specs provide a list of attributes, including the required type of each. All attributes of type URI should be covered by the ExtractingParseObserver when links are extracted and added as "Links" to the WAT file. See

Several attributes are missing, e.g., "cite" for <q> and <blockquote>, or embedded elements introduced with HTML5 (<video>, <audio>).
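
To illustrate the idea, here is a minimal sketch of a lookup table from element names to their URI-typed attributes. It is not the ExtractingParseObserver code: the class `UriAttributeTable`, its methods, and the (deliberately incomplete) attribute selection are hypothetical and only show how missing attributes such as "cite" or the HTML5 media attributes might be covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of ia-web-commons.
final class UriAttributeTable {
    // Element name (lower case) -> attributes whose value is a URI.
    // Only a small, illustrative subset of the attributes in the HTML specs.
    private static final Map<String, List<String>> URI_ATTRS = new HashMap<>();
    static {
        URI_ATTRS.put("a", Arrays.asList("href"));
        URI_ATTRS.put("img", Arrays.asList("src"));
        URI_ATTRS.put("q", Arrays.asList("cite"));
        URI_ATTRS.put("blockquote", Arrays.asList("cite"));
        URI_ATTRS.put("video", Arrays.asList("src", "poster"));
        URI_ATTRS.put("audio", Arrays.asList("src"));
        URI_ATTRS.put("source", Arrays.asList("src"));
    }

    // Returns the URI-typed attributes known for an element, or an empty list.
    static List<String> uriAttributes(String element) {
        List<String> attrs = URI_ATTRS.get(element.toLowerCase(Locale.ROOT));
        return attrs != null ? attrs : Collections.<String>emptyList();
    }

    // Collects the values of all URI-typed attributes present on one element.
    static List<String> extractLinks(String element, Map<String, String> attributes) {
        List<String> links = new ArrayList<>();
        for (String name : uriAttributes(element)) {
            String value = attributes.get(name);
            if (value != null && !value.isEmpty()) {
                links.add(value);
            }
        }
        return links;
    }
}
```

With a table like this, covering another attribute is a one-line change, and a unit test can iterate over the table to check that every entry yields a link.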

This issue should be resolved before #7 and #8, and the fix should include a unit test.