commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

data-href not captured in WAT 'Links' metadata #7

Closed by e271828- 7 years ago

e271828- commented 7 years ago

As per conversation with Sebastian:

Currently, a major class of links is not captured in the WAT 'Links' metadata. Example:

<div class="fb-video" data-href="https://www.facebook.com/facebook/videos/10153231379946729/" data-width="500" data-show-text="false">

This format is frequently used, e.g. for embedded video.

sebastian-nagel commented 7 years ago

The fb-video links are part of Facebook's social plugins. A solution should include the other Facebook plugins, and be extensible to social links pointing to other networks (Google+, Twitter, etc.)

e271828- commented 7 years ago

data-href is also widely used beyond those social platforms. For example, some CMSs and jQuery plugins generate JS-driven links of the form <div data-href="link"> for navbars etc. I'd thus expect an increase in crawl depth, but this is easy enough to test.
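
To make the idea concrete, here is a minimal sketch of pulling data-href values from any element, not just the Facebook plugin markup. It is not the ia-web-commons implementation (that library is Java); it only uses Python's standard html.parser. The class name DataHrefExtractor, the function extract_data_hrefs, and the example.com base URL are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DataHrefExtractor(HTMLParser):
    """Hypothetical helper: collect data-href values from any element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []  # list of (tag, absolute URL) pairs

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        for name, value in attrs:
            if name == "data-href" and value:
                # Resolve relative targets against the page URL.
                self.links.append((tag, urljoin(self.base_url, value)))


def extract_data_hrefs(html, base_url):
    parser = DataHrefExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Sample markup: the fb-video example from this issue plus an
# assumed navbar-style div with a relative target.
sample = (
    '<div class="fb-video" '
    'data-href="https://www.facebook.com/facebook/videos/10153231379946729/" '
    'data-width="500" data-show-text="false"></div>'
    '<div data-href="/about">About</div>'
)
links = extract_data_hrefs(sample, "https://example.com/index.html")
for tag, url in links:
    print(tag, url)
```

Matching on the attribute rather than on class="fb-video" is what would cover both the social plugins and the CMS/jQuery cases in one pass.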

e271828- commented 7 years ago

Various examples attached.

data-href.examples.html.zip

sebastian-nagel commented 7 years ago

Thanks, @e271828-! The data-href links should be included already in the WAT files of the February crawl.
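
For anyone who wants to check this in the WAT output, a rough sketch follows. It assumes the usual WAT JSON layout (Envelope > Payload-Metadata > HTTP-Response-Metadata > HTML-Metadata > Links) and, as an unverified assumption, that data-href links show up with a "path" value ending in "/data-href". The function name data_href_links and the sample record are invented for illustration, not taken from a real crawl.

```python
import json


def data_href_links(wat_record):
    """Return URLs of links whose path ends in '/data-href' (assumed format)."""
    html_meta = (
        wat_record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    return [
        link["url"]
        for link in html_meta.get("Links", [])
        if link.get("path", "").endswith("/data-href") and "url" in link
    ]


# Hypothetical WAT record payload; the path strings are assumptions.
sample_json = """
{
  "Envelope": {
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Links": [
            {"path": "A@/href", "url": "https://example.com/page"},
            {"path": "DIV@/data-href",
             "url": "https://www.facebook.com/facebook/videos/10153231379946729/"}
          ]
        }
      }
    }
  }
}
"""
found = data_href_links(json.loads(sample_json))
print(found)
```

If the actual path format differs, only the endswith check needs adjusting.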