archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

Extract hyperlinks from wayback machine #501

Closed (yxzhu16 closed this issue 3 years ago)

yxzhu16 commented 3 years ago

Describe the bug

When extracting links from https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin, which is a Wayback Machine capture, only a few links show up and most of the useful ones are missing.

To Reproduce

Steps to reproduce the behavior:

  1. Load a WARC crawled from https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin
  2. Extract links
  3. Not all of the hyperlinks are showing up

Expected behavior

Around 100 hyperlinks should show up, including at least https://www.bbc.com/pidgin/world-54001522.

ianmilligan1 commented 3 years ago

Thanks for the issue @yxzhu16.

I looked at the source of the page, and here's the HTML where the missing link comes from:

<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522" class="Link-sc-1dvfmi3-5 StyledLink-sc-16i2p1z-2 fdDiSd">Five Tyler Perry movies wey make serious money for di Hollywood newest billionaire</a>

Is it possible our ExtractLinks use of jsoup isn't picking out those re-written links because they're non-traditional?

schmika commented 3 years ago

Hi, I've recently come across the same issue, and I think it's because the link uses a relative URL rather than an absolute one. In the AUT Scala code, ExtractLinks takes up to three parameters:

* @param src the src link
* @param html the content from which links are to be extracted
* @param base an optional base URI

The base URI is required to resolve relative URLs using link.attr("abs:href"). So I think you have to specify a base URI to be able to extract all links. At the moment, however, the Python UDF extract_links only expects 2 parameters, if I understand the code correctly. It may be necessary to adapt the Python UDF to include the base parameter.
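
To illustrate why the base URI matters, here is a minimal Python sketch that uses only the standard library. It is not AUT's implementation and does not use jsoup; the function name extract_links_sketch and the class LinkCollector are hypothetical. It assumes that a relative href which cannot be resolved is dropped, which is how the behavior in this issue looks from the outside.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects (target, anchor_text) pairs from <a href="..."> tags."""

    def __init__(self, base=""):
        super().__init__()
        self.base = base      # optional base URI used to resolve relative hrefs
        self.links = []
        self._href = None     # href of the <a> tag currently open, if any
        self._text = []       # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._href is None:
            return
        href, self._href = self._href, None
        if urlparse(href).scheme:
            target = href                      # already absolute
        elif self.base:
            target = urljoin(self.base, href)  # resolve against the base URI
        else:
            return                             # assumption: unresolvable links are dropped
        self.links.append((target, "".join(self._text).strip()))


def extract_links_sketch(html, base=""):
    """Hypothetical helper: return the links found in html, resolved against base."""
    collector = LinkCollector(base)
    collector.feed(html)
    return collector.links


# The rewritten link quoted earlier in this thread.
snippet = (
    '<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522" '
    'class="Link-sc-1dvfmi3-5 StyledLink-sc-16i2p1z-2 fdDiSd">Five Tyler Perry movies '
    "wey make serious money for di Hollywood newest billionaire</a>"
)
page_url = "https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin"

without_base = extract_links_sketch(snippet)
with_base = extract_links_sketch(snippet, base=page_url)
print(len(without_base), len(with_base))
```

In this sketch the link is lost when no base is given and is recovered as a full web.archive.org URL once the page's own URL is passed as the base, which matches the suggestion that the Python UDF needs a way to accept the base parameter.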

ianmilligan1 commented 3 years ago

Fantastic stuff, @yxzhu16 – thanks so much for the pull request (and for the info on this too @schmika – much appreciated).