Extract hyperlinks from wayback machine

archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

https://aut.docs.archivesunleashed.org/

Apache License 2.0

Extract hyperlinks from wayback machine #501

Closed: yxzhu16 closed this issue 3 years ago

yxzhu16 commented 3 years ago

Describe the bug: When extracting links from https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin, which is a Wayback Machine capture, only a few links show up and most of the useful links are missing.

To Reproduce: Steps to reproduce the behavior:

1. Load a WARC crawled from https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin
2. Extract links
3. Not all of the hyperlinks show up

Expected behavior: Around 100 hyperlinks should show up, including at least https://www.bbc.com/pidgin/world-54001522.

Environment information

AUT version: 0.80.1-SNAPSHOT
OS: MacOS 10.15.6
Java version: Java 11
Apache Spark version: 3.0.1
Apache Spark w/aut: --jars
Apache Spark command used to run AUT: run with jupyter notebook

ianmilligan1 commented 3 years ago

Thanks for the issue @yxzhu16.

I looked at the source of the page, and here's the HTML where the missing link comes from:

<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522" class="Link-sc-1dvfmi3-5 StyledLink-sc-16i2p1z-2 fdDiSd">Five Tyler Perry movies wey make serious money for di Hollywood newest billionaire</a>

Is it possible that jsoup, as we use it in ExtractLinks, isn't picking out those rewritten links because they're non-traditional?

schmika commented 3 years ago

Hi, I've recently come across the same issue, and I think it's because the link uses a relative URL instead of an absolute one. In the AUT Scala code, ExtractLinks can take 3 parameters:

* @param src the src link
* @param html the content from which links are to be extracted
* @param base an optional base URI

The base URI is required to resolve relative URLs when using link.attr("abs:href"), so I think you have to specify a base URI to be able to extract all links. At the moment, however, the Python UDF extract_links only accepts 2 parameters, if I understand the code correctly. It may be necessary to adapt the Python UDF to include the base parameter.
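To illustrate why the base URI matters, here is a minimal sketch in Python using only the standard library. It is not AUT or jsoup code: the function name extract_links_sketch and its base argument are hypothetical. Dropping relative links when no base is given is an assumption meant to approximate jsoup's abs:href, which returns an empty string when a relative URL cannot be resolved.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _AnchorCollector(HTMLParser):
    """Collects the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links_sketch(html, base=None):
    """Hypothetical helper: return absolute link targets found in html.

    Relative hrefs are resolved against base when it is given;
    without a base they cannot be made absolute and are dropped.
    """
    collector = _AnchorCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        resolved = urljoin(base, href) if base else href
        parsed = urlparse(resolved)
        # Keep only URLs that ended up with a scheme and a host.
        if parsed.scheme and parsed.netloc:
            links.append(resolved)
    return links


# The rewritten link quoted earlier in this thread, with its class attribute omitted.
snippet = (
    '<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522">'
    "Five Tyler Perry movies</a>"
)
page_url = "https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin"

print(extract_links_sketch(snippet))                 # no base: link is dropped
print(extract_links_sketch(snippet, base=page_url))  # base given: link is resolved
```

With the capture URL passed as the base, the relative href resolves to a full web.archive.org address. Without it, the link is silently lost, which matches the symptom described in this issue.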

ianmilligan1 commented 3 years ago

Fantastic stuff, @yxzhu16 – thanks so much for the pull request (and for the info on this too @schmika – much appreciated).