archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

Fix relative links extraction #504

Closed: yxzhu16 closed this 3 years ago

yxzhu16 commented 3 years ago

GitHub issue(s): #501

What does this Pull Request do?

Instead of adapting the Python UDF to include the base parameter, I think it makes sense to set the baseUri to src, so that relative links are resolved against the URL of the page they appear on. This is similar to what is done in https://github.com/archivesunleashed/aut/blob/00e816629a86d767ff9c11324963d2c8368b0a35/src/main/scala/io/archivesunleashed/matchbox/ExtractImageLinks.scala#L42
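
To illustrate the idea of using the page URL as the base, here is a minimal Python sketch built on the standard library's `urljoin`. It is not the toolkit's code: the `resolve_links` helper, the page URL, and the hrefs are all hypothetical and only show how a relative href behaves once a base URI is supplied.

```python
from urllib.parse import urljoin


def resolve_links(src, hrefs):
    """Resolve each href against src, the URL of the page it was found on.

    Hypothetical helper for illustration; not part of the toolkit.
    """
    return [urljoin(src, href) for href in hrefs]


# Hypothetical page URL and hrefs, chosen only to show the three common cases.
src = "http://www.example.com/shows/index.html"
hrefs = [
    "1990/setlist.html",          # relative to the page's directory
    "/about",                     # relative to the site root
    "http://other.example.org/",  # already absolute, left unchanged
]

resolved = resolve_links(src, hrefs)
for href, url in zip(hrefs, resolved):
    print(f"{href!r} -> {url}")
```

Without a base, the first two hrefs cannot be turned into absolute URLs, which is the situation this change is meant to avoid.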

I tested on Wayback Machine bbc pidgin pages and it works.

How should this be tested?

  1. Load the example WARC and extract links.
  2. Filter to rows whose src domain is "deadlists.com".
  3. The highlighted rows should show up; these are the links whose dest is a relative link in the source HTML. (Screenshot: Screen Shot 2020-10-08 at 4 10 39 PM)

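
The check in the steps above can be sketched in plain Python. This is only an illustration of the filtering logic, not the toolkit's API: the `rows` list stands in for extracted (src, dest) pairs, and its URLs and the helper names are made up for the example.

```python
from urllib.parse import urlparse

# Hypothetical (src, dest) pairs standing in for extracted links.
rows = [
    ("http://www.deadlists.com/index.html", "http://www.deadlists.com/deadlists/1970.html"),
    ("http://www.deadlists.com/index.html", "psilo.html"),
    ("http://www.archive.org/", "http://www.archive.org/about"),
]


def is_absolute(url):
    """Return True when the URL carries both a scheme and a host."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)


def unresolved_for_domain(rows, domain):
    """Collect rows whose src host matches domain and whose dest is not absolute."""
    matches = []
    for src, dest in rows:
        host = urlparse(src).netloc
        if (host == domain or host.endswith("." + domain)) and not is_absolute(dest):
            matches.append((src, dest))
    return matches


unresolved = unresolved_for_domain(rows, "deadlists.com")
print(unresolved)
```

If relative links are resolved against src during extraction, a check like this would be expected to return an empty list for the filtered domain.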
codecov[bot] commented 3 years ago

Codecov Report

Merging #504 into main will decrease coverage by 0.02%. The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main     #504      +/-   ##
============================================
- Coverage     88.85%   88.83%   -0.03%     
  Complexity       57       57              
============================================
  Files            43       43              
  Lines          1014     1012       -2     
  Branches         86       85       -1     
============================================
- Hits            901      899       -2     
  Misses           74       74              
  Partials         39       39