Closed yxzhu16 closed 3 years ago
Merging #504 into main will decrease coverage by 0.02%. The diff coverage is 100.00%.
@@ Coverage Diff @@
## main #504 +/- ##
============================================
- Coverage 88.85% 88.83% -0.03%
Complexity 57 57
============================================
Files 43 43
Lines 1014 1012 -2
Branches 86 85 -1
============================================
- Hits 901 899 -2
Misses 74 74
Partials 39 39
GitHub issue(s): #501
What does this Pull Request do?
Use `src` instead of `base` as the base URI when extracting links, and delete the `base` parameter.

The issue occurred because relative links cannot be extracted by `link.attr("abs:href")` when baseUri is not set. Looking through the code, the `base` parameter is never provided anywhere `ExtractLinks` is called, so the default value `""` is always used and baseUri is never set. However, baseUri is required to extract relative links.

Instead of adapting the Python UDF to include the `base` parameter, I think it makes sense to set the baseUri to `src`. This is similar to https://github.com/archivesunleashed/aut/blob/00e816629a86d767ff9c11324963d2c8368b0a35/src/main/scala/io/archivesunleashed/matchbox/ExtractImageLinks.scala#L42

I tested on Wayback Machine BBC Pidgin pages and it works.
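To illustrate the behaviour described above, here is a minimal sketch of how jsoup resolves `abs:href` with and without a base URI. It is not the project's `ExtractLinks` code: the `absoluteLinks` helper, the sample HTML, and the sample `src` URL are hypothetical, and it assumes jsoup is on the classpath and Scala 2.13 (on 2.12, `scala.collection.JavaConverters._` would take the place of the converters import).

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object BaseUriSketch {
  // Hypothetical helper for illustration only: returns the "abs:href" value
  // of every <a href> in `html`, with the document parsed against `baseUri`.
  def absoluteLinks(html: String, baseUri: String): Seq[String] = {
    val doc = Jsoup.parse(html, baseUri)
    doc.select("a[href]").asScala.map(_.attr("abs:href")).toSeq
  }

  def main(args: Array[String]): Unit = {
    // Made-up page fragment containing a single relative link.
    val html = """<a href="/pidgin/tori-1">story</a>"""
    // Made-up source URL standing in for the `src` of the crawled page.
    val src = "https://www.bbc.com/pidgin"

    // With an empty base URI, jsoup cannot make the link absolute
    // and returns an empty string for it.
    println(absoluteLinks(html, ""))

    // With `src` as the base URI, the relative link resolves against it.
    println(absoluteLinks(html, src))
  }
}
```

With the empty base URI the relative link comes back as an empty string, which matches the symptom in #501; passing the page's own URL as the base URI lets jsoup resolve it.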
`java.lang.StackOverflowError` because of the large number of links.
How should this be tested?