Closed yxzhu16 closed 3 years ago
Thanks for the issue @yxzhu16.
I looked at the source of the page, and here's the HTML where the missing link comes from:
<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522" class="Link-sc-1dvfmi3-5 StyledLink-sc-16i2p1z-2 fdDiSd">Five Tyler Perry movies wey make serious money for di Hollywood newest billionaire</a>
Is it possible our ExtractLinks use of jsoup isn't picking out those re-written links because they're non-traditional?
Hi,
I've recently come across the same issue, and I think it's because the link references a relative URL instead of an absolute one.
In the AUT Scala code, ExtractLinks can take 3 parameters:
* @param src the src link
* @param html the content from which links are to be extracted
* @param base an optional base URI
The base URI is required to resolve relative URLs using link.attr("abs:href"). So I think you have to specify a base URI to be able to extract all links.
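To illustrate why the base URI matters, here is a minimal Python sketch using only the standard library. It is not the AUT or jsoup implementation: the function name extract_links_sketch and its behavior of dropping links that cannot be made absolute are assumptions, chosen to mimic the symptom described in this issue.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin, urlparse


class _AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while we are inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links_sketch(html: str, base: Optional[str] = None):
    """Hypothetical helper: return (absolute URL, anchor text) pairs.

    Assumption: links that are still relative after resolution are dropped.
    """
    collector = _AnchorCollector()
    collector.feed(html)
    collector.close()
    resolved = []
    for href, text in collector.links:
        # Resolve against the base URI only when one was supplied.
        target = urljoin(base, href) if base else href
        if urlparse(target).scheme:
            resolved.append((target, text))
    return resolved


html = '<a href="/web/20200903005938/https://www.bbc.com/pidgin/world-54001522">Five Tyler Perry movies</a>'
base = "https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin"

print(extract_links_sketch(html))        # no base: the relative link is dropped
print(extract_links_sketch(html, base))  # with base: the link resolves under web.archive.org
```

Without a base, the rewritten Wayback link has no scheme or host, so it never becomes an absolute URL. Passing the capture URL as the base lets it resolve.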
At the moment, however, the Python UDF extract_links only expects 2 parameters, if I understand the code correctly. It may be necessary to adapt the Python UDF to include the base parameter.
Fantastic stuff, @yxzhu16 – thanks so much for the pull request (and for the info on this too @schmika – much appreciated).
Describe the bug
When extracting links from https://web.archive.org/web/20200903005938/https://www.bbc.com/pidgin, which is a Wayback Machine capture, only a few links show up and most of the useful links are missing.
To Reproduce
Steps to reproduce the behavior (e.g.):
Expected behavior
Around 100 hyperlinks should show up, including at least https://www.bbc.com/pidgin/world-54001522.
Screenshots
Environment information