Open jwarzech opened 9 years ago
I've also seen this bug. It's pretty annoying, and a PR would be much appreciated!
Just hit this myself. I would have to dig in to see if I could even figure out how to do a PR.
But basically I'd think you would want to:
I noticed that this also affects relative URLs.
This is how I did it for links and images:
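# Imports assumed; the original snippet did not show them.
from urllib.parse import urlparse
from lxml.html import tostring

def make_links_absolute(article):
    # Nothing to rewrite if no article HTML was extracted.
    if len(article.article_html) == 0:
        return ''
    top_node = article.top_node_article_html
    urlformat = urlparse(article.url)
    base = article.doc.base
    url_without_path = urlformat.scheme + "://" + urlformat.netloc
    if base:
        if urlparse(base).netloc:
            # The base is already an absolute URL.
            output_base = base
        else:
            # The base is only a path, so prefix the article's scheme and host.
            output_base = url_without_path + base
    else:
        # No base available, so fall back to the article's own URL.
        output_base = url_without_path + urlformat.path
    # lxml rewrites the links under top_node in place.
    top_node.make_links_absolute(output_base)
    return tostring(top_node, encoding='unicode')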
When extracting an article's HTML, I get a lot of broken image links if the source article uses relative image paths (common on many blogs).
Is there a way to have the root domain prepended to relative paths during extraction?
If I get a chance, I'll try to dive in and open a pull request for this functionality, but if someone has already done this or can get to it sooner, I'm sure everyone would find it very useful!
Thanks!
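For anyone picking this up, here is a minimal sketch of the resolution step itself, using only the standard library's `urljoin`. The article URL, the paths, and the `absolutize` helper are hypothetical and just for illustration; this is not the library's API. It shows why simply prepending the root domain is only correct for root-relative paths, while page-relative paths need to be resolved against the full article URL.

```python
from urllib.parse import urljoin

# Hypothetical article URL, for illustration only.
ARTICLE_URL = "https://example.com/blog/2015/post.html"


def absolutize(base_url, link):
    """Resolve a possibly relative link against the URL of the page it came from."""
    return urljoin(base_url, link)


# Root-relative path: resolved against the scheme and host only.
root_relative = absolutize(ARTICLE_URL, "/images/a.png")

# Page-relative path: resolved against the directory of the article.
page_relative = absolutize(ARTICLE_URL, "images/b.png")

# Parent-directory reference: ".." steps up from the article's directory.
parent_relative = absolutize(ARTICLE_URL, "../c.png")

# Already-absolute URL: returned unchanged.
already_absolute = absolutize(ARTICLE_URL, "https://cdn.example.org/d.png")

for url in (root_relative, page_relative, parent_relative, already_absolute):
    print(url)
```

Applying something like this to every `src` and `href` in the extracted HTML should give the behaviour asked for above, assuming the article URL is the right base for the page.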