codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:

https://goo.gl/VX41yK

MIT License

14.06k stars 2.11k forks

Broken image paths #140

Open jwarzech opened 9 years ago

jwarzech commented 9 years ago

When extracting an article's HTML, I get a lot of broken image links if the source article uses relative image paths (common on many blogs).

Is there a way to have the root domain prepended to relative paths during extraction?

If I get a chance, I'll try to dive in and open a pull request for this functionality, but if someone has already done this or can get to it sooner, I'm sure everyone would find it very useful!

Thanks!

codelucas commented 8 years ago

I've also seen this bug. It's pretty annoying, and a PR would be much appreciated!

thoraxe commented 6 years ago

Just hit this myself. I would have to dig in to see if I could even figure out how to do a PR.

But basically I'd think you would want to:

1. detect that the image path is relative
2. if relative, deconstruct the article's URL originally supplied to get the FQDN
3. construct the full path by adding the FQDN and the relative path
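
The steps above can be sketched with Python's standard `urllib.parse`. This is only a minimal illustration, not newspaper's code: `is_relative` and `absolutize` are hypothetical helper names, and the example URLs are made up. One deliberate difference from the steps as written: the sketch resolves against the full article URL with `urljoin` rather than prepending only the domain, because a path such as `img/a.png` is relative to the article's directory, not to the site root.

```python
from urllib.parse import urljoin, urlparse


def is_relative(url):
    """Hypothetical helper: treat a URL with no scheme as relative."""
    return not urlparse(url).scheme


def absolutize(image_src, article_url):
    """Hypothetical helper: resolve image_src against the article's URL."""
    if not is_relative(image_src):
        # Already absolute, so leave it untouched.
        return image_src
    # urljoin handles root-relative, directory-relative and "../" paths.
    return urljoin(article_url, image_src)


article_url = "https://example.com/blog/post.html"  # made-up example URL
root_relative = absolutize("/img/a.png", article_url)
dir_relative = absolutize("img/a.png", article_url)
already_absolute = absolutize("https://other.org/x.png", article_url)
```

With these inputs, the root-relative path resolves under the site root, the directory-relative path resolves under `/blog/`, and the absolute URL is returned unchanged.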

thoraxe commented 6 years ago

I noticed that this also affects relative link URLs, not just images.

mercuree commented 6 years ago

This is how I did it with links and images:

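from urllib.parse import urljoin, urlparse

from lxml.html import tostring  # assumed import: lxml's serializer


def make_links_absolute(article):
    """Return the article HTML with relative link and image URLs made absolute."""
    if len(article.article_html) == 0:
        return ''
    top_node = article.top_node_article_html
    # Base URI of the parsed document, if it has one
    base = article.doc.base
    if base:
        # urljoin keeps an absolute base as-is and resolves a relative one
        # against the article URL
        output_base = urljoin(article.url, base)
    else:
        # No base: fall back to the article URL without query or fragment
        urlformat = urlparse(article.url)
        output_base = urlformat.scheme + "://" + urlformat.netloc + urlformat.path

    top_node.make_links_absolute(output_base)
    return tostring(top_node, encoding='unicode')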
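
To see what the `make_links_absolute` call in the snippet above does on its own, here is a small standalone sketch using lxml directly, without newspaper. The HTML fragment and the base URL are made up for illustration; only `lxml.html.fromstring`, `make_links_absolute` and `lxml.html.tostring` are real lxml calls.

```python
from lxml import html

# Made-up fragment with one root-relative image and one directory-relative link.
fragment = html.fromstring(
    '<div><img src="/img/a.png"><a href="next.html">next</a></div>'
)

# Rewrite every link-like attribute (src, href, ...) against the base URL.
fragment.make_links_absolute("https://example.com/blog/post.html")

img_src = fragment.xpath(".//img/@src")[0]
link_href = fragment.xpath(".//a/@href")[0]
serialized = html.tostring(fragment, encoding="unicode")
print(serialized)
```

Because lxml rewrites both `src` and `href` attributes in one pass, the same call covers the image case from the original report and the relative link case mentioned later in the thread.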