codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs: https://goo.gl/VX41yK
MIT License

Broken image paths #140

Open jwarzech opened 9 years ago

jwarzech commented 9 years ago

When extracting an article's HTML, I get a lot of broken image links if the source article uses relative image paths (common on many blogs).

Is there a way to have the root domain prepended to relative paths during extraction?

If I get a chance I'll try to dive in and open a pull request for this functionality, but if someone has already done this or is able to get to it quicker, I'm sure everyone would find it very useful!

Thanks!
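
For illustration, the resolution being asked for can be sketched with the standard library's `urljoin`. The article URL and image paths below are made up for the example and are not taken from newspaper. Note that simply prepending the root domain would only be correct for root-relative paths such as `/static/logo.png`. A document-relative path such as `images/photo.png` has to be resolved against the article's own URL, which `urljoin` handles.

```python
from urllib.parse import urljoin

# Hypothetical article URL and image paths, used only for illustration.
article_url = "https://example.com/blog/2015/post.html"
image_srcs = [
    "images/photo.png",                 # relative to the article's directory
    "/static/logo.png",                 # relative to the site root
    "//cdn.example.com/a.jpg",          # protocol-relative
    "https://other.example.org/b.gif",  # already absolute, left unchanged
]

# Resolve each path against the URL of the page it was found on.
absolute_srcs = [urljoin(article_url, src) for src in image_srcs]

for src in absolute_srcs:
    print(src)
```

Resolving against the full article URL rather than just the scheme and host keeps document-relative paths pointing at the right directory.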

codelucas commented 8 years ago

I've also seen this bug, it's pretty annoying and a PR would be much appreciated!

thoraxe commented 6 years ago

Just hit this myself. I would have to dig in to see if I could even figure out how to do a PR.

But basically I'd think you would want to:

thoraxe commented 6 years ago

I noticed that this also affects relative URLs.

mercuree commented 6 years ago

This is how I did it for links and images:

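from urllib.parse import urlparse

from lxml.html import tostring


def make_links_absolute(article):
    # Nothing to rewrite if no article HTML was extracted.
    if not article.article_html:
        return ''
    # Node holding the extracted article HTML (attribute name as posted).
    top_node = article.top_node_article_html
    urlformat = urlparse(article.url)
    # Base URL recorded for the parsed document, if any.
    base = article.doc.base
    url_without_path = urlformat.scheme + "://" + urlformat.netloc
    if base:
        if urlparse(base).netloc:
            # The base is already an absolute URL.
            output_base = base
        else:
            # The base is only a path; prepend scheme and host.
            output_base = url_without_path + base
    else:
        # No base: fall back to the article's own URL (query string dropped).
        output_base = url_without_path + urlformat.path

    # Rewrite every link and image URL under top_node against output_base.
    top_node.make_links_absolute(output_base)
    return tostring(top_node, encoding='unicode')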
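
The base-selection branching in the function above could arguably be collapsed into a single `urljoin` call. The sketch below is standard library only and does not touch newspaper or lxml. `resolve_base` and its parameters are hypothetical names introduced for this example. It also covers a base given as a relative path without a leading slash, which plain string concatenation would glue directly onto the host name.

```python
from urllib.parse import urljoin


def resolve_base(page_url, base_href=None):
    """Return the URL that relative links on the page should resolve against.

    page_url  -- URL the page was fetched from (hypothetical parameter)
    base_href -- declared base URL or path, if the page has one
    """
    if not base_href:
        # No declared base: relative links resolve against the page itself.
        return page_url
    # urljoin leaves an absolute base untouched and resolves a relative one
    # (with or without a leading slash) against the page URL.
    return urljoin(page_url, base_href)


page = "https://example.com/blog/post.html"  # made-up URL for illustration
no_base = resolve_base(page)
root_relative = resolve_base(page, "/static/")
doc_relative = resolve_base(page, "assets/")
absolute = resolve_base(page, "https://cdn.example.com/img/")
```

The returned value can then be passed wherever a base URL is expected, for example as the argument to lxml's `make_links_absolute`.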