codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.06k stars, 2.11k forks

Newspaper with local files #753

Open Cristinutaa opened 4 years ago

Cristinutaa commented 4 years ago

I'm trying to test the following function:

from newspaper import Article


def _get_main_content_from_url(url: str) -> str:
    """
    Scrape and parse textual content from a web resource.

    This method uses Article from the Newspaper3k library to download and
    parse HTML from the web resource. It uses heuristics to scrape the main
    body of visible text.

    :param url: Uniform Resource Locator.
    :return: Scraped content of the web resource.
    """
    try:
        article = Article(url)
        article.download()
        article.parse()
        return article.text.lower().replace('\n', ' ')
    except Exception:
        # Any download or parse failure is reported and yields an empty string.
        print('url failed: {}'.format(url))
        return ''

For this, I have a local index.html file. When I pass url = "file://path/to/html/index.html" to my function, I get:

newspaper.article.ArticleException: Article `download()` failed with No connection adapters were found for 'file://path/to/html/index.html' on URL file://path/to/html/index.html

I've read that requests only supports http and https, but the newspaper library's own test suite uses local files. How does that work?

iwpnd commented 4 years ago

If you want to parse text from an HTML file on your system, I suggest you do this:

from newspaper import Article

with open('path/to/yourfile', 'r') as f:
    html = f.read()

article = Article(url='yoururl')
# Pass the HTML directly so no network request is made.
article.download(input_html=html)
article.parse()
print(article.text)
>> your article text

File URIs as input are not (yet?) supported.
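
Applied to the function from the question, that workaround could look roughly like the sketch below. The names read_local_html and get_main_content are made up for illustration and are not part of newspaper. The sketch also assumes that Article accepts a file:// URL in its constructor as long as the HTML is supplied through input_html; if it does not, pass a placeholder URL as in the snippet above. Note that with only two slashes, as in "file://path/to/html/index.html", a URL parser treats "path" as the host, so a local file normally needs the three-slash form "file:///path/to/html/index.html".

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def read_local_html(url: str) -> str:
    """Return the contents of the file that a file:// URL points to.

    Hypothetical helper, not part of newspaper.
    """
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError("expected a file:// URL, got: {}".format(url))
    # url2pathname converts the URL path into a local filesystem path.
    return Path(url2pathname(parsed.path)).read_text(encoding="utf-8")


def get_main_content(url: str) -> str:
    """Extract the main text from a local file:// URL or a remote URL."""
    # Imported here so read_local_html can be used without newspaper installed.
    from newspaper import Article

    article = Article(url)
    if urlparse(url).scheme == "file":
        # Skip the HTTP fetch and hand the HTML over directly.
        article.download(input_html=read_local_html(url))
    else:
        article.download()
    article.parse()
    return article.text.lower().replace("\n", " ")
```

In a test you can build the URL with pathlib, for example Path("index.html").resolve().as_uri(), and pass the result to get_main_content.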

bilalghanem commented 4 years ago

If you're scraping from the web and you get the same error, check the link prefix. Does it start with http:// or https://, or only with www?
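
A quick way to check the prefix before handing a URL to Article is sketched below. The function name has_http_scheme is made up for illustration, and the check uses only the standard library.

```python
from urllib.parse import urlparse


def has_http_scheme(url: str) -> bool:
    """Return True if the URL starts with an http or https scheme."""
    return urlparse(url).scheme in ("http", "https")


for candidate in ("https://example.com/a", "www.example.com/a", "file:///tmp/index.html"):
    print(candidate, has_http_scheme(candidate))
```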