codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Not getting all article links after re-running the code? #943

Closed · Aliktk closed this issue 2 years ago

Aliktk commented 2 years ago

Hello everyone. I am scraping a website to get the news from its "latest" tab. I followed the code and at first it worked perfectly, but after cleaning up the code and running it again it returned nothing at all. Here is the code:

First I scrape the tab link, then pass that link to newspaper's build() to get the article links:

import re
import requests
import newspaper
from bs4 import BeautifulSoup

# function to extract the HTML document from a given URL
def getHTMLdocument(url):
    # request the HTML document of the given url
    response = requests.get(url)
    # the response body is returned as raw HTML text
    return response.text

# create a BeautifulSoup object from the HTML document
def b_soup(html_document):
    soup = BeautifulSoup(html_document, 'html.parser')
    return soup

def scrap_links(link):
    urls = []
    # extract all article links from the url
    news_paper = newspaper.build(link)
    for article in news_paper.articles:
        urls.append(article.url)
    return urls

url_to_scrape = 'https://www.thenews.com.pk/'
# fetch the HTML document
html_document = getHTMLdocument(url_to_scrape)
soup = b_soup(html_document)
# find the first <a> tag whose "href" attribute
# contains "latest-stories"
links = []
latest = soup.find_all('a', attrs={'href': re.compile("latest-stories")})[0]
links.append(latest.get('href'))
latest_news = links[0]
# scrape the article links using newspaper
latest_urls = scrap_links(latest_news)

It returns:

https://www.thenews.com.pk/latest-stories
[]

Is there some limit that prevents refreshing the links for scraping, or is something else wrong? Thank you.

johnbumgarner commented 2 years ago

The function scrap_links() is incorrect; build() isn't used that way. See my overview document on how to use build().
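
For reference, a minimal sketch of the usual build() pattern (the URL is the one from your snippet; memoize_articles=False is my assumption about the root cause, since build() caches articles it has already seen and a second run with the defaults can therefore return an empty list):

import newspaper

# memoize_articles=False disables the cache of previously seen
# articles, so repeated runs keep returning the full link list
paper = newspaper.build('https://www.thenews.com.pk/latest-stories',
                        memoize_articles=False)

# build() populates paper.articles with Article objects; each one
# already carries its url, no download() or parse() needed
article_links = [article.url for article in paper.articles]
print(article_links)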

Aliktk commented 2 years ago

Thank you @johnbumgarner for your reply; it's solved as you stated.

I used the chunk of code below and it returns the links of all the articles.

import newspaper
from newspaper import Article, Config

def scrap_links(link):
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    # spoof a browser user agent and set a request timeout
    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10
    article_urls = set()
    # memoize_articles=False stops newspaper from skipping
    # articles it already saw on a previous run
    marketwatch = newspaper.build(
        link, config=config, memoize_articles=False, language='en')
    for sub_article in marketwatch.articles:
        article = Article(sub_article.url, config=config,
                          memoize_articles=False, language='en')
        article.download()
        article.parse()
        # the set deduplicates repeated URLs automatically
        article_urls.add(article.url)
    return article_urls
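
For example, calling it with the section link from my earlier snippet prints every article URL (illustrative usage, not part of the original run):

latest_urls = scrap_links('https://www.thenews.com.pk/latest-stories')
for url in sorted(latest_urls):
    print(url)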

johnbumgarner commented 2 years ago

You're welcome @Aliktk. Happy coding.