LuChang-CS / news-crawler

A news crawler for BBC News, Reuters and New York Times.

How to crawl news by keywords? #17

Closed awxiaoxian2020 closed 2 years ago

awxiaoxian2020 commented 2 years ago

I want to crawl news articles that contain specific keywords.

LuChang-CS commented 2 years ago

Could you please provide an example of keywords you want to crawl (news link, keywords position, or meta tag)?

awxiaoxian2020 commented 2 years ago

I mean keywords that appear in the article text. For example, I want news about "enterprise", so I think selecting articles by keyword is a good approach.

LuChang-CS commented 2 years ago

There are several ways to find keywords in news articles:

  1. If news providers specify keywords in meta tags in their HTML, you can parse them directly with BeautifulSoup. For example, NYT has a meta tag <meta data-rh="true" name="news_keywords" content="Chernobyl,Nuclear energy,Ukraine,Belarus,Russia,Military,War;Armed Conflicts"/> in this article: https://www.nytimes.com/2022/01/22/world/europe/chernobyl-ukraine-invasion-russia.html. But this depends on the news provider; BBC does not have this tag. So if you want to parse keywords from NYT, I can update the code and add it.
  2. If you want to automatically extract keywords from text, you may want to use algorithms like TF-IDF. Please also refer to this blog: https://medium.com/mlearning-ai/10-popular-keyword-extraction-algorithms-in-natural-language-processing-8975ada5750c. In this case, I cannot do more to help you.
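
As a rough illustration of option 1, here is a minimal sketch that reads the `news_keywords` meta tag from a page. It is not code from this repository. To stay runnable without extra packages it uses Python's standard `html.parser` in place of BeautifulSoup; with BeautifulSoup the equivalent lookup would be `soup.find("meta", attrs={"name": "news_keywords"})`. The names `NewsKeywordsParser` and `extract_news_keywords` are hypothetical, and splitting on commas is an assumption based on the NYT tag quoted above.

```python
from html.parser import HTMLParser


class NewsKeywordsParser(HTMLParser):
    """Collects keywords from <meta name="news_keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == "news_keywords":
            content = attr_map.get("content") or ""
            # Assumption: keywords are comma-separated, as in the NYT example.
            self.keywords.extend(
                part.strip() for part in content.split(",") if part.strip()
            )


def extract_news_keywords(html):
    """Return the list of keywords declared in the page, or [] if none."""
    parser = NewsKeywordsParser()
    parser.feed(html)
    return parser.keywords


sample = (
    '<html><head><meta data-rh="true" name="news_keywords" '
    'content="Chernobyl,Nuclear energy,Ukraine,Belarus,Russia,Military,'
    'War;Armed Conflicts"/></head></html>'
)
print(extract_news_keywords(sample))
```

A page without the tag, which according to this thread is the case for BBC, simply yields an empty list.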

In either case, it is possible that the extracted keywords in an article are different from what you expect. So please think twice before making decisions.
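
For option 2, a minimal TF-IDF sketch in plain Python may help show why the extracted keywords can differ from what you expect. This is only one simple variant (term frequency times `log(N / document frequency)`, with a naive lowercase tokenizer), not the method of any particular library, and the names `tokenize` and `top_keywords` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:k])
    return results


docs = [
    "The enterprise market grew",
    "The weather was mild",
    "The election was close",
]
print(top_keywords(docs, k=2))
```

Terms that occur in every document (here "the") score zero, while ties among rare terms are broken arbitrarily, so the ranking depends heavily on the corpus and the tokenizer.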

awxiaoxian2020 commented 2 years ago

Thank you very much indeed. I mean that I want to crawl articles by keyword: I would like the crawler to have a filter that drops articles that do not contain the keywords, instead of giving me all articles.

In fact, I need a corpus of news reports containing certain keywords for sentiment analysis and similar tasks.

Anyway, thanks for giving me so many solutions.

LuChang-CS commented 2 years ago

I understand your need for filtering. However, I don't think letting the crawler do the filtering is a good approach, because with or without keyword filters the crawler needs to fetch all articles. I recommend doing the filtering offline so that each module can focus on its own function, i.e. decoupling.
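
A minimal sketch of such an offline filter, assuming the crawled articles have already been loaded as dictionaries with `title` and `content` fields. That layout is an assumption for illustration, not necessarily this crawler's output format, and `contains_keyword` and `filter_articles` are hypothetical names.

```python
import re


def contains_keyword(text, keywords):
    """True if any keyword occurs in text as a whole word, ignoring case."""
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


def filter_articles(articles, keywords):
    """Keep only the articles whose title or content mentions a keyword."""
    kept = []
    for article in articles:
        # Assumed fields; missing ones are treated as empty strings.
        text = article.get("title", "") + " " + article.get("content", "")
        if contains_keyword(text, keywords):
            kept.append(article)
    return kept


articles = [
    {"title": "Enterprise software spending rises", "content": "Budgets grew."},
    {"title": "Weekend weather", "content": "Rain is expected."},
]
print(filter_articles(articles, ["enterprise"]))
```

Because the match is on whole words, "enterprise" does not match "enterprises"; list the variants you need, or relax the pattern, depending on how strict the corpus should be.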

awxiaoxian2020 commented 2 years ago

Your advice is very useful for me as a beginner. If I crawl all the articles first and then filter them, the results may be more accurate. "Decoupling" is magic. Thank you very much indeed. Best wishes!