Scraping News Headlines with the BeautifulSoup Library

Development environment

macOS
Python 3.5

Goal: use the BeautifulSoup library to fetch a news headline.

urllib: URL handling modules https://docs.python.org/zh-cn/3.6/library/urllib.html Example: html = urlopen("https://readhub.cn/topics")
html.parser: the HTML parser that ships with Python https://docs.python.org/zh-cn/3/library/html.parser.html Another commonly used HTML parser is lxml.
BeautifulSoup is a Python library for extracting data from HTML or XML files https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
- The BeautifulSoup object: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id11 Example: soup = BeautifulSoup('<h1 class="main-title">一线城市二手房价格失守</h1>', "html.parser")
- The find_all() method: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#find-all Example: title = bsObj.find_all("h1", {"class": "main-title"}) It returns a list of every matching tag; the list is empty when nothing matches.
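
To make the parsing step less abstract, here is a minimal sketch that uses only the standard-library html.parser to collect the text of every <h1 class="main-title"> element. It is not the approach taken in this post, which relies on BeautifulSoup; it is roughly the kind of work find_all() saves you from writing by hand. The class name HeadlineParser and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text inside every <h1 class="main-title"> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False  # True while inside a matching <h1>
        self.headlines = []       # text of each matching <h1>, in page order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and "main-title" in classes:
            self.in_headline = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_headline = False

    def handle_data(self, data):
        # Only keep text that appears inside a matching <h1>.
        if self.in_headline:
            self.headlines[-1] += data


# Hypothetical sample markup standing in for a downloaded page.
sample_html = '<div><h1 class="main-title">Sample headline</h1><h1>Other</h1></div>'

parser = HeadlineParser()
parser.feed(sample_html)
parser.close()
print(parser.headlines)
```

With BeautifulSoup the same selection is the single find_all("h1", {"class": "main-title"}) call used in the script that follows.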

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

"""
Fetch a news headline with the BeautifulSoup library.

BeautifulSoup is a Python library for extracting data from HTML or XML files.
"""

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

# html = urlopen("https://news.sina.com.cn/c/2019-01-26/doc-ihqfskcp0504445.shtml")
# bsObj = BeautifulSoup(html)
# print(bsObj.h1)

# Catch exceptions to make the code more robust.
# Wrap the logic in a general-purpose function so it can be reused.
def getTitle(url):
    try:
        html = urlopen(url)
    # HTTP errors: "404 Page Not Found", "500 Internal Server Error", and so on.
    # In all such cases urlopen raises an HTTPError.
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        # title = bsObj.h1
        title = bsObj.find_all("h1", {"class": "main-title"})
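    # With attribute-style access such as bsObj.h1, BeautifulSoup returns None
    # when the tag does not exist, and accessing a child tag on that None
    # raises AttributeError. find_all() itself returns an empty list instead.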
    except AttributeError as e:
        return None
    return title

title = getTitle("https://news.sina.com.cn/c/2019-01-26/doc-ihqfskcp0504445.shtml")

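# find_all() returns an empty list when nothing matches, so check for
# both None (the request failed) and an empty result.
if not title:
    print("Article not found")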
else:
    print(title)

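Note: if you change bsObj = BeautifulSoup(html, "html.parser") in the code above to bsObj = BeautifulSoup(html), BeautifulSoup emits the following warning. In short, it asks you to name a parser explicitly; otherwise it falls back to the best HTML parser available on the system, which here is "html.parser" (the parser that ships with Python). On someone else's machine the fallback may be a different parser, such as lxml, and the same code may behave differently, so it is best to specify the parser.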

[Image: bs-zsh]

[Image: bsobj]
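
The getTitle function above only catches HTTPError, which covers cases where the server responds with an error status. When the server cannot be reached at all, urlopen raises URLError instead, and HTTPError is a subclass of URLError. The sketch below shows one way to handle both. The fetch function, its opener parameter, and the two stand-in openers are hypothetical names introduced here so the example runs without network access; they are not part of the original script.

```python
import io
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request fails.

    `opener` is a hypothetical hook added for this sketch so the function
    can be exercised offline; it defaults to urlopen.
    """
    try:
        with opener(url) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # This clause must come first because HTTPError subclasses URLError.
        print("HTTP error:", e.code)
    except URLError as e:
        # No usable answer at all: DNS failure, refused connection, and so on.
        print("Could not reach server:", e.reason)
    return None


def always_404(url):
    # Stand-in opener (an assumption of this sketch) that mimics a missing page.
    raise HTTPError(url, 404, "Not Found", None, io.BytesIO())


def unreachable(url):
    # Stand-in opener that mimics a host name that cannot be resolved.
    raise URLError("name resolution failed")


result_404 = fetch("https://example.invalid/missing", opener=always_404)
result_down = fetch("https://example.invalid/", opener=unreachable)
print(result_404, result_down)
```

Both calls return None, so a caller can keep the same "if the result is empty, report that the article was not found" check used in this post.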

Goal: scrape every news headline on the https://readhub.cn/topics page.

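re: regular expression (regex) operations https://docs.python.org/zh-cn/3.6/library/re.html
- Strip HTML tags with a regular expression https://docs.python.org/zh-cn/3.6/library/re.html#re.sub Syntax: re.sub(pattern, repl, string, count=0, flags=0) Returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with repl; with the default count=0, every occurrence is replaced. If the pattern is not found, string is returned unchanged. repl can be a string or a function. Example: titleStr = re.sub('<.*?>','',titleStr) See also: https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string
Converting a list to a string
- Syntax: s = ''.join(str(e) for e in items) (avoid naming your own variables str or list, which would shadow the built-ins)
get_text() method: returns all the text inside a tag, including the text of its descendant tags, as a Unicode string. https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#get-text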
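
The join and re.sub steps listed above can be tried on their own, without fetching a page. The following is a small sketch under stated assumptions: the two strings in tags are made-up stand-ins for the tags that find_all() would return, and the class name in them is shortened for readability.

```python
import re

# Hypothetical stand-ins for the tags that find_all() would return.
tags = [
    '<h2 class="topicTitle"><a href="/topic/1">First headline</a></h2>',
    '<h2 class="topicTitle"><a href="/topic/2">Second headline</a></h2>',
]

# Join the items into one string, one tag per line.
joined = '\n'.join(str(tag) for tag in tags)

# <.*?> is non-greedy: each match stops at the nearest ">", so every tag is
# removed separately and the text between the tags survives.
stripped = re.sub('<.*?>', '', joined)
print(stripped)

# count=1 limits the substitution to the first match only.
first_only = re.sub('<.*?>', '', joined, count=1)

# A greedy <.*> runs from the first "<" to the last ">" on each line
# (. does not match a newline by default), so the text is deleted as well.
greedy = re.sub('<.*>', '', joined)
```

The regular expression is a quick fix that suits simple markup like this. The get_text() method mentioned above is the BeautifulSoup way of getting the same text.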

import re
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

# html = urlopen("https://readhub.cn/topics")
# bsObj = BeautifulSoup(html)
# print(bsObj.h2.get_text())  # only gets the first h2 headline

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        title = bsObj.find_all("h2", {"class": "topicTitle___1HWIA"})
    except AttributeError as e:
        return None
    return title

title = getTitle("https://readhub.cn/topics")

if not title:  # None (request failed) or an empty list (no match)
    print("Article not found")
else:
    # print(title)
    # Convert the list to a string: s = ''.join(str(e) for e in items)
    titleStr = '\n'.join(str(e) for e in title)
    # Strip HTML tags with a regular expression.
    # For the meaning of <.*?>, search for *? in https://docs.python.org/zh-cn/3.6/library/re.html
    titleStr = re.sub('<.*?>', '', titleStr)
    print(titleStr)

Output:

[Image: re-delete-tag]
