Qingquan-Li / blog

My Blog
https://Qingquan-Li.github.io/blog/

Scraping News Headlines with the BeautifulSoup Library #108

Open Qingquan-Li opened 5 years ago

Qingquan-Li commented 5 years ago

Development environment


Requirement: use the BeautifulSoup library to get a news headline.


#!/usr/bin/python
# -*- coding: UTF-8 -*-

"""
Get a news headline with the BeautifulSoup library.

BeautifulSoup is a Python library for extracting data from HTML and XML files.
"""

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

# html = urlopen("https://news.sina.com.cn/c/2019-01-26/doc-ihqfskcp0504445.shtml")
# bsObj = BeautifulSoup(html)
# print(bsObj.h1)

# Catch exceptions to make the code more robust.
# Wrap the logic in a general-purpose function so it can be reused.
def getTitle(url):
    try:
        html = urlopen(url)
    # HTTP errors: 404 Page Not Found, 500 Internal Server Error, and so on.
    # In all such cases, urlopen raises an HTTPError exception.
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        # title = bsObj.h1
        title = bsObj.find_all("h1", {"class": "main-title"})
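    # With attribute access such as bsObj.h1, BeautifulSoup returns None when the tag does not exist,
    # and accessing a child tag of that None object raises AttributeError.
    # find_all() itself returns an empty list when nothing matches.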
    except AttributeError as e:
        return None
    return title

title = getTitle("https://news.sina.com.cn/c/2019-01-26/doc-ihqfskcp0504445.shtml")

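# find_all() returns an empty list when nothing matches, so test for emptiness as well as None.
if not title:
    print("Article not found")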
else:
    print(title)
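
The comments in getTitle describe when BeautifulSoup returns None and when AttributeError appears. The following is a minimal sketch of that behavior, assuming a small in-memory HTML string (SAMPLE_HTML is a made-up sample, not the Sina page) so that it runs without network access:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: it contains an <h1> but no <h2>.
SAMPLE_HTML = '<html><body><h1 class="main-title">Hello</h1></body></html>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Attribute access and find() return None when nothing matches.
missing_tag = soup.h2
missing_find = soup.find("h2")

# find_all() returns an empty, list-like result instead of None.
missing_all = soup.find_all("h2", {"class": "main-title"})
found_all = soup.find_all("h1", {"class": "main-title"})

# Reaching into a missing tag is what raises AttributeError.
try:
    soup.h2.span
    raised = False
except AttributeError:
    raised = True

print(missing_tag, missing_find, list(missing_all), len(found_all), raised)
```

Because find_all() never returns None, a check such as "if not title" covers both the None returned after an HTTPError and the empty result of a selector that matched nothing.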


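Note: in the code above, changing bsObj = BeautifulSoup(html, "html.parser") to bsObj = BeautifulSoup(html) produces the warning shown below. Roughly, it asks you to name a parser explicitly; otherwise BeautifulSoup picks the best parser available on the current system, which here is "html.parser" (the HTML parser that ships with Python). Someone else running the same code may end up with a different parser, so it is best to specify one.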

[Image: bs-zsh]


[Image: bsobj]
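
One way to keep the parser choice explicit everywhere is to name it in a single helper. This is only a sketch: make_soup and DEFAULT_PARSER are hypothetical names introduced here, and the markup is a made-up sample rather than a real page.

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" is used because it ships with Python
# and needs no extra installation.
DEFAULT_PARSER = "html.parser"


def make_soup(markup, parser=DEFAULT_PARSER):
    """Build a BeautifulSoup object with an explicitly named parser."""
    return BeautifulSoup(markup, parser)


soup = make_soup("<html><body><h1>Sample headline</h1></body></html>")
headline = soup.h1.get_text()
print(headline)
```

With the parser named in one place, the same code parses the same way on every machine, regardless of which optional parsers happen to be installed there.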



Requirement: scrape all news headlines from the https://readhub.cn/topics page.


import re
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

# html = urlopen("https://readhub.cn/topics")
# bsObj = BeautifulSoup(html)
# print(bsObj.h2.get_text())  # Gets only the first h2 headline

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        title = bsObj.find_all("h2", {"class": "topicTitle___1HWIA"})
    except AttributeError as e:
        return None
    return title

title = getTitle("https://readhub.cn/topics")

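# find_all() returns an empty list when nothing matches, so test for emptiness as well as None.
if not title:
    print("Article not found")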
else:
    # print(title)
    # Convert a list to a string: str = ''.join(str(e) for e in list)
    titleStr = '\n'.join(str(e) for e in title)
    # Strip the HTML tags with a regular expression.
    # For the meaning of <.*?>, see https://docs.python.org/zh-cn/3.6/library/re.html and search for *?
    titleStr = re.sub('<.*?>', '', titleStr)
    print(titleStr)


Output:


[Image: re-delete-tag]
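
The pattern <.*?> works because *? is non-greedy. The sketch below contrasts it with the greedy <.*> and shows get_text() as an alternative that needs no regular expression. The markup and the class name "topic-title" are invented for illustration; they are not the real Readhub page structure.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating two matched headline tags.
SAMPLE_HTML = (
    '<h2 class="topic-title"><a href="/a">First headline</a></h2>'
    '<h2 class="topic-title">Second headline</h2>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tags = soup.find_all("h2", {"class": "topic-title"})

# Same approach as the script above: join the tags into one string.
joined = "\n".join(str(tag) for tag in tags)

# Non-greedy: each match stops at the nearest ">", so only the tags go away.
non_greedy = re.sub(r"<.*?>", "", joined)

# Greedy: on each line the match runs from the first "<" to the last ">",
# so the headline text is removed together with the tags.
greedy = re.sub(r"<.*>", "", joined)

# get_text() asks BeautifulSoup for the text directly.
via_get_text = "\n".join(tag.get_text(strip=True) for tag in tags)

print(non_greedy)
print(repr(greedy))
print(via_get_text)
```

get_text() is usually the safer choice, since a regular expression can misfire on text that itself contains "<" or ">".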