Zacharia2 commented 1 year ago

isinstance(chapter, tuple):

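When an entry is a tuple, it has child entries. The tuple's first element is the data for the current level; the second element is the data for the next level, and is also the entry point for descending into it.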
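A minimal sketch of walking such a nested table of contents recursively. The `Entry` class is a hypothetical stand-in for the library's TOC objects (assumed here to expose `title` and `href`), and `walk_toc` and `sample_toc` are names made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for a TOC entry with a title and an href."""
    title: str
    href: str = ""


def walk_toc(toc, depth=0):
    """Yield (depth, title, href) for every entry in a nested TOC."""
    for chapter in toc:
        if isinstance(chapter, tuple):
            # First element: data for this level; second: the child entries.
            section, children = chapter
            yield depth, section.title, getattr(section, "href", "")
            yield from walk_toc(children, depth + 1)
        else:
            yield depth, chapter.title, getattr(chapter, "href", "")


# Made-up sample data: one plain entry and one entry with children.
sample_toc = [
    Entry("Preface", "Text/preface.xhtml"),
    (Entry("Part I"), [
        Entry("Chapter 1", "Text/ch1.xhtml"),
        Entry("Chapter 2", "Text/ch2.xhtml"),
    ]),
]

rows = list(walk_toc(sample_toc))
for depth, title, href in rows:
    print("  " * depth + title, href)
```

With a real book, the same function should be usable on `book.toc`, assuming its entries follow the tuple layout described above.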

Zacharia2 commented 1 year ago

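Put simply, parsing an EPUB e-book with Python and extracting the data you need takes just two steps:

Use EbookLib to open the EPUB file and extract its text content; then use Beautiful Soup to parse that content and extract the data.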

pip install EbookLib
pip install beautifulsoup4

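Load the e-book

import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

book = epub.read_epub(book_path)

Parse

for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        # EPUB content is HTML/XHTML, which BeautifulSoup parses well
        soup = BeautifulSoup(item.get_content(), 'html.parser')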

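From there, you can use BeautifulSoup to parse the content and extract the data you need.

For more on EbookLib usage, see the official documentation:

BeautifulSoup official documentation: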
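As a rough illustration of that extraction step, the sketch below pulls a title and paragraph text out of one document. The `content` bytes are made-up sample markup standing in for what `item.get_content()` returns, and the tags chosen (`title`, `p`) are only an assumption about what a given book contains:

```python
from bs4 import BeautifulSoup

# Made-up sample markup; in practice this would be item.get_content().
content = (
    b"<html><head><title>Chapter 1</title></head>"
    b"<body><h1>Chapter 1</h1>"
    b"<p>First paragraph.</p><p>Second paragraph.</p>"
    b"</body></html>"
)

soup = BeautifulSoup(content, "html.parser")

# Document title, if the page declares one.
title = soup.title.get_text(strip=True) if soup.title else ""

# Text of every paragraph, with surrounding whitespace stripped.
paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

print(title)
print(paragraphs)
```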

Zacharia2 commented 1 year ago

Anchor: href: 'Text/Section0001_0012.xhtml#toc_1'

An href like this one cannot be looked up.

The book does contain this page, Text/Section0001_0012.xhtml, but because of the anchor, EbookLib did not find it.
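One possible workaround, assuming the lookup fails only because of the `#toc_1` fragment, is to strip the fragment before looking the page up. The sketch below uses the standard library's `urldefrag`; `split_href` and `items_by_href` are hypothetical names, and the dict is just a stand-in for the book's items:

```python
from urllib.parse import urldefrag


def split_href(href):
    """Split a TOC href into (file path, anchor fragment)."""
    path, fragment = urldefrag(href)
    return path, fragment


# Hypothetical stand-in for the book's items, keyed by file path.
items_by_href = {"Text/Section0001_0012.xhtml": "<page 12>"}

path, anchor = split_href("Text/Section0001_0012.xhtml#toc_1")
found = items_by_href.get(path)

print(path, anchor, found)
```

With EbookLib itself, the stripped `path` could then be passed to `book.get_item_with_href(path)`; the `anchor` part is only needed afterwards, to locate the element with that id inside the page.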

Zacharia2 commented 1 year ago

ok