[Feature Request]: Split PDF by chapter

pepijnolivier commented 1 month ago

Feature Description

Detect chapters by finding and interpreting a table of contents.
Split the source PDF into multiple PDFs: one per chapter. The table of contents should also have its own PDF.
Optionally label the output files as well, e.g. 0. Table of contents, 1. Introduction
Ideally there would be a configurable option for how deep in the chapter tree to split:
- levels: 1 would only split top-level chapters
- levels: 2 would split subchapters as well, e.g. 1.1. Introduction - Installation, 1.2. Introduction - Getting started
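
To make the levels option concrete, here is a minimal sketch of how it could filter table-of-contents entries. The names `TocEntry`, `depth`, and `filter_by_levels` are hypothetical and not part of Stirling-PDF; the sketch also assumes each entry carries a dotted section number such as "1" or "1.2".

```python
from typing import NamedTuple


class TocEntry(NamedTuple):
    """One table-of-contents line (hypothetical structure for illustration)."""
    number: str  # section number as printed, e.g. "1" or "1.2"
    title: str
    page: int


def depth(entry: TocEntry) -> int:
    # "1" -> 1, "1.2" -> 2, "1.2.3" -> 3; a trailing dot ("1.") is ignored.
    return entry.number.strip(".").count(".") + 1


def filter_by_levels(entries: list[TocEntry], levels: int) -> list[TocEntry]:
    # Keep only the entries whose nesting depth is within the requested level.
    return [e for e in entries if depth(e) <= levels]


entries = [
    TocEntry("1", "Introduction", 3),
    TocEntry("1.1", "Installation", 4),
    TocEntry("1.2", "Getting started", 6),
    TocEntry("2", "Usage", 9),
]
top_level = filter_by_levels(entries, levels=1)   # chapters 1 and 2 only
with_subs = filter_by_levels(entries, levels=2)   # all four entries
```

With levels: 1 the split points would be the start pages of "Introduction" and "Usage" only; with levels: 2 every subchapter start becomes a split point too.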

Why is this feature valuable?

This could be useful for many purposes:

Splitting a huge document into chapters could help teachers provide subsets of the material to their students
A folder of chapter files can be easier to search and scan than one large file
Document indexing and search, such as Elasticsearch or Azure Cognitive Search
- If a huge document is split into chapters, best-match searches become far more meaningful. This is also better than splitting the document into pages, because within a chapter we keep the context of that chapter.

Suggested Implementation

Either be really fancy and auto-detect the table of contents
Or let the user indicate that there is a table of contents and specify its page numbers
Interpret each line of the table of contents: usually the title is on the left and the page number on the right.
Build a map of the table of contents and let the user confirm it is correct before continuing
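
As a rough illustration of the line-interpretation step, the sketch below parses already-extracted table-of-contents text into (title, page) pairs. It assumes the title is on the left, the page number is the last token on the line, and the two are separated by spaces or dot leaders. `TOC_LINE` and `parse_toc` are hypothetical names; real documents would likely need more patterns.

```python
import re

# Title on the left, then spaces and/or dot leaders, then the page number.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.]+(?P<page>\d+)\s*$")


def parse_toc(text: str) -> list[tuple[str, int]]:
    """Return (title, page) for every line that looks like a TOC entry."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line.strip())
        if match:  # lines without a trailing page number are skipped
            entries.append((match.group("title").strip(), int(match.group("page"))))
    return entries


sample = """Table of contents
1. Introduction ........ 3
1.1 Installation 4

2. Usage . . . 9"""
toc_map = parse_toc(sample)
```

The resulting list is the "map" that could be shown to the user for confirmation before any splitting happens.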

Additional Information

To be tested on huge and official documents

No Duplicate of the Feature

[X] I have verified that there are no existing feature requests similar to my request.

sbplat commented 2 weeks ago

I think it would be more suitable for this to be 2 separate steps. First, extract the page numbers from the TOC, and then split the PDF using "Split PDF". For extracting the page numbers, maybe we could have a feature that runs a regex on the text of some page number(s) and outputs the matches. It could include some common expressions as well to make it easier.
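
A minimal sketch of that two-step idea, working on plain text that has already been extracted from the TOC page(s). The function names, the `offset` parameter, and the (start, end) range format are assumptions for illustration; how the result would be handed to "Split PDF" is not specified here.

```python
import re

# A number at the end of a line, allowing trailing spaces or tabs.
PAGE_AT_END = re.compile(r"(\d+)[ \t]*$", re.MULTILINE)


def extract_start_pages(toc_text: str, offset: int = 0) -> list[int]:
    """Step 1: run the regex over the TOC text and collect chapter start pages.

    `offset` (an assumption) shifts printed page numbers to PDF page
    positions, which often differ when front matter is numbered separately.
    """
    return sorted({int(n) + offset for n in PAGE_AT_END.findall(toc_text)})


def page_ranges(start_pages: list[int], total_pages: int) -> list[tuple[int, int]]:
    """Step 2: turn start pages into inclusive (start, end) ranges."""
    ranges = []
    for i, start in enumerate(start_pages):
        # A chapter ends one page before the next one starts.
        end = start_pages[i + 1] - 1 if i + 1 < len(start_pages) else total_pages
        ranges.append((start, end))
    return ranges


toc_text = "Introduction .... 3\nUsage .... 9\nNo number here\n"
starts = extract_start_pages(toc_text)
ranges = page_ranges(starts, total_pages=20)
```

Pages before the first start page (for example the table of contents itself) are not covered by these ranges and would need their own range if they should become a separate PDF.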

Rudra-241 commented 1 week ago

For PDFs with predefined outlines, check this draft: https://github.com/Stirling-Tools/Stirling-PDF/pull/1786
