pdfstructure detects, splits and organises the document's text content into its natural structure as envisioned by the author.
The document structure, or hierarchy, stores the relation between chapters, sections and their subsections in a nested, recursive manner.
pdfstructure is in early development and built on top of pdfminer.six; text extraction is performed via pdfminer.high_level.extract_pages().
class StructuredDocument:
    metadata: dict
    sections: List[Section]

class Section:
    content: TextElement
    children: List[Section]
    level: int

class TextElement:
    text: LTTextContainer  # the extracted paragraph from pdfminer
    style: Style
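A minimal sketch of how such a nested structure can be walked, assuming the attribute names from the model above (the actual API may differ slightly):

# Sketch only: recursively print the heading of each section and its subsections.
# Attribute names (sections, children, content, text) follow the model sketch above.
def print_outline(section, indent=0):
    text = section.content.text
    # in the sketch, text is a pdfminer LTTextContainer, which exposes get_text()
    heading = text.get_text() if hasattr(text, "get_text") else str(text)
    print(" " * indent + heading.strip())
    for child in section.children:
        print_outline(child, indent + 2)

# for section in document.sections:
#     print_outline(section)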
Illustration of document structure
The following screenshot shows sections and subsections with their respective content. In this case, the structure can be parsed by leveraging the font style alone.
PDF source: github.com/TSiege
Parse PDF
from pdfstructure.hierarchy.parser import HierarchyParser
from pdfstructure.source import FileSource
parser = HierarchyParser()
# specify source (that implements source.read())
source = FileSource(path)
# analyse document and parse as nested data structure
document = parser.parse_pdf(source)
To export the parsed structure, use a printer implementation.
from pdfstructure.printer import PrettyStringPrinter
stringExporter = PrettyStringPrinter()
prettyString = stringExporter.print(document)
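Since the printer returns a plain string, it can be written straight to a text file, for example:

# write the pretty-printed structure to disk
with open("resources/parsed/interview_cheatsheet_pretty.txt", "w") as file:
    file.write(prettyString)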
Excerpt of the parsed document (serialized to string)
Parsed data: interview_cheatsheet_pretty.txt
[Search Basics]
  [Breadth First Search]
    [Definition:]
      An algorithm that searches a tree (or graph) by searching levels of the tree first, starting at the root.
      It finds every node on the same level, most often moving left to right.
      While doing this it tracks the children nodes of the nodes on the current level.
      When finished examining a level it moves to the left most node on the next level.
      The bottom-right most node is evaluated last (the node that is deepest and is farthest right of it's level).
    [What you need to know:]
      Optimal for searching a tree that is wider than it is deep.
      Uses a queue to store information about the tree while it traverses a tree.
      Because it uses a queue it is more memory intensive than depth first search.
      The queue uses more memory because it needs to stores pointers
from pathlib import Path

from pdfstructure.printer import JsonFilePrinter

printer = JsonFilePrinter()
file_path = Path("resources/parsed/interview_cheatsheet.json")
printer.print(document, file_path=str(file_path.absolute()))
Parsed data: interview_cheatsheet.json
Excerpt of exported json
{
  "metadata": {
    "style_info": {
      "_data": {
        "7.99": 24,
        "9.6": 1,
        "6.4": 7,
        "7.47": 3,
        "12.8": 7,
        "8.53": 206,
        "10.67": 12,
        "7.25": 14
      },
      "_body_size": 8.53,
      "_min_found_size": 6.4,
      "_max_found_size": 12.8
    },
    "filename": "interview_cheatsheet.pdf"
  },
  "elements": [
    {
      "content": {
        "style": {
          "bold": true,
          "italic": false,
          "font_name": ".SFNSDisplay-Semibold",
          "mapped_font_size": "xlarge",
          "mean_size": 12.8,
          "max_size": 12.806323818403143
        },
        "page": 0,
        "text": "Data Structure Basics"
      },
      "children": [
        {
          "content": {
            "style": {
              "bold": true,
              "italic": false,
              "font_name": ".SFNSDisplay-Semibold",
              "mapped_font_size": "large",
              "mean_size": 10.6,
              "max_size": 10.671936515335972
            },
            "page": 0,
            "text": "Array"
          },
          "children": [
            {
              "content": {
                "style": {
                  "bold": true,
                  "italic": false,
                  "font_name": ".SFNSText-Semibold",
                  "mapped_font_size": "middle",
                  "mean_size": 8.5,
                  "max_size": 8.537549212268772
                },
                "page": 0,
                "text": "Definition:"
              },
              ....
Of course, serialized documents can easily be decoded again and used for further analysis. However, detailed information such as bounding boxes or per-character coordinates is not persisted.
import json

from pdfstructure.model.document import StructuredPdfDocument

with open(file_path) as file:
    jsonString = json.load(file)

document = StructuredPdfDocument.from_json(jsonString)
print(document.title)
$ "interview_cheatsheet.pdf"
Having all paragraphs and sections organised as a general tree, it is straightforward to iterate through the layers and search for specific elements like headlines, or to extract all main headers such as chapter titles.
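For instance, the top-level chapter titles could be collected with something like the following sketch (it assumes the top-level sections are exposed as document.elements and their heading text as content.text, matching the exported JSON shown above):

# Sketch: collect the headings of all top-level sections (chapter titles).
# Attribute names are assumptions based on the exported JSON structure above.
chapter_titles = [section.content.text for section in document.elements]
print(chapter_titles)  # e.g. ["Data Structure Basics", ...]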
Two document traversal generators are available that yield each section in-order or in level-order, respectively.
from pdfstructure.hierarchy.traversal import traverse_in_order
elements_flat_in_order = [element for element in traverse_in_order(document)]
Exemplary illustration of yield order:
"""
5 10
/ \ \
1 2 3
/ | \ |
a b c x
yield order:
- [5,1,a,b,c,2,10,3,x]
"""
[ ] Detect the document layout type (Columns, Book, Magazine)
The layout analysis algorithm provided by pdfminer.six performs well on straightforward documents with default settings. However, more complicated layouts such as scientific papers require custom LAParams settings to retrieve paragraphs in the correct reading order.
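As a sketch of what such tuning looks like, LAParams can be configured directly through pdfminer.six; whether and how pdfstructure forwards these settings into its parser is not shown here and would need to be checked against its API:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LAParams

# tighter line/char margins often help multi-column layouts; the values are illustrative
la_params = LAParams(line_margin=0.3, char_margin=1.5, boxes_flow=0.5)

# pdfminer.six accepts the parameters directly
pages = extract_pages("paper.pdf", laparams=la_params)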