VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0

Segmenting Markdown-converted PDFs into pages #86

Closed by umarbutler 1 month ago

umarbutler commented 5 months ago

Hi @VikParuchuri, thank you very much for creating this invaluable package, which I have already found extremely useful in several projects. Could an option be added to indicate where pages start and end in the output Markdown? Even the ability to add a custom delimiter such as <page> would help.

umarbutler commented 5 months ago

For anyone else interested in preserving page boundaries, I managed to add a page delimiter by:

  1. Replacing the merge_lines() function in markdown.py with the following:

    def merge_lines(blocks, page_blocks: List[Page]):
        text_blocks = []
        prev_type = None
        prev_line = None
        block_text = ""
        block_type = ""
        common_line_heights = [p.get_line_height_stats() for p in page_blocks]
        for page_i, page in enumerate(blocks):
            for block in page:
                block_type = block.most_common_block_type()
                if block_type != prev_type and prev_type:
                    text_blocks.append(
                        FullyMergedBlock(
                            text=block_surround(block_text, prev_type),
                            block_type=prev_type
                        )
                    )
                    block_text = ""
    
                prev_type = block_type
                # Join lines in the block together properly
                for i, line in enumerate(block.lines):
                    line_height = line.bbox[3] - line.bbox[1]
                    prev_line_height = prev_line.bbox[3] - prev_line.bbox[1] if prev_line else 0
                    prev_line_x = prev_line.bbox[0] if prev_line else 0
                    prev_line = line
                    is_continuation = line_height == prev_line_height and line.bbox[0] == prev_line_x
    
                    if block_text:
                        block_text = line_separator(block_text, line.text, block_type, is_continuation)
                    else:
                        block_text = line.text
    
            # This is where the magic happens!
            if page_i != len(blocks) - 1:
                block_text += '\ufffc'  # U+FFFC, the object replacement character
            # This is where the magic ends!
    
        # Append the final block
        text_blocks.append(
            FullyMergedBlock(
                text=block_surround(block_text, prev_type),
                block_type=block_type
            )
        )
        return text_blocks
  2. Replacing lowercase_letters = "a-zà-öø-ÿа-яşćăâđêôơưþðæøå" in the line_separator() function of markdown.py with lowercase_letters = "a-zà-öø-ÿа-яşćăâđêôơưþðæøå\ufffc" (the same set with the delimiter appended, written here as the escape \ufffc). This ensures that delimiters do not cause newlines to be inserted in the middle of lines.

This uses U+FFFC (Unicode's object replacement character) instead of <page> because it is a single character and can therefore be added directly to the lowercase_letters regex character set, with no need to rework the regex patterns. You may replace it with any other single character of your choosing.
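
To illustrate why a single-character delimiter fits into a regex character set, here is a minimal sketch. The pattern below is a simplified stand-in, not marker's actual regex, and `ends_like_lowercase` is a hypothetical helper; only the lowercase_letters string is taken from the thread.

```python
import re

DELIM = "\ufffc"  # U+FFFC, the object replacement character
lowercase_letters = "a-zà-öø-ÿа-яşćăâđêôơưþðæøå"


def ends_like_lowercase(text: str, letters: str) -> bool:
    """Return True if the last character of text is in the character set."""
    # Illustrative pattern only; marker's own patterns are more involved.
    return re.search(rf"[{letters}]$", text) is not None


line = "end of page" + DELIM

# With the original set, a line ending in the delimiter does not look like
# it ends in a lowercase letter.
before = ends_like_lowercase(line, lowercase_letters)

# Appending the delimiter to the set makes it behave like a lowercase letter.
after = ends_like_lowercase(line, lowercase_letters + DELIM)

print(before, after)
```

A multi-character delimiter such as <page> could not simply be appended this way, since each of its characters would be added to the set individually.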

This is a bit of a hacky solution so I'd still like to see page segmentation implemented officially in marker.
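
Once the delimiter is in the output, the Markdown can be split back into pages. This is a minimal sketch that assumes the delimiter is U+FFFC as described above; `split_pages` and the sample string are hypothetical and not part of marker.

```python
PAGE_DELIMITER = "\ufffc"  # U+FFFC, assumed to be the delimiter that was inserted


def split_pages(markdown: str, delimiter: str = PAGE_DELIMITER) -> list[str]:
    """Split converted Markdown into one string per page."""
    # Empty strings are kept so that list indices still line up with pages.
    return markdown.split(delimiter)


# Made-up sample standing in for marker's output.
sample = (
    "First page text."
    + PAGE_DELIMITER
    + "Second page text.\n\n"
    + PAGE_DELIMITER
    + "Third page."
)

pages = split_pages(sample)
for number, text in enumerate(pages, start=1):
    print(f"--- page {number} ---")
    print(text)
```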

nunamia commented 5 months ago

Yes, you need to edit schema.py

image

and edit markdown.py:

    def merge_lines(blocks, page_blocks: List[Page]):
        text_blocks = []
        prev_type = None
        prev_line = None
        block_text = ""
        block_type = ""
        block_pnum = 0
        common_line_heights = [p.get_line_height_stats() for p in page_blocks]
        for page in blocks:
            for block in page:
                block_pnum = block.pnum
                block_type = block.most_common_block_type()
                if block_type != prev_type and prev_type:
                    text_blocks.append(
                        FullyMergedBlock(
                            text=block_surround(block_text, prev_type),
                            block_type=prev_type,
                            pnum=block_pnum
                        )
                    )
                    block_text = ""
                prev_type = block_type

                # Join lines in the block together properly
                for i, line in enumerate(block.lines):
                    line_height = line.bbox[3] - line.bbox[1]
                    prev_line_height = prev_line.bbox[3] - prev_line.bbox[1] if prev_line else 0
                    prev_line_x = prev_line.bbox[0] if prev_line else 0
                    prev_line = line
                    is_continuation = line_height == prev_line_height and line.bbox[0] == prev_line_x
                    if block_text:
                        block_text = line_separator(block_text, line.text, block_type, is_continuation)
                    else:
                        block_text = line.text

        # Append the final block
        text_blocks.append(
            FullyMergedBlock(
                text=block_surround(block_text, prev_type),
                block_type=block_type,
                pnum=block_pnum
            )
        )
        return text_blocks

image

Terranic commented 2 months ago

@nunamia How about submitting this solution as a pull request so it can be merged?

However, I'm observing issues with the page numbers. I have a document from the EU Parliament where every page has content, but the page numbers appear too often and jump.

image

umarbutler commented 2 months ago

@Terranic Try out my solution; I haven't run into that issue with it.

VikParuchuri commented 2 months ago

Thanks for the script, @umarbutler. This is on my list of features to include, as a few people have asked for it.

HaileyStorm commented 2 months ago

Here's a script to monkeypatch Marker with @umarbutler's solution:

import ast
import inspect
import marker.postprocessors.markdown

class MarkdownTransformer(ast.NodeTransformer):
    def __init__(self):
        self.current_function = None

    def visit_FunctionDef(self, node):
        # Store the current function name
        self.current_function = node.name
        # Visit all the child nodes within the function
        self.generic_visit(node)
        # Reset current function name to None after leaving the function
        self.current_function = None
        return node

    def visit_Assign(self, node):
        if self.current_function == 'line_separator':
            if isinstance(node.targets[0], ast.Name) and node.targets[0].id == 'lowercase_letters':
                if isinstance(node.value, ast.Constant) and isinstance(node.value.value, str):
                    original_value = node.value.value
                    # Append the page delimiter (U+FFFC) to the character set,
                    # as in step 2 of the solution above
                    new_value = original_value + '\ufffc'
                    node.value = ast.Constant(value=new_value)
        return node

    def visit_For(self, node):
        if self.current_function == 'merge_lines':
            # Check if the loop iterates over a variable named 'page'
            if isinstance(node.target, ast.Name) and node.target.id == 'page':
                # Change the loop to use enumerate
                node.iter = ast.Call(
                    func=ast.Name(id='enumerate', ctx=ast.Load()),
                    args=[node.iter],
                    keywords=[]
                )
                node.target = ast.Tuple(elts=[
                    ast.Name(id='page_i', ctx=ast.Store()),
                    ast.Name(id='page', ctx=ast.Store())
                ], ctx=ast.Store())

                # Create the additional check and append operation
                page_check = ast.parse("""
if page_i != len(blocks) - 1:
    block_text += '\ufffc'
""").body[0]
                node.body.append(page_check)
        return node

# Get the source code and make the AST
markdown_source = inspect.getsource(marker.postprocessors.markdown)
markdown_ast = ast.parse(markdown_source)

# Create the AST transformer instance
markdown_transformer = MarkdownTransformer()

# Perform the transformation (explores the tree and applies defined transformation functions, returning the new tree)
markdown_ast = markdown_transformer.visit(markdown_ast)
# Fix missing locations in the modified AST
ast.fix_missing_locations(markdown_ast)

# Replace the functions in the actual module - e.g. internal module calls to
# marker.postprocessors.markdown.line_separator will use the updated version.
exec(compile(markdown_ast, filename='<ast>', mode='exec'), marker.postprocessors.markdown.__dict__)
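
The technique used by the script above (parse source into an AST, rewrite nodes with ast.NodeTransformer, then compile and exec the result) can be seen in isolation on a toy function. This is a minimal sketch; `greet`, `AppendSuffix`, and the suffix are made up for illustration and have nothing to do with marker's code.

```python
import ast

SOURCE = '''
def greet():
    greeting = "hello"
    return greeting
'''


class AppendSuffix(ast.NodeTransformer):
    """Append "!" to the string constant assigned to `greeting`."""

    def visit_Assign(self, node):
        target = node.targets[0]
        if (
            isinstance(target, ast.Name)
            and target.id == "greeting"
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            node.value = ast.Constant(value=node.value.value + "!")
        return node


# Parse, transform, and fill in line/column info for the new node.
tree = AppendSuffix().visit(ast.parse(SOURCE))
ast.fix_missing_locations(tree)

# Execute the rewritten source in a fresh namespace and call the function.
namespace = {}
exec(compile(tree, filename="<ast>", mode="exec"), namespace)
result = namespace["greet"]()
print(result)
```

Executing into a module's own `__dict__` instead of a fresh namespace, as the script above does, replaces the functions in place so that other code calling them picks up the patched versions.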

knysfh commented 1 month ago

To save others some debugging: using @umarbutler's method requires changing two files, marker/schema/merged.py and marker/postprocessors/markdown.py.

Note: tested on marker-pdf==0.2.5.

merged.py

from collections import Counter
from typing import List, Optional

from pydantic import BaseModel

from marker.schema.bbox import BboxElement

class MergedLine(BboxElement):
    text: str
    fonts: List[str]

    def most_common_font(self):
        counter = Counter(self.fonts)
        return counter.most_common(1)[0][0]

class MergedBlock(BboxElement):
    lines: List[MergedLine]
    pnum: int
    block_type: Optional[str]

class FullyMergedBlock(BaseModel):
    text: str
    block_type: str
    pnum: int

markdown.py (replace the merge_lines function):

def merge_lines(blocks: List[List[MergedBlock]]):
    text_blocks = []
    prev_type = None
    prev_line = None
    block_text = ""
    block_type = ""
    block_pnum = 0
    # common_line_heights = [p.get_line_height_stats() for p in page_blocks]
    for page_i, page in enumerate(blocks):
        for block in page:
            block_pnum = block.pnum
            block_type = block.block_type
            if block_type != prev_type and prev_type:
                text_blocks.append(
                    FullyMergedBlock(
                        text=block_surround(block_text, prev_type),
                        block_type=prev_type,
                        pnum=block_pnum
                    )
                )
                block_text = ""

            prev_type = block_type
            # Join lines in the block together properly
            for i, line in enumerate(block.lines):
                line_height = line.bbox[3] - line.bbox[1]
                prev_line_height = prev_line.bbox[3] - prev_line.bbox[1] if prev_line else 0
                prev_line_x = prev_line.bbox[0] if prev_line else 0
                prev_line = line
                is_continuation = line_height == prev_line_height and line.bbox[0] == prev_line_x

                if block_text:
                    block_text = line_separator(block_text, line.text, block_type, is_continuation)
                else:
                    block_text = line.text

        # This is where the magic happens!
        if page_i != len(blocks) - 1:
            block_text += '\ufffc'  # U+FFFC, the object replacement character
        # This is where the magic ends!

    # Append the final block
    text_blocks.append(
        FullyMergedBlock(
            text=block_surround(block_text, prev_type),
            block_type=block_type,
            pnum=block_pnum
        )
    )
    return text_blocks
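
With a pnum on each merged block, the text could also be regrouped by page without any delimiter. The following is a rough sketch under stated assumptions: `Block` is a hypothetical dataclass standing in for FullyMergedBlock, and `group_by_page` is not part of marker. Because blocks of the same type are merged across page boundaries, the recorded pnum may not match the source page exactly.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for FullyMergedBlock with a page number."""
    text: str
    block_type: str
    pnum: int


def group_by_page(blocks: list[Block]) -> dict[int, str]:
    """Join block texts per page number, preserving block order."""
    pages: dict[int, list[str]] = {}
    for block in blocks:
        pages.setdefault(block.pnum, []).append(block.text)
    return {pnum: "\n\n".join(texts) for pnum, texts in pages.items()}


# Made-up sample blocks.
blocks = [
    Block(text="# Title", block_type="Title", pnum=0),
    Block(text="Intro paragraph.", block_type="Text", pnum=0),
    Block(text="More text.", block_type="Text", pnum=1),
]

by_page = group_by_page(blocks)
for pnum, text in by_page.items():
    print(f"--- page {pnum} ---")
    print(text)
```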