VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0

.title() function - alternative for other languages #105

Open BrokenChip231 opened 5 months ago

BrokenChip231 commented 5 months ago

I would like to thank you @VikParuchuri for all of your hard work on this. I am very impressed with my results so far!

I am putting this here because I unfortunately lack the skills to submit a proper PR for my proposed change, and I totally understand if you do not have the time to implement it either.

Many languages do not format headings the way English does (i.e., capitalizing each word). Instead, typically only the first letter of the heading is capitalized.

Accordingly, it would be nice if this could be set via a flag or, even better, inferred from the document language.

I have solved this issue for myself by modifying the block_surround function in marker/markdown.py as shown below. Of course, my solution cannot be adopted as-is in your project, since it would not work for those who need English-style capitalization.


def block_surround(text, block_type):
    if block_type == "Section-header":
        if not text.startswith("#"):
            words = text.strip().split()
            if words:
                words[0] = words[0].capitalize()  # Uppercase the first letter of the first word (the rest of that word is lowercased)
                text = ' '.join(words)  # Rejoin; the remaining words keep their original casing
            text = "\n## " + text + "\n"
    elif block_type == "Title":
        if not text.startswith("#"):
            words = text.strip().split()
            if words:
                words[0] = words[0].capitalize()  # Uppercase the first letter of the first word (the rest of that word is lowercased)
                text = ' '.join(words)  # Rejoin; the remaining words keep their original casing
            text = "# " + text + "\n"
    elif block_type == "Table":
        text = "\n" + text + "\n"
    elif block_type == "List-item":
        pass
    elif block_type == "Code":
        text = "\n" + text + "\n"
    return text
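
For anyone who wants to turn this into a flag, here is a rough sketch of how the two behaviors could coexist. It is not marker's actual API: the `heading_style` parameter, the `format_heading` and `heading_style_for` helpers, and the `TITLE_CASE_LANGUAGES` set are all hypothetical names invented for illustration, and it assumes the current default is Python's `str.title()`, as the issue title suggests.

```python
# Hypothetical set of language codes that use English-style title case.
TITLE_CASE_LANGUAGES = {"en"}


def heading_style_for(language: str) -> str:
    """Pick a heading style from a language code (assumed mapping)."""
    return "title" if language.lower() in TITLE_CASE_LANGUAGES else "sentence"


def format_heading(text: str, style: str = "title") -> str:
    """Apply a capitalization style to a heading.

    "title"    -> English-style, every word capitalized (str.title()).
    "sentence" -> only the first character is uppercased; the rest of
                  the heading keeps the casing found in the document.
    Any other value leaves the text unchanged apart from stripping.
    """
    text = text.strip()
    if style == "title":
        return text.title()
    if style == "sentence":
        return text[:1].upper() + text[1:]
    return text


def block_surround(text, block_type, heading_style="title"):
    """Variant of block_surround with a hypothetical heading_style flag."""
    if block_type == "Section-header":
        if not text.startswith("#"):
            text = "\n## " + format_heading(text, heading_style) + "\n"
    elif block_type == "Title":
        if not text.startswith("#"):
            text = "# " + format_heading(text, heading_style) + "\n"
    elif block_type in ("Table", "Code"):
        text = "\n" + text + "\n"
    # "List-item" and any other block type are returned unchanged.
    return text


# Example: a German heading keeps its noun capitalization.
print(block_surround("einführung in die Physik", "Section-header",
                     heading_style_for("de")))
```

Unlike the `capitalize()` approach above, the "sentence" branch here leaves everything after the first character untouched, so acronyms and capitalized nouns in the source heading survive. Whether that is the right trade-off depends on how clean the casing extracted from the PDF is.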