VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0

The output markdown file has duplicate content #96

Open Xiaoyuan-xyz opened 6 months ago

Xiaoyuan-xyz commented 6 months ago

I downloaded this file and converted it to Markdown:

https://dodcio.defense.gov/Portals/0/Documents/DODAF2/DoDAF%20v2.02%20Chg%201%20Vol%20I%20Final%202015-01-19.pdf

The first few pages convert fine, but the rest of the output is garbled and the same text appears multiple times, like this:

## The Dodaf Conceptual Data Model (Cdm) 4.1.1 The Dodaf Conceptual Data Model (Cdm)
he DoDAF conceptual data model (CDM) presents concepts shared by all DoDAF-compliant he DoDAF conceptual data model (CDM) presents concepts shared by all DoDAF
4-1. This diagram may be read in a straightforward way as simple sentences, with the subject and object in the The DoDAF conceptual data model (CDM) presents concepts shared by all DoDAF
architectural descriptions. Key concepts of the CDM are illustrated in may be read in a straightforward way as simple sentences, with the subject and object in the ovals and the predicate on the lines, as follows:
architectural descriptions. Key concepts of the CDM are illustrated in Figure 4
may be read in a straightforward way as simple sentences, with the subject and object in the ovals and the predicate on the lines, as follows: 

I don't know where the problem is. Is it the PDF itself?

This is a screenshot of this part.

image

When I copy the text directly from the file, I get the following content:

 4.1.1 The DoDAF Conceptual Data Model (CDM)
 The DoDAF Conceptual Data Model (CDM)
 he DoDAF conceptual data model (CDM) presents concepts shared by all DoDAF-compliant 
4-1. This diagram 
The DoDAF conceptual data model (CDM) presents concepts shared by all DoDAF
 architectural descriptions. Key concepts of the CDM are illustrated in 
may be read in a straightforward way as simple sentences, with the subject and object in the 
ovals and the predicate on the lines, as follows:
 he DoDAF conceptual data model (CDM) presents concepts shared by all DoDAF
 architectural descriptions. Key concepts of the CDM are illustrated in Figure 4
 may be read in a straightforward way as simple sentences, with the subject and object in the 
ovals and the predicate on the lines, as follows: 

This is the text extracted by PyPDF2:

 4.1.1  The DoDAF Conceptual Data Model (CDM) 
The DoDAF conceptual data model (CDM) presents conce pts shared by all DoDAF 
architectural descriptions. Key concepts of the CDM  are illustrated in 
may be read in a straightforward way as simple sent ences, with the subject and object in the 
ovals and the predicate on the lines, as follows:
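
To compare the extractions above more systematically, here is a minimal sketch that flags near-duplicate lines in a block of extracted text. It uses only the Python standard library (`difflib`). The function name `find_repeated_lines` and the `threshold` and `min_len` values are my own illustrative choices, not part of marker or PyPDF2, and the sample string is just a few lines pasted from the copied text in this issue.

```python
from difflib import SequenceMatcher


def normalize(line: str) -> str:
    """Collapse runs of whitespace and lowercase the line."""
    return " ".join(line.split()).lower()


def find_repeated_lines(text: str, threshold: float = 0.85, min_len: int = 20):
    """Return (i, j, ratio) for every pair of lines that look like duplicates.

    i and j are 0-based line indices into text.splitlines().
    threshold and min_len are illustrative defaults, not tuned values.
    """
    lines = [normalize(line) for line in text.splitlines()]
    pairs = []
    for i, a in enumerate(lines):
        if len(a) < min_len:
            continue  # very short lines match too easily to be useful
        for j in range(i + 1, len(lines)):
            b = lines[j]
            if len(b) < min_len:
                continue
            ratio = SequenceMatcher(None, a, b).ratio()
            if ratio >= threshold:
                pairs.append((i, j, round(ratio, 2)))
    return pairs


if __name__ == "__main__":
    # A few lines taken from the text copied out of the PDF viewer.
    sample = (
        "may be read in a straightforward way as simple sentences, "
        "with the subject and object in the\n"
        "ovals and the predicate on the lines, as follows:\n"
        "may be read in a straightforward way as simple sentences, "
        "with the subject and object in the\n"
        "ovals and the predicate on the lines, as follows:\n"
    )
    for i, j, ratio in find_repeated_lines(sample):
        print(f"line {i} and line {j} look duplicated (ratio={ratio})")
```

Running this on the text copied from the viewer and on the PyPDF2 output separately would show whether the repeated lines are already present in the PDF's text layer or only appear after conversion.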