Filimoa / open-parse

Improved file parsing for LLM’s
https://filimoa.github.io/open-parse/
MIT License
2.34k stars 89 forks

Fix layout inversion bug #33

Open ic-xu opened 4 months ago

ic-xu commented 4 months ago

description:

Fixes a bug where the layout comes out reversed when parsing a PDF that was exported from a PPT file. As shown in the diagram below, if boxes are ordered by the lower-right corner of the bbox, rectangle A is ranked behind rectangle B. But if rectangle A contains text, that text should be read before rectangle B's. I think using the upper-left corner of the rectangle as the basis for bbox sorting would better match most people's reading habits.

                ^
            Y  |
               |
               |
               |
               |     +----------------------------------------------+(x1,y1)
               |     |                                              |
               |     |   A                                          |
               |     |                                         (x1,y1)
               |     |            +----------------------------+    |
               |     |            |                            |    |
               |     |            |     B                      |    |
               |     |            |                            |    |
               |     |            |                            |    |
               |     |            +----------------------------+    |
               |     |            (x0,y0)                           |
               |     +----------------------------------------------+
               |    (x0,y0)
       +------------------------------------------------------------------------------------------------>
               +                                                                                       X
                                           +
                                           |
                                           |
                                           |
                                           |
                                           |
                                           |
                                           v
          ^
      Y   |          (x0,y0)
          |          +------------------------------------------------+
          |          |                                                |
          |          |                                                |
          |          |   A           (x0,y0)                          |
          |          |               +--------------------------+     |
          |          |               |                          |     |
          |          |               |   B                      |     |
          |          |               |                          |     |
          |          |               |                          |     |
          |          |               +--------------------------+     |
          |          |                                          (x1,y1)
          |          +------------------------------------------------+
          |                                                           (x1,y1)
          |
+--------------------------------------------------------------------------------->
          |                                                                     X
          |
          +

So I think that when switching coordinate systems, (x0, y0) should remain the upper-left corner of the rectangle.
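To make the ordering argument concrete, here is a minimal sketch. It is not open-parse's code: the `Box` type and both sort helpers are hypothetical names, and the coordinates assume a bottom-left origin with y growing upward. It compares ordering by the bottom edge with ordering by the upper-left corner for a rectangle B nested inside a rectangle A.

```python
from typing import NamedTuple


class Box(NamedTuple):
    name: str
    x0: float  # left edge
    y0: float  # bottom edge (origin at bottom-left, y grows upward)
    x1: float  # right edge
    y1: float  # top edge


def sort_by_bottom_edge(boxes):
    # Highest bottom edge first: an enclosing box ends up after the box it contains.
    return sorted(boxes, key=lambda b: (-b.y0, b.x0))


def sort_by_top_left(boxes):
    # Highest top edge first, then leftmost: an enclosing box comes first.
    return sorted(boxes, key=lambda b: (-b.y1, b.x0))


a = Box("A", 10, 10, 100, 60)  # outer rectangle
b = Box("B", 30, 20, 80, 50)   # rectangle nested inside A

print([box.name for box in sort_by_bottom_edge([a, b])])  # ['B', 'A']
print([box.name for box in sort_by_top_left([a, b])])     # ['A', 'B']
```

With the top-left key, A's text is read before B's, which is the order described above.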

Filimoa commented 4 months ago

PyMuPDF uses a top-left coordinate system while the rest of our code uses bottom-left. As a result, we need to swap these for everything to work. Do you have an example PDF?
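For readers following along, the swap between a top-left and a bottom-left origin can be sketched as below. This is an illustration only, not the project's actual conversion code: `flip_bbox` is a hypothetical helper, and it assumes the page height is known and that a bbox is an `(x0, y0, x1, y1)` tuple with `y0 <= y1` in its own system.

```python
def flip_bbox(bbox, page_height):
    """Convert a bbox between top-left-origin and bottom-left-origin coordinates.

    The x values are unchanged. Each y value is mirrored against the page
    height, and the two y values trade places so that y0 <= y1 still holds.
    """
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


# Top-left origin: top edge at y=20, bottom edge at y=70 on an 800-unit-tall page.
top_left_bbox = (10, 20, 110, 70)

bottom_left_bbox = flip_bbox(top_left_bbox, 800)
print(bottom_left_bbox)  # (10, 730, 110, 780)

# Applying the flip twice returns the original box.
print(flip_bbox(bottom_left_bbox, 800) == top_left_bbox)  # True
```

Note that after the flip, `(x0, y0)` is the lower-left corner rather than the upper-left one, which is why the sort key has to account for the coordinate system in use.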

ic-xu commented 4 months ago

Hi, nice to receive your email reply.

I found a Chinese PDF document on the Internet. You only need to look at the parsing results for the first page: pay attention to the order of the title and the email addresses below it, and don't worry about the rest.

I have placed the PDF document in the attachment. Finally, I wish you a joyful mood every day.

Large attachment sent from QQ Mail

test_layout.pdf (6.36M, no expiration). Download page: https://mail.qq.com/cgi-bin/ftnExs_download?k=7c393535f130ff9cde0d327b1761574c404d075401005302150d0650534c01000d0018540957524e5b5b5350520000535a5f0157316e65175d4a416a5d001c0c4d4d1b455507655e&t=exs_ftn_download&code=89551aec