flexpaper / pdf2json

PDF2JSON is a conversion library based on XPDF (3.02) which can be used for high-performance, page-by-page conversion of PDF files to JSON and XML. It also supports compressing data to minimize size. PDF2JSON is available for Windows, OSX and Linux. Please see https://flowpaper.com for more information.

Problem with content in one line being separated into multiple ones #46

Open arthur798 opened 4 years ago

arthur798 commented 4 years ago

Hi guys,

I am trying to find signatures in a document by checking each line to see whether it contains the signature code. However, while doing that I noticed that content which is on the same line gets split into multiple lines for some reason. How can I tackle this?

2020-07-17T09:55:54.404Z    adfcfc04-637c-4228-b256-6a5b3214308c    INFO    Signed%E2%80%A6%E2%80%A6%E2%80%A6%E2%80%A6%E2%80%A6%E2%80%A6%E2%80%A6%E2%80%A6%E2%80%A6.
2020-07-17T09:55:54.404Z    adfcfc04-637c-4228-b256-6a5b3214308c    INFO    SIGN_ABOVE_HERE
2020-07-17T09:55:54.423Z    adfcfc04-637c-4228-b256-6a5b3214308c    INFO    _
2020-07-17T09:55:54.423Z    adfcfc04-637c-4228-b256-6a5b3214308c    INFO    JOHN
2020-07-17T09:55:54.423Z    adfcfc04-637c-4228-b256-6a5b3214308c    INFO    _
2020-07-17T09:55:54.423Z    adfcfc04-637c-4228-b256-6a5b3214308c    INFO    Signed%E2%80%A6%E2%80%A6%E2%80%A6%E2%80%A6%E2%80%A6%E2%80%A6%E2%80%A6%E2%80%A6%E2%80%A6.
2020-07-17T09:55:54.423Z    adfcfc04-637c-4228-b256-6a5b3214308c    INFO    SIGN
2020-07-17T09:55:54.423Z    adfcfc04-637c-4228-b256-6a5b3214308c    INFO    _ABOVE_HERE_SUSAN_

I am attaching a doc so you can see what it looks like too: doc.docx
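
One possible way to tackle this, sketched under assumptions: if each text fragment in the converter output carries a position, fragments that share roughly the same vertical position could be merged back into one line before searching for the signature code. The field names `x`, `y` and `text`, the coordinates in the sample data, the tolerance value, and the helper name `merge_fragments_into_lines` are all hypothetical and would need to be adapted to the actual output. The percent-encoded text is decoded the same way it appears in the log above.

```python
from urllib.parse import unquote


def merge_fragments_into_lines(fragments, y_tolerance=0.1):
    """Merge text fragments that sit on (roughly) the same line.

    `fragments` is a list of dicts with hypothetical keys "x", "y" and
    "text"; the real field names in the converter output may differ.
    """
    lines = []  # each entry: {"y": anchor y value, "parts": [(x, text), ...]}
    for frag in sorted(fragments, key=lambda f: f["y"]):
        # Same line if the fragment is within the tolerance of the line's anchor y.
        if lines and abs(frag["y"] - lines[-1]["y"]) <= y_tolerance:
            lines[-1]["parts"].append((frag["x"], frag["text"]))
        else:
            lines.append({"y": frag["y"], "parts": [(frag["x"], frag["text"])]})

    merged = []
    for line in lines:
        # Restore left-to-right reading order, join, then percent-decode.
        parts = sorted(line["parts"], key=lambda p: p[0])
        merged.append(unquote("".join(text for _, text in parts)))
    return merged


# Sample data modelled on the log above; the coordinates are made up.
fragments = [
    {"x": 5.0, "y": 10.0, "text": "Signed%E2%80%A6."},
    {"x": 5.0, "y": 12.0, "text": "SIGN_ABOVE_HERE"},
    {"x": 12.0, "y": 12.0, "text": "_"},
    {"x": 12.5, "y": 12.0, "text": "JOHN"},
    {"x": 15.0, "y": 12.0, "text": "_"},
    {"x": 5.0, "y": 20.0, "text": "SIGN"},
    {"x": 7.0, "y": 20.02, "text": "_ABOVE_HERE_SUSAN_"},
]

merged_lines = merge_fragments_into_lines(fragments)
signature_lines = [line for line in merged_lines if "SIGN_ABOVE_HERE" in line]
print(signature_lines)
```

With this sample data, the split fragments are rejoined so that both signature lines contain the full `SIGN_ABOVE_HERE` code. The tolerance probably needs tuning per document, since fragments on one visual line may not report exactly the same vertical position.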