UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Extract text while preserving layout. #630

Closed: grinay closed this issue 1 year ago

grinay commented 1 year ago

@EliotJones Hey. I'm curious whether there is a way to extract text while preserving its layout. An example of this kind of extraction is pdftotext from the Poppler tool set. Do you have an idea how to implement that?

EliotJones commented 1 year ago

I'm not familiar with that library/tool, but I think something like Docstrum might do what you need: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#docstrum-for-bounding-boxes-method. It groups words into sections/groups based on the layout in the document.

There are various layout analysis tools that can help. But if you're after something where the output is like:

                      1 Road Lane
                             Town
                              ZIP

Dear Person,

This is a traditional letter fo-
rmat, etc, lorem ipsum blah
blah blah.

Kind Regards,

Sender

Then we don't have any tools to do that currently.

grinay commented 1 year ago

@EliotJones yes, that's exactly what I'm looking for.

BobLd commented 1 year ago

@grinay it won't work out of the box, but you could write your own routine to do so, since you can get the location of letters/words/paragraphs.

EliotJones commented 1 year ago

@grinay you could start with something like this, though it doesn't quite work correctly with multi-column data. Here we group words into lines with a tolerance of 7 units, then append either a single space or several spaces depending on each word's distance from the previous one. You'd need a lot more tuning to get a satisfactory output, and using Docstrum might be a better starting point, but it gives a possible approach:

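 using System;
 using System.Linq;
 using System.Text;
 using UglyToad.PdfPig;
 using UglyToad.PdfPig.Content;

 var sb = new StringBuilder();
 // 'file' is the path to the PDF.
 using (var document = PdfDocument.Open(file, new ParsingOptions { UseLenientParsing = false }))
 {
     var p1 = document.GetPage(1);

     var words = p1.GetWords();

     // Bucket baselines into 7-unit bands: round Y / 7 first, then scale back up.
     // Emit bands top to bottom, since PDF Y coordinates grow upwards.
     var lines = words
         .GroupBy(x => (int)Math.Round(x.Letters[0].StartBaseLine.Y / 7.0) * 7)
         .OrderByDescending(g => g.Key);

     foreach (var line in lines)
     {
         Word previousWord = null;
         foreach (var word in line.OrderBy(x => x.BoundingBox.Left))
         {
             if (previousWord != null)
             {
                 var gap = word.BoundingBox.Left - previousWord.BoundingBox.Right;

                 // Treat twice the first letter's width as one "space" unit.
                 var spaceSize = word.Letters[0].Width * 2;
                 if (spaceSize > 0 && gap > spaceSize)
                 {
                     sb.Append(' ', (int)(gap / spaceSize));
                 }
             }

             // Every word is followed by a single separating space.
             sb.Append(word.Text).Append(' ');
             previousWord = word;
         }

         sb.AppendLine();
     }
 }

 var text = sb.ToString();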

An example output:

Old Gutnish - Wikipedia                        Page 1 of 3 
Old Gutnish 
Old Gutnish was the dialect of Old Norse that was 
spoken on the Baltic island of Gotland. It shows sufficient 
differences from the Old West Norse and Old East Norse 
dialects that it is considered to be a separate branch. 
Gutnish is still spoken in some parts of Gotland and on the 
adjoining island of Fårö. 
The root Gut is identical to Goth, and it is often remarked 
that  the  language has similarities with  the Gothic 
language. These similarities have  led scholars such as 
The approximate extent of Old Norse and 
Elias Wessén and Dietrich Hofmann to suggest that it is 
related languages in the early 10th 
most closely related to Gothic. The best known example of 
century: 
such a similarity is that Gothic and Gutnish called both 
Old West Norse dialect 
adult and young sheep lamb. 
Old East Norse dialect 
The Old Norse diphthong au (e.g. auga "eye") remained in   Old Gutnish 
Old Gutnish and Old West Norse, while  in Old East 
Old English 
Norse – except for peripheral dialects – it evolved into the 
Crimean Gothic 
monophthong ǿ, i.e. a long version of ø. Likewise the 
Other Germanic languages with 
diphthong ai in bain (bone) remained in Old Gutnish 
which Old Norse still retained some 
while it in Old West Norse became ei as in bein and in Old 
mutual intelligibility 
East Norse it became é (bén). Whereas Old West Norse 
had the ey diphthong and Old East Norse evolved the 
monophthong ǿ) Old Gutnish had oy. 
Proto-Germanic  Old Gutnish Old West Norse Old East Norse 
*augô (eye)     auga     auga      auga > ǿga 
*bainą (bone)    bain     bein       bæin > bén 
*hauzijaną (to hear)  hoyra     heyra      høyra > hǿra 
Most of the corpus of Old Gutnish is found in the Gutasaga from the 13th century. 
Language sample 
Citation: 
Þissi þieluar hafþi ann sun sum hit hafþi. En hafþa cuna hit huita stierna þaun tu bygþu fyrsti 
agutlandi fyrstu nat sum þaun saman suafu þa droymdi hennj draumbr. So sum þrir ormar 
warin slungnir saman j barmj hennar Oc þytti hennj sum þair scriþin yr barmi hennar. þinna 
draum segþi han firi hasþa bonda sinum hann riaþ dravm þinna so. Alt ir baugum bundit bo 
land al þitta warþa oc faum þria syni aiga. þaim gaf hann namn allum o fydum. guti al 
https://en.wikipedia.org/wiki/Old_Gutnish                    10/01/2018
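
The sample above keeps the gaps between words but drops leading indentation, so a right-aligned block such as the address in the letter example would still start at column 0. One possible refinement, sketched here and not part of PdfPig, is to map each word's left edge onto a fixed character grid. Everything in this sketch is an assumption made for illustration: the `PlacedWord` record, the `LayoutSketch.Render` method, and the `charWidth` of 6 units per character cell are hypothetical and would need tuning per document. The words are plain values, so in practice they would be filled from `word.Text`, `word.BoundingBox.Left` and `word.Letters[0].StartBaseLine.Y`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical minimal word shape: text, left edge and baseline Y in page units.
public sealed record PlacedWord(string Text, double Left, double BaselineY);

public static class LayoutSketch
{
    // lineTolerance: height of the vertical band that counts as one text line.
    // charWidth: assumed width of one output character cell (a guess, tune it).
    public static string Render(
        IEnumerable<PlacedWord> words,
        double lineTolerance = 7.0,
        double charWidth = 6.0)
    {
        var sb = new StringBuilder();

        // Bucket by baseline and emit top to bottom (PDF Y grows upwards).
        var lines = words
            .GroupBy(w => (int)Math.Round(w.BaselineY / lineTolerance))
            .OrderByDescending(g => g.Key);

        foreach (var line in lines)
        {
            var row = new StringBuilder();
            foreach (var word in line.OrderBy(w => w.Left))
            {
                // Target column for this word on the character grid.
                var column = (int)Math.Round(word.Left / charWidth);

                // Pad up to the target column; after the first word,
                // always keep at least one space between words.
                var minPad = row.Length == 0 ? 0 : 1;
                var pad = Math.Max(column - row.Length, minPad);
                row.Append(' ', pad).Append(word.Text);
            }

            sb.Append(row.ToString()).Append('\n');
        }

        return sb.ToString();
    }
}
```

With the words `1`, `Road`, `Lane` at left edges 120, 132 and 162 on baseline 700, and `Dear`, `Person,` at left edges 0 and 30 on baseline 650, `Render` produces the address line indented by 20 spaces followed by `Dear Person,` at column 0. Like the bucketing above, this still interleaves multi-column pages, and two baselines that straddle a band boundary end up on separate lines.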