I'm not familiar with that library/tool, but I think something like Docstrum might do what you need: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#docstrum-for-bounding-boxes-method. It groups words into sections/groups based on the layout in the document.
There are various layout analysis tools that can help. But if you're after something where the output is like:
1 Road Lane
Town
ZIP
Dear Person,
This is a traditional letter fo-
rmat, etc, lorem ipsum blah
blah blah.
Kind Regards,
Sender
Then we don't have any tools to do that currently.
@EliotJones yes, that's exactly what I'm looking for.
@grinay it won't work out of the box, but you could write your own routine to do so, as you can get the location of letters/words/paragraphs.
@grinay you could start with something like this, though it doesn't work correctly with multi-column data. Here we group words into lines with a tolerance of 7 units, then append either a single space or multiple spaces depending on each word's distance from the previous one. You'd need a lot more tuning to get a satisfactory output, and using Docstrum might be a better starting point, but it gives a possible approach:
using System;
using System.Linq;
using System.Text;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// `file` is the path to the PDF to read.
var sb = new StringBuilder();
using (var document = PdfDocument.Open(file, new ParsingOptions { UseLenientParsing = false }))
{
    var p1 = document.GetPage(1);
    var words = p1.GetWords();
    // Snap each word's baseline to the nearest multiple of 7 units,
    // so words on roughly the same line share a group key.
    var lines = words.GroupBy(x => (int)(Math.Round(x.Letters[0].StartBaseLine.Y / 7.0) * 7));
    foreach (var line in lines)
    {
        Word previousWord = null;
        foreach (var word in line.OrderBy(x => x.BoundingBox.Left))
        {
            if (previousWord != null)
            {
                // A gap wider than twice the first letter's width gets extra padding.
                var gap = word.BoundingBox.Left - previousWord.BoundingBox.Right;
                var spaceSize = word.Letters[0].Width * 2;
                if (gap > spaceSize)
                {
                    sb.Append(' ', (int)(gap / spaceSize));
                }
            }
            sb.Append(word.Text).Append(" ");
            previousWord = word;
        }
        sb.AppendLine();
    }
}
var text = sb.ToString();
An example output:
Old Gutnish - Wikipedia Page 1 of 3
Old Gutnish
Old Gutnish was the dialect of Old Norse that was
spoken on the Baltic island of Gotland. It shows sufficient
differences from the Old West Norse and Old East Norse
dialects that it is considered to be a separate branch.
Gutnish is still spoken in some parts of Gotland and on the
adjoining island of Fårö.
The root Gut is identical to Goth, and it is often remarked
that the language has similarities with the Gothic
language. These similarities have led scholars such as
The approximate extent of Old Norse and
Elias Wessén and Dietrich Hofmann to suggest that it is
related languages in the early 10th
most closely related to Gothic. The best known example of
century:
such a similarity is that Gothic and Gutnish called both
Old West Norse dialect
adult and young sheep lamb.
Old East Norse dialect
The Old Norse diphthong au (e.g. auga "eye") remained in Old Gutnish
Old Gutnish and Old West Norse, while in Old East
Old English
Norse – except for peripheral dialects – it evolved into the
Crimean Gothic
monophthong ǿ, i.e. a long version of ø. Likewise the
Other Germanic languages with
diphthong ai in bain (bone) remained in Old Gutnish
which Old Norse still retained some
while it in Old West Norse became ei as in bein and in Old
mutual intelligibility
East Norse it became é (bén). Whereas Old West Norse
had the ey diphthong and Old East Norse evolved the
monophthong ǿ) Old Gutnish had oy.
Proto-Germanic Old Gutnish Old West Norse Old East Norse
*augô (eye) auga auga auga > ǿga
*bainą (bone) bain bein bæin > bén
*hauzijaną (to hear) hoyra heyra høyra > hǿra
Most of the corpus of Old Gutnish is found in the Gutasaga from the 13th century.
Language sample
Citation:
Þissi þieluar hafþi ann sun sum hit hafþi. En hafþa cuna hit huita stierna þaun tu bygþu fyrsti
agutlandi fyrstu nat sum þaun saman suafu þa droymdi hennj draumbr. So sum þrir ormar
warin slungnir saman j barmj hennar Oc þytti hennj sum þair scriþin yr barmi hennar. þinna
draum segþi han firi hasþa bonda sinum hann riaþ dravm þinna so. Alt ir baugum bundit bo
land al þitta warþa oc faum þria syni aiga. þaim gaf hann namn allum o fydum. guti al
https://en.wikipedia.org/wiki/Old_Gutnish 10/01/2018
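In the output above, the two columns are interleaved rather than laid out side by side. One possible way to get closer to a layout-preserving result is to place each word at a character column derived from its left edge, instead of padding by the gap to the previous word. The following is only a minimal sketch of that idea, not PdfPig functionality and not a description of how any existing tool works. `LayoutGrid`, `Render`, `charWidth` and `lineHeight` are hypothetical names, the default values are guesses that would need tuning per document, and the tuples stand in for the text, left edge and baseline Y you would read from each PdfPig word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class LayoutGrid
{
    // Renders words onto a fixed-width character grid.
    // Each tuple is a stand-in for a word: its text, left edge and baseline Y.
    // charWidth and lineHeight are assumed tuning values in PDF units.
    public static string Render(
        IEnumerable<(string Text, double Left, double BaselineY)> words,
        double charWidth = 6.0,
        double lineHeight = 7.0)
    {
        var sb = new StringBuilder();

        // Bucket words into rows by baseline. PDF Y grows upwards,
        // so rows with a larger key are emitted first.
        var rows = words
            .GroupBy(w => (int)Math.Round(w.BaselineY / lineHeight))
            .OrderByDescending(g => g.Key);

        foreach (var rowWords in rows)
        {
            var row = new StringBuilder();
            foreach (var word in rowWords.OrderBy(w => w.Left))
            {
                // Target column from the word's absolute position on the page.
                var column = Math.Max(0, (int)(word.Left / charWidth));

                // If the previous word already reaches this column,
                // fall back to a single separating space.
                if (row.Length > 0 && column <= row.Length)
                {
                    column = row.Length + 1;
                }

                row.Append(' ', column - row.Length);
                row.Append(word.Text);
            }
            sb.AppendLine(row.ToString());
        }

        return sb.ToString();
    }
}
```

Because columns come from absolute positions, words that start at the same X on different rows line up vertically, which is what keeps a second column or a table cell in place. Proportional fonts will still drift, so a single `charWidth` is a rough approximation at best.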
@EliotJones Hey. I'm curious whether there is a way to extract text while preserving its layout. An example of this kind of extraction is pdftotext from the Poppler tool set. Do you have any idea how to implement that?