flexpaper / pdf2json

PDF2JSON is a conversion library based on XPDF (3.02) which can be used for high-performance, page-by-page conversion of PDFs to JSON and XML format. It also supports compressing data to minimize size. PDF2JSON is available for Windows, OSX and Linux. Please see https://flowpaper.com for more information.

Passages with spaces joined by periods rather than split into separate words #43

Open gordonbisnor opened 4 years ago

gordonbisnor commented 4 years ago

We have noticed an issue where, in some cases, pieces of our text are joined by periods into one massive word rather than split on spaces into individual array members, e.g.:

[1027,54,538,27,38,"Churches.set.up.Christian.schools.in.the.early.1800s..Some.Indigenous.peoples.were."]

Not sure if you have any idea what might cause this – if it’s an issue in our PDFs or something that pdf2json is getting wrong for some reason?
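
Until the cause is known, one possible stopgap is to post-process the affected strings on the consuming side. The sketch below is in Python and is not part of pdf2json; `split_period_joined` is a hypothetical helper name. It assumes that a single period stands in for a space and that a doubled period is a genuine period followed by a space, which is only a guess based on the one sample above. Under that assumption it will mangle decimals, abbreviations, and ellipses, so it should only be applied to strings already known to be affected.

```python
def split_period_joined(text):
    """Split a period-joined string back into words.

    Assumptions (guessed from a single sample, not confirmed by pdf2json):
    - a single "." replaces what was a space in the PDF text
    - ".." is a real sentence-ending period followed by a replaced space
    - one trailing "." is a replaced trailing space and can be dropped
    """
    # Drop one trailing period, assumed to be a replaced trailing space.
    if text.endswith("."):
        text = text[:-1]

    words = []
    for token in text.split("."):
        if token:
            words.append(token)
        elif words:
            # An empty token means two periods were adjacent; treat the
            # first one as a genuine period belonging to the previous word.
            words[-1] += "."
    return words


sample = ("Churches.set.up.Christian.schools.in.the.early.1800s.."
          "Some.Indigenous.peoples.were.")
print(split_period_joined(sample))
```

This only repairs the symptom. It does not tell us whether the periods come from the PDFs themselves (for example, an unusual space glyph in the font) or from pdf2json.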