Erratic Spaces in Words while reading from PDF

dpatro commented 10 years ago

Hi,

I have just started using PDFMiner. It serves my purpose of reading PDFs while preserving the font information.

But I am facing some issues with the extracted content. Erratic white spaces appear inside words. Screenshots attached.

screenshot from 2013-11-04 16 56 06

screenshot from 2013-11-04 16 59 51

The Source pdf is : http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx

I also did a 'print to PDF' before running the tool pdf2txt.py on it.

medecau commented 10 years ago

pdf2txt is unreliable.

Every time I use it I export to XML and then write an extractor script using BeautifulSoup.

dpatro commented 10 years ago

@medecau I tried your way, but the generated XML is very large, and it takes a long time even for a 2 MB PDF. Is there a scalable solution to this? I will be using this program in a production environment. If you have an implementation ready, could you please point me to it?

Also, are there any other alternative solutions to this problem?

medecau commented 10 years ago

I don't know of any solution that would let you build your own programmatic extraction. I haven't done much exploring in that regard, mostly because the content I usually extract is tabular and the extraction process requires some "smart" selection of the contents.

Regarding the size and the time it takes to process: install directly from GitHub, as the version available on PyPI seems old and slower. Then try loading the XML like this:

```BeautifulSoup(contents, 'lxml')```
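
If the XML is still too large to load in one go, a streaming parse is one possible way to keep memory use down. The sketch below swaps BeautifulSoup for the standard library's `xml.etree.ElementTree.iterparse` and groups characters into runs that share a font and size. It assumes the XML contains one `<text font="..." size="...">` element per character, with whitespace emitted as bare `<text>` elements; `SAMPLE_XML` and `iter_font_runs` are made-up names for illustration, so check the layout against your own output first.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample imitating the ASSUMED per-character XML layout;
# a real file would be far larger.
SAMPLE_XML = """<pages>
<page id="1">
<textbox id="0">
<textline>
<text font="ABCDEF+Arial-Bold" size="14.000">B</text>
<text font="ABCDEF+Arial-Bold" size="14.000">i</text>
<text font="ABCDEF+Arial-Bold" size="14.000">g</text>
<text> </text>
<text font="ABCDEF+Arial" size="10.000">d</text>
<text font="ABCDEF+Arial" size="10.000">a</text>
<text font="ABCDEF+Arial" size="10.000">t</text>
<text font="ABCDEF+Arial" size="10.000">a</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""


def iter_font_runs(source):
    """Yield (font, size, text) for consecutive characters sharing a font and size."""
    current = None  # (font, size) of the run being built
    chars = []

    def finish():
        # Collapse whitespace so stray spaces and newlines do not leak out.
        text = " ".join("".join(chars).split())
        return (current[0], current[1], text) if text else None

    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "text":
            font = elem.get("font")
            if font is None:
                # Whitespace without font attributes: attach it to the open run.
                if current is not None:
                    chars.append(elem.text or "")
                continue
            key = (font, float(elem.get("size")))
            if current is not None and key != current:
                run = finish()
                if run:
                    yield run
                chars = []
            current = key
            chars.append(elem.text or "")
        elif elem.tag == "page":
            # Drop the finished page's subtree so memory does not grow per page.
            elem.clear()

    if current is not None:
        run = finish()
        if run:
            yield run


for font, size, text in iter_font_runs(io.BytesIO(SAMPLE_XML.encode("utf-8"))):
    print(font, size, text)
```

For a real document, pass the path of the XML file (or an open binary file object) instead of the `BytesIO` wrapper.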

One other thing you can try is loading the PDF in memory and working with it through the pdfminer API. I was unable to understand the data structures pdfminer uses, so I went with the steps described above. You can even try this with PyPDF2, which seems to be pretty fast.

dpatro commented 10 years ago

@medecau Thanks for the quick reply!

Will give the github source a try tomorrow and update.
I agree that the API should work a bit faster as the generated XML will be in memory instead of being written to a file. Till now I was stuck with finding the proper documentation. I think this is the one I should have been referring to: https://github.com/euske/pdfminer/blob/master/docs/programming.html
My requirement is to have the font size details along with the text from the PDF. For this reason I didn't go for pyPDF.

Thanks again, will be updating soon :)

medecau commented 10 years ago

Note that I referred to PyPDF2 not PyPDF. And I only meant that for speed. It is orders of magnitude faster. I think it has to do with JIT parsing.

To install from the repo:

pip install -U https://github.com/euske/pdfminer/archive/master.zip

I just wish that there would be a way of interacting with the PDF the same way one is able to interact with a BeautifulSoup object after loading the xml generated by pdf2txt.py

lVlayhem commented 10 years ago

I have whitespace in words too.

euske commented 10 years ago

You can disable automatic space insertion by giving the -W0 option, or use -n to disable layout analysis entirely.

Some PDFs don't have a space between words. So pdf2txt.py tries to insert a space when there's a significant blank between letters. I made a change to make it more robust.
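
To illustrate the idea, here is a simplified sketch of gap-based space insertion. It is not pdfminer's actual code: the `Glyph` type, the `join_glyphs` function, the sample coordinates, and the exact threshold formula (margin scaled by the larger of glyph width and height) are all assumptions made for this example.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    char: str
    x0: float      # left edge of the glyph
    x1: float      # right edge of the glyph
    height: float  # glyph height, used to scale the margin


def join_glyphs(glyphs: List[Glyph], word_margin: float = 0.1) -> str:
    """Join glyphs, inserting a space wherever the horizontal gap is large.

    A word_margin of 0 switches space insertion off entirely.
    """
    out = []
    prev_x1 = None
    for g in glyphs:
        if word_margin and prev_x1 is not None:
            # Assumed threshold: a fraction of the glyph's larger dimension.
            margin = word_margin * max(g.x1 - g.x0, g.height)
            if prev_x1 < g.x0 - margin:
                out.append(" ")
        out.append(g.char)
        prev_x1 = g.x1
    return "".join(out)


# "big data" stored without an explicit space glyph; the final "a" is
# kerned slightly loose (gap of 0.9 units after the "t").
GLYPHS = [
    Glyph("b", 0.0, 5.0, 10.0),
    Glyph("i", 5.2, 7.0, 10.0),
    Glyph("g", 7.2, 12.0, 10.0),
    Glyph("d", 16.0, 21.0, 10.0),
    Glyph("a", 21.3, 26.0, 10.0),
    Glyph("t", 26.2, 29.0, 10.0),
    Glyph("a", 29.9, 35.0, 10.0),
]

print(join_glyphs(GLYPHS, word_margin=0.1))   # big data
print(join_glyphs(GLYPHS, word_margin=0.05))  # big dat a
print(join_glyphs(GLYPHS, word_margin=0))     # bigdata
```

With a threshold that is too small, loose kerning is mistaken for a word break, which gives the stray spaces reported here. With insertion switched off, words that the PDF separates only by position run together.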

kwk commented 10 years ago

@euske, are you referring to this commit: 4ef81ae9d8278c3fd5a53d3b62eb2194c86cdb80?

dpatro commented 10 years ago

@euske Thanks for the suggestion. I will try it out. I was about to contact you about this! The XML route was taking a very long time because I had to parse character by character.

dpatro commented 10 years ago

@euske The problem with spaces seems to be under control with -W0, but now a few words are getting run together. :) Anyway, with -W0 the output is much cleaner.

[Off topic] Is there any way of getting images out of the PDF? The Linux command-line tool pdftohtml seems to do a fair job of capturing both text and images, but it doesn't give output as nice as PDFMiner's.

euske commented 10 years ago

> Is there any way of getting images out of the PDF? The Linux command-line tool pdftohtml seems to do a fair job of capturing both text and images, but it doesn't give output as nice as PDFMiner's.

pdf2txt.py can extract images if you give it the -O (output directory) option, but this feature hasn't been tested much yet, and I'm sure some images are not supported.