euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

Erratic Spaces in Words while reading from PDF #28

Open dpatro opened 10 years ago

dpatro commented 10 years ago

Hi,

I have just started using PDFMiner. It serves my purpose of reading PDFs while preserving the font information.

But I am facing some issues with the extracted content: erratic white spaces are showing up inside words. Screenshots attached.

screenshot from 2013-11-04 16 56 06

screenshot from 2013-11-04 16 59 51

The Source pdf is : http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx

I also did a 'print to PDF' before running the pdf2txt tool on it.

medecau commented 10 years ago

pdf2txt is unreliable.

Every time I use it, I export to XML and then write an extractor script using BeautifulSoup.

dpatro commented 10 years ago

@medecau I tried your way, but there is a problem with the size of the XML being generated. It also takes a lot of time, even for a 2 MB PDF. Is there a scalable solution to this? I will be using this program in a production environment. If you have an implementation ready, can you please point me to it?

Also, are there any other alternative solutions to this problem?

medecau commented 10 years ago

I don't know of any solution that would let you build your own programmatic extraction. I haven't done much exploring in that regard, mostly because the content I usually extract is tabular and the extraction process requires some "smart" selection of the contents.

Regarding the size and the time it takes to process: install directly from GitHub, since the version available on PyPI seems old and slower. Then try loading the XML like this:

```BeautifulSoup(contents, 'lxml')```

One other thing you can try is loading the PDF in memory and working with it using the pdfminer API. I was unable to understand the data structures pdfminer uses, so I went with the steps described above. You can even try this with PyPDF2, which seems to be pretty fast.
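
For large files, one way to keep the XML route manageable is to stream the document rather than load it whole. The sketch below swaps BeautifulSoup for the standard library's `xml.etree.ElementTree.iterparse`. `SAMPLE_XML` is a hand-written stand-in that assumes the `page` / `textbox` / `textline` / `text` nesting of pdf2txt.py's XML output, and `iter_textlines` is a hypothetical helper name, not part of pdfminer.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written sample, assumed to mirror the nesting of pdf2txt.py's XML output.
SAMPLE_XML = b"""<pages>
<page id="1">
<textbox id="0">
<textline>
<text font="Arial" size="12.000">B</text>
<text font="Arial" size="12.000">i</text>
<text font="Arial" size="12.000">g</text>
<text> </text>
<text font="Arial-Bold" size="12.000">d</text>
<text font="Arial-Bold" size="12.000">a</text>
<text font="Arial-Bold" size="12.000">t</text>
<text font="Arial-Bold" size="12.000">a</text>
</textline>
</textbox>
</page>
</pages>"""


def iter_textlines(source):
    """Yield (text, fonts) for each <textline>, freeing elements as we go."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "textline":
            continue
        # Each <text> child holds one character plus its font attribute.
        chars = [(t.text or "", t.get("font")) for t in elem.iter("text")]
        text = "".join(ch for ch, _font in chars)
        fonts = sorted({font for _ch, font in chars if font})
        yield text, fonts
        elem.clear()  # drop the per-character children we no longer need


for text, fonts in iter_textlines(io.BytesIO(SAMPLE_XML)):
    print(text, fonts)
```

Working one line at a time keeps memory roughly proportional to the longest line instead of the whole file, while the font names stay available per line.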

dpatro commented 10 years ago

@medecau Thanks for the quick reply!

Thanks again, will be updating soon :)

medecau commented 10 years ago

Note that I referred to PyPDF2 not PyPDF. And I only meant that for speed. It is orders of magnitude faster. I think it has to do with JIT parsing.

To install from the repo:

pip install -U https://github.com/euske/pdfminer/archive/master.zip

I just wish there were a way of interacting with the PDF the same way one can interact with a BeautifulSoup object after loading the XML generated by pdf2txt.py.

lVlayhem commented 10 years ago

I have whitespace in words too.

euske commented 10 years ago

You can disable automatic space insertion by giving the -W0 option, or give -n to disable layout analysis entirely.

Some PDFs don't have a space between words. So pdf2txt.py tries to insert a space when there's a significant blank between letters. I made a change to make it more robust.
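
As a rough illustration of the heuristic described above, the sketch below inserts a space whenever the gap before a glyph exceeds a margin proportional to the glyph size, and skips the check when the margin factor is 0. This is a simplified stand-in, not pdfminer's actual code: `Char` and `join_with_spaces` are hypothetical names, and the coordinates are made up.

```python
from collections import namedtuple

# Stand-in for a positioned glyph; this is not pdfminer's LTChar.
Char = namedtuple("Char", "text x0 x1 height")


def join_with_spaces(chars, word_margin=0.1):
    """Join glyphs left to right, inserting ' ' across significant gaps."""
    out = []
    prev_x1 = None
    for ch in chars:
        width = ch.x1 - ch.x0
        # A word_margin of 0 turns space insertion off entirely.
        if word_margin and prev_x1 is not None:
            margin = word_margin * max(width, ch.height)
            if prev_x1 < ch.x0 - margin:
                out.append(" ")
        out.append(ch.text)
        prev_x1 = ch.x1
    return "".join(out)


# "a" and "b" are separated by a small kerning gap; "c" follows a wide gap.
glyphs = [
    Char("a", 0.0, 5.0, 10.0),
    Char("b", 5.5, 10.5, 10.0),
    Char("c", 14.0, 19.0, 10.0),
]

print(repr(join_with_spaces(glyphs, word_margin=0.1)))
print(repr(join_with_spaces(glyphs, word_margin=0.01)))
print(repr(join_with_spaces(glyphs, word_margin=0)))
```

With a very small margin factor even kerning gaps count as word breaks, which is one way stray spaces can end up inside words. With the factor at 0 nothing is inserted, so words that the PDF stores without an explicit space run together.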

kwk commented 10 years ago

euske, are you referring to commit 4ef81ae9d8278c3fd5a53d3b62eb2194c86cdb80?

dpatro commented 10 years ago

@euske Thanks for the suggestion. I will try it out. I was about to contact you about this! The XML path was taking a very long time because I had to parse character by character.

dpatro commented 10 years ago

@euske The problem with spaces seems to be under control with -W0, but now a few words are getting merged together. :) Anyway, with -W0 the output is much cleaner.

[Off topic] Is there any way of getting images out of the PDF? The Linux command-line tool pdftohtml seems to do a fair job of capturing both text and images, but it doesn't give output as nice as PDFMiner's.

euske commented 10 years ago

> Is there any way of getting images out of the PDF? The Linux command-line tool pdftohtml seems to do a fair job of capturing both text and images, but it doesn't give output as nice as PDFMiner's.

pdf2txt.py can extract images if you give the -O (output directory) option, but that feature hasn't been tested much yet, and I'm sure there are some images that are not supported.

speedplane commented 10 years ago

I'm running into a similar issue. When extracting text from the pdf here: https://www.docketalarm.com/cases/International_Trade_Commission/337-819/Certain_Semiconductor_Chips_with_DRAM_Circuitry_and_Modules_and_Products_Containing_Same/docs/509926/1.pdf

You get the following text: https://www.docketalarm.com/cases/International_Trade_Commission/337-819/Certain_Semiconductor_Chips_with_DRAM_Circuitry_and_Modules_and_Products_Containing_Same/509926/1/?text

Notice that spaces are not respected.

speedplane commented 10 years ago

Also note that the 4ef81ae commit did not help.

euske commented 10 years ago

This is because the part in question is regarded as a "figure" portion and excluded from layout analysis. You can give the -A option to force layout analysis on every portion of a page, but it might take a lot of time for some pages.

speedplane commented 10 years ago

Yes, I see that now. Can you have Chars inside of Figures? Is this a valid PDF?

Regardless, I was able to fix it by adding spacing logic to LTContainer.add (see below). This seems a bit messy because we manually set the word_margin, but it works. Is this a correct approach?

class LTContainer(LTComponent):
...
    def add(self, obj):
        # Perform spacing logic if this is a char
        if isinstance(obj, LTChar) and self._objs:
            margin = .1 * obj.width
            if self._objs[-1].x0 < obj.x0-margin:
                LTContainer.add(self, LTAnno(' '))
        self._objs.append(obj)
        return

speedplane commented 10 years ago

I take that back; I see the correct approach is to set laparams.all_texts = True. However, when I do that, the layout analysis for this PDF returns garbage. I'm still investigating.
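
For anyone following along, setting `all_texts` through the API might look like the sketch below. It assumes the `PDFPage.get_pages` / `PDFPageAggregator` interface found in recent pdfminer releases (and in pdfminer.six). `collect_text` and `extract_pages` are hypothetical helper names, not part of the library.

```python
def collect_text(obj):
    """Recursively gather text from a layout object and its children."""
    if hasattr(obj, "get_text"):
        # Text boxes, text lines and single characters expose get_text().
        return obj.get_text()
    if hasattr(obj, "__iter__"):
        # Pages and figures are containers; descend into them.
        return "".join(collect_text(child) for child in obj)
    return ""  # lines, rectangles, images and so on carry no text


def extract_pages(path, all_texts=True, word_margin=0.1):
    """Yield each page's text, analysing figures too when all_texts is set."""
    # Imported lazily so collect_text can be used without pdfminer installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    laparams = LAParams(all_texts=all_texts, word_margin=word_margin)
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            yield collect_text(device.get_result())
```

Usage would be something like `for text in extract_pages("1.pdf"): print(text)`, where the file name is only a placeholder. Passing `all_texts=False` leaves figure contents as bare characters with no spacing logic applied, which matches the behaviour reported earlier in this thread.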

speedplane commented 10 years ago

I should probably have opened a new bug report; I don't think my issue is related. It looks like the widths of the LTChars in the given PDF are all zero, which messes up the layout analysis.
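
A quick way to check for this symptom is to measure how many characters report a zero width. The sketch below works on any objects with a `width` attribute, as LTChar has. `Glyph` is a stand-in used only for the demonstration, and `zero_width_fraction` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in carrying the same .width attribute that LTChar exposes.
Glyph = namedtuple("Glyph", "text width")


def zero_width_fraction(chars):
    """Return the share of characters whose reported width is zero."""
    chars = list(chars)
    if not chars:
        return 0.0
    zero = sum(1 for ch in chars if ch.width == 0)
    return zero / len(chars)


sample = [Glyph("a", 0.0), Glyph("b", 0.0), Glyph("c", 5.0), Glyph("d", 4.0)]
print(zero_width_fraction(sample))
```

A value close to 1.0 for a real page would suggest the font widths are not being read correctly, so any margin derived from the character width becomes meaningless.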

speedplane commented 10 years ago

I've opened issue #33.