dpatro opened this issue 10 years ago
pdf2txt is not trustworthy.
Every time I use it I export to XML and then write an extractor script using BeautifulSoup.
@medecau I tried your way, but there is a problem with the size of the XML being generated. It is also taking a lot of time, even for a 2 MB PDF. Is there a scalable solution to this? I will be using this program in a production environment. If you have an implementation ready, could you please point me to it?
Also, are there any alternative solutions to this problem?
I don't know of any solution that would let you build your own programmatic extraction. I haven't done much exploring in that regard, mostly because the content I usually extract is tabular and the extraction process requires some "smart" selection of the contents.
Regarding the size and the time it takes to process: install directly from GitHub, since the version available on PyPI seems old and slower. Then try loading the XML like this:

```python
BeautifulSoup(contents, 'lxml')
```
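As an illustration of that route (the XML below is a simplified stand-in for what `pdf2txt.py -t xml` actually emits, where each character is its own `<text>` element; the sample content is made up):

```python
from bs4 import BeautifulSoup

# Simplified sample of pdf2txt.py's XML output: one <text> element
# per character, nested in textline/textbox/page containers.
contents = """
<pages>
  <page id="1">
    <textbox id="0">
      <textline>
        <text font="Helvetica" size="12.0">H</text>
        <text font="Helvetica" size="12.0">i</text>
      </textline>
    </textbox>
  </page>
</pages>
"""

soup = BeautifulSoup(contents, 'lxml')
# Join the per-character <text> elements back into a string.
line = ''.join(t.get_text() for t in soup.find_all('text'))
print(line)  # -> Hi
```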
One other thing you can try is loading the PDF in memory and working with it using the pdfminer API. I was unable to understand the data structure used by pdfminer, so I went with the steps described above. You could even try this with PyPDF2, which seems to be pretty fast.
@medecau Thanks for the quick reply!
Thanks again, will be updating soon :)
Note that I referred to PyPDF2, not PyPDF. And I only meant that for speed: it is orders of magnitude faster. I think it has to do with JIT parsing.
To install from the repo:
```shell
pip install -U https://github.com/euske/pdfminer/archive/master.zip
```
I just wish there were a way of interacting with the PDF the same way one interacts with a BeautifulSoup object after loading the XML generated by pdf2txt.py.
I have whitespace in words too.
You can disable automatic space insertion by giving the -W0 option, or use -n to disable layout analysis entirely.
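For reference, the invocations would look something like this (filename hypothetical):

```shell
# Disable only the automatic space insertion (word_margin = 0)
pdf2txt.py -W0 input.pdf

# Disable layout analysis entirely
pdf2txt.py -n input.pdf
```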
Some PDFs don't have a space between words, so pdf2txt.py tries to insert a space when there's a significant blank between letters. I made a change to make this more robust.
@euske, are you referring to commit 4ef81ae9d8278c3fd5a53d3b62eb2194c86cdb80?
@euske Thanks for the suggestion, I will try it out. I was about to contact you in this regard! The XML path was taking a very long time, as I had to do character-wise parsing.
@euske The problem with spaces seems to be under control with -W0, but now a few words are getting joined together. :) Anyway, with -W0 the output is much cleaner.
[Off topic] Is there any way of getting images out of the PDF? The Linux command-line tool pdftohtml seems to do a fair job of capturing both text and images, but it doesn't give as nice an output as pdfminer.
> Is there any way of getting images out of the PDF?
pdf2txt.py can extract images when given the -O (output directory) option, but the feature hasn't been tested much yet, and I'm sure there are some image formats that are not supported.
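For reference, the invocation would be something like this (directory and filename hypothetical):

```shell
# Write any extracted images into the ./images/ directory
pdf2txt.py -O images/ input.pdf
```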
I'm running into a similar issue. When extracting text from the pdf here: https://www.docketalarm.com/cases/International_Trade_Commission/337-819/Certain_Semiconductor_Chips_with_DRAM_Circuitry_and_Modules_and_Products_Containing_Same/docs/509926/1.pdf
You get the following text: https://www.docketalarm.com/cases/International_Trade_Commission/337-819/Certain_Semiconductor_Chips_with_DRAM_Circuitry_and_Modules_and_Products_Containing_Same/509926/1/?text
Notice that spaces are not respected.
Also note that the 4ef81ae commit did not help.
This is because the part in question is treated as a "figure" portion and excluded from layout analysis. You can give the -A option to force analysis on every portion of a page, but it might take a lot of time for some pages.
Yes, I see that now. Can you have Chars inside of Figures? Is this a valid PDF?
Regardless, I was able to fix it by adding spacing logic to LTContainer.add (see below). This seems a bit messy because we manually set the word_margin, but it works. Is this a correct approach?

```python
class LTContainer(LTComponent):
    ...
    def add(self, obj):
        # Perform spacing logic if this is a char
        if isinstance(obj, LTChar) and self._objs:
            margin = 0.1 * obj.width
            if self._objs[-1].x0 < obj.x0 - margin:
                LTContainer.add(self, LTAnon(' '))
        self._objs.append(obj)
        return
```
I take that back; I see the correct approach is to set laparams.all_texts = True. However, when I do that, the layout analysis for this PDF returns garbage. I'm still investigating.
I should probably have opened a new bug report; I don't think my issue is related. It looks like the widths of the LTChars in the given PDF are all zero, which messes up the layout analysis.
I've opened issue #33.
Hi,
I have just started using PDFMiner. It solves my purpose of reading PDFs while preserving font information.
But I am facing some issues with the read content: there are some erratic white spaces appearing within words. Pics attached.
The Source pdf is : http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx
I also did 'print to PDF' before running pdf2txt.py on it.