TomRoush / PdfBox-Android

The Apache PdfBox project ported to work on Android
Apache License 2.0

Very slow extracting text #139

Open adepase opened 6 years ago

adepase commented 6 years ago

I think I've missed something, because I can't believe it takes tens of seconds (or even minutes) to extract text. Can you please help me? This is my code (I start by calling simpleReadPdf):

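// Signature inferred from the call in simpleReadPdf below; the post starts mid-method.
public static PDDocument getPdfDoc(File file, Context context) {
        try {
            return PDDocument.load(file);
        } catch (IOException e) {
            // Probably an encrypted document
            e.printStackTrace();
            return null;
        }
    }
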
public static String simpleReadPdf(File file, Context context){
        String text = null;
        PDFBoxResourceLoader.init(context);
        PDDocument document = FileReaderUtils.getPdfDoc(file, context);

        try {
            text = extractTextFromPDF(document);
        } catch (IOException ioe){
            // Probably an encrypted document
            ioe.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (document != null) document.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        // text stays null if loading or extraction failed
        return text;
    }

public static String extractTextFromPDF(PDDocument doc) throws IOException
    {
        String dataS = null;
        try
        {
            PDFTextStripper textStripper = new PDFTextStripper();
            textStripper.setStartPage(1);
            textStripper.setEndPage(3);
            dataS = textStripper.getText(doc);
        }
        finally
        {
            if (doc != null) doc.close();
        }
        return dataS;
    }

TomRoush commented 6 years ago

What are the specs of the device you're testing on? Stripping text is slow, but for the number of pages you're stripping, it shouldn't take minutes.

adepase commented 6 years ago

I'm testing on a Samsung S7 edge (so, pretty good hardware) and trying to strip the attached PDF. It takes 5 seconds just for pages 0 to 3. (BTW: according to the docs, the page range should start at 1 and be inclusive, but if I start from 1 I miss the front page.)

confessioni[1].pdf

Thank you

TomRoush commented 6 years ago

5 seconds for those pages is about what I would expect and is similar to my time. As I mentioned before, text stripping is slow. The start page should be 1-indexed, as you said; I'll look into why it's 0-indexed.
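
To see where those seconds go, one option is to time the load step and the strip step separately. The following is only a minimal sketch in plain Java: `StageTimer` and `Timed` are hypothetical names, not part of PdfBox-Android, and the helper simply wraps any block of work with `System.nanoTime()`.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of PdfBox-Android): times a single stage of work. */
final class StageTimer {

    /** The value a stage produced plus its elapsed wall-clock time in milliseconds. */
    static final class Timed<T> {
        final T value;
        final long millis;

        Timed(T value, long millis) {
            this.value = value;
            this.millis = millis;
        }
    }

    /** Runs the stage once and reports how long it took. */
    static <T> Timed<T> time(Callable<T> stage) throws Exception {
        long start = System.nanoTime();
        T value = stage.call();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        return new Timed<>(value, elapsedMillis);
    }
}
```

In the code from the first post, the two stages to wrap would presumably be `PDDocument.load(file)` and `textStripper.getText(doc)`, for example `StageTimer.time(() -> PDDocument.load(file))`, logging `millis` for each to tell parsing time apart from stripping time.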

arsh-7 commented 6 years ago

Extracting text in the sample is fast, whereas when I use it in my own app it slows down massively. Any idea how to fix it?

Thank you for this library.

TomRoush commented 6 years ago

@pro-preet It's slower using the same PDF stripping in your code than it is stripping from the sample?

mobilecityCZ commented 5 years ago

I have a similar issue - text extraction is very slow, but only if the phone (S7) is connected to a PC. (Tested with a small PDF, 20 words only...)

jenmo917 commented 5 years ago

+1

peterdk commented 4 years ago

I was looking to easily extract text, but a single-page PDF with 35 lines of actual content takes 20 s or so on a fairly recent device (Nokia 8.1, Android 10). I did not expect that.

I expected the text to already be present in the PDF, so that extraction would be simple. Apparently not?

Update: if you are looking to extract text in under 1 s, I just found https://github.com/benjinus/android-support-pdfium which works very fast.

pranayzv commented 4 years ago

Try using a thread to fetch the data before it's needed. You'll have to design an algorithm around when you want or need the data. Hope it helps!
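
A rough sketch of that idea in plain Java, assuming the slow extraction can be expressed as a `Supplier<String>`: `TextPrefetcher` is a hypothetical name, not part of PdfBox-Android. The work starts on a single background thread as soon as `prefetch` is called, and the caller picks up the result later.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Hypothetical helper (not part of PdfBox-Android): starts slow work before it is needed. */
final class TextPrefetcher implements AutoCloseable {

    // One worker thread, so queued extractions run one after another.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Starts the extraction in the background and returns immediately. */
    CompletableFuture<String> prefetch(Supplier<String> slowExtraction) {
        return CompletableFuture.supplyAsync(slowExtraction, worker);
    }

    /** Stops accepting new work; extractions already submitted still finish. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

The returned future can be consumed with `thenAccept(...)` once the text is ready. Calling `join()` blocks the calling thread, so on Android it should stay off the main thread. This only hides the latency; the stripping itself takes just as long.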