New behaviour of how text is extracted from a page

mtwords commented 6 years ago

Hi everyone

I just updated from the original iText 4.2.0 (https://github.com/ymasory/iText-4.2.0) to your OpenPDF 1.0.5. So far it works fine, but I noticed a change in how text is extracted from a PDF. With the previous version, PdfTextExtractor.getTextFromPage(i) returned plain text; now every word comes back surrounded by markup tags.

For example:

Before: Hello

After: <br class='t-pdf' /><span class="t-word" style="bottom: 81.79%; left: 56.18%; width: 17.45%; height: 0.83%;" id="word7">Hello</span>

I found out that this change was introduced by the following fork, specifically this commit: https://github.com/kulatamicuda/iText-4.2.0/commit/7d7c218a39815cfea17f9d7c522198b52bb4551a#diff-b2e0f949a7f5d2e581f63cedf5f30922

Is there a way to get the old behaviour without using the old "SimpleTextExtractingPdfContentRenderListener" class? I don't want to integrate old code because of maintainability...

Thanks in advance! M.T.

P.S.: I know this change was made in another repository, but the original repository has not been updated for at least 3 years...

asturio commented 6 years ago

@daviddurand do you remember whether there is an alternative that makes the text extraction less verbose?

daviddurand commented 6 years ago

So, this was something that occurred to me when you guys started with my code: while I fixed a number of bugs, I was also pretty thoroughly reworking the text extraction (to meet the needs of my company). This included lots of bug fixes (and more to come), but also extra result information, returned in the form of HTML (not the best idea long term).

The new protocol for extracting text includes a markup parameter (the true passed to the constructor in my sample utility method below). This does not currently control the verbose word markup, but it should. If @mtwords needs a quick fix for the time being, they could change the result code in Word.java, which is currently:

        result.append("<span class=\"t-word\" style=\"bottom: ")
                .append(formatPercent(resultRect.getBottom())).append("; left: ")
                .append(formatPercent(resultRect.getLeft())).append("; width: ")
                .append(formatPercent(resultRect.getWidth())).append("; height: ")
                .append(formatPercent(resultRect.getHeight())).append(";\"")
                .append(" id=\"").append(myId).append("\">")
                .append(escapeHTML(text)).append(" ");

Changing this to result.append(text).append(' '); ought to do it. I will amend the code so that the option passed to the PdfTextExtractor will be passed into the Word, to control all the markup, as well as check for any other places it ought to be suppressed.
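To make the planned change concrete, here is a minimal sketch of a word renderer gated by a markup flag. The class WordRenderSketch, its render method and the escapeHtml helper are hypothetical stand-ins, not the actual Word.java code, and the positional style attribute is left out for brevity.

```java
// Hypothetical sketch only: not the real Word.java from OpenPDF.
class WordRenderSketch {

    // Assumed stand-in for the escapeHTML helper used in the snippet above.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Emits the span markup only when markup is true; otherwise plain text.
    static String render(String text, String id, boolean markup) {
        if (!markup) {
            return text + " ";
        }
        return "<span class=\"t-word\" id=\"" + id + "\">"
                + escapeHtml(text) + "</span> ";
    }

    public static void main(String[] args) {
        System.out.println(render("Hello", "word7", false));
        System.out.println(render("Hello", "word7", true));
    }
}
```

With the flag threaded through like this, callers that pass false would get the word followed by a single space, which is the same result as the quick fix above.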

I also want to restore the text-extraction tests.

Sample utility method I use in my code (change true to false to suppress the PDF pseudo markup and the <br /> tags):

    private String extractAccessibleTextFromReader(PdfReader pdfReader,
                                                   int pageIdx) {
        PdfTextExtractor pdfTextExtractor = new PdfTextExtractor(pdfReader, true);
        String result = "";
        try {
            result = pdfTextExtractor.getTextFromPage(pageIdx);
        } catch (Exception e) {
            throw new TizraException("error parsing PDF file", e);
        }

        return result;
    }
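Since the flag does not yet suppress the per-word spans, another stopgap that avoids patching Word.java is to strip the markup from the returned string. The sketch below uses only the Java standard library. The class MarkupStripper and its strip method are hypothetical names, and the list of entities to unescape is an assumption about what escapeHTML produces.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing helper; not part of OpenPDF.
class MarkupStripper {

    // Matches <br ...> and <br ... /> tags.
    private static final Pattern BR = Pattern.compile("<br\\b[^>]*>");
    // Matches opening and closing span tags, including their attributes.
    private static final Pattern SPAN = Pattern.compile("</?span\\b[^>]*>");

    static String strip(String marked) {
        // Turn line-break tags into newlines, then drop the span wrappers.
        String s = BR.matcher(marked).replaceAll("\n");
        s = SPAN.matcher(s).replaceAll("");
        // Assumed entity set; &amp; is handled last to avoid double unescaping.
        s = s.replace("&lt;", "<").replace("&gt;", ">")
             .replace("&quot;", "\"").replace("&amp;", "&");
        return s.trim();
    }

    public static void main(String[] args) {
        String marked = "<br class='t-pdf' /><span class=\"t-word\" "
                + "style=\"bottom: 81.79%; left: 56.18%; width: 17.45%; height: 0.83%;\" "
                + "id=\"word7\">Hello</span>";
        System.out.println(strip(marked));
    }
}
```

This is regex-based tag stripping, not HTML parsing. It should be adequate for the flat span and br output shown in this thread, but it will also remove anything in the page text that happens to look like one of those tags.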

daviddurand commented 6 years ago

Just to be clear, I don't think it would be good to go back to the original text extraction, as a lot of bugs were removed.

daviddurand commented 6 years ago

I believe that #76 provided what @mtwords needed -- a markup-free text extraction option. I'm closing this issue on that account.

LibrePDF / OpenPDF

New behaviour of how text is extracted from a page #75