Closed mtwords closed 6 years ago
@daviddurand do you remember whether there is an alternative that makes the text extraction less verbose?
This occurred to me when you guys started working with my code: while I fixed a number of bugs, I was also pretty thoroughly reworking the text extraction (to meet the needs of my company). That rework included lots of bug fixes (with more to come), but also extra result information, returned in the form of HTML (not the best idea long term).
The new protocol to extract text includes a markup parameter (the "true" argument in my sample utility method below). This does not currently control the verbose word markup, but it should. If @mtwords needs a quick fix for the time being, they could change the result code in Word.java, which is currently:
result.append("<span class=\"t-word\" style=\"bottom: ")
      .append(formatPercent(resultRect.getBottom())).append("; left: ")
      .append(formatPercent(resultRect.getLeft())).append("; width: ")
      .append(formatPercent(resultRect.getWidth())).append("; height: ")
      .append(formatPercent(resultRect.getHeight())).append(";\"")
      .append(" id=\"").append(myId).append("\">")
      .append(escapeHTML(text)).append(" ");
Changing this to
result.append(text).append(' ');
ought to do it. I will amend the code so that the option passed to the PdfTextExtractor is passed on to the Word and controls all the markup. I will also check for any other places where the markup ought to be suppressed.
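To make the proposed change concrete, here is a minimal sketch of how a markup flag could switch between the two output forms. It is not the actual OpenPDF Word class: the class name WordMarkupSketch, the method appendWord, the useMarkup parameter, and the simplified escapeHtml helper are all hypothetical, the positional style attributes are left out for brevity, and the closing </span> follows the sample output reported in the original question.

```java
/**
 * Hypothetical sketch only; not the real OpenPDF Word class.
 * Shows a boolean flag choosing between marked-up and plain word output.
 */
final class WordMarkupSketch {

    /** Simplified HTML escaping (assumed; the real escapeHTML may differ). */
    static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Appends one word, wrapped in a span only when useMarkup is true. */
    static void appendWord(StringBuilder result, String text, String id, boolean useMarkup) {
        if (useMarkup) {
            // Positional style attributes omitted to keep the sketch short.
            result.append("<span class=\"t-word\" id=\"").append(escapeHtml(id)).append("\">")
                  .append(escapeHtml(text)).append("</span> ");
        } else {
            // Plain mode: the raw word followed by a single space.
            result.append(text).append(' ');
        }
    }
}
```

With the flag set to false, each word is emitted as raw text followed by a space, which matches the quick fix suggested above.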
I also want to restore the text-extraction tests.
Sample utility method I use in my code (change "true" to "false" to suppress the PDF pseudo markup and the <br /> tags):
private String extractAccessibleTextFromReader(PdfReader pdfReader,
                                               int pageIdx) {
    // Second argument is the markup flag: true = emit pseudo markup, false = plain text
    PdfTextExtractor pdfTextExtractor = new PdfTextExtractor(pdfReader, true);
    String result = "";
    try {
        result = pdfTextExtractor.getTextFromPage(pageIdx);
    } catch (Exception e) {
        throw new TizraException("error parsing PDF file", e);
    }
    return result;
}
Just to be clear, I don't think it would be good to go back to the original text extraction, as a lot of bugs were removed.
I believe that #76 provided what @mtwords needed -- a markup-free text extraction option. I'm closing this issue on that account.
Hi everyone
I just updated from the original iText 4.2.0 (https://github.com/ymasory/iText-4.2.0) to your OpenPDF 1.0.5. So far it works fine, but I noticed a change in how text is extracted from a PDF. With the previous version, PdfTextExtractor.getTextFromPage(i) returned the text as "plain text"; now I get every word surrounded by markup tags.
For example: before:
Hello
after:
<br class='t-pdf' /><span class="t-word" style="bottom: 81.79%; left: 56.18%; width: 17.45%; height: 0.83%;" id="word7">Hello</span>
I found out that this change was introduced in the following fork, specifically in this commit: https://github.com/kulatamicuda/iText-4.2.0/commit/7d7c218a39815cfea17f9d7c522198b52bb4551a#diff-b2e0f949a7f5d2e581f63cedf5f30922
Is there a way to get the old behaviour without using the old "SimpleTextExtractingPdfContentRenderListener" class? I don't want to integrate old code because of maintainability...
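As a stopgap that avoids pulling in the old listener class, the markup could also be stripped after extraction. The sketch below is only an illustration under stated assumptions: the class MarkupStripper and the method toPlainText are hypothetical names, each <br> tag is treated as a line break, and the set of HTML entities being unescaped is a guess at what the extractor escapes, not something confirmed against the library.

```java
/**
 * Hypothetical post-processing helper; not part of OpenPDF.
 * Turns the marked-up extractor output back into plain text.
 */
final class MarkupStripper {

    static String toPlainText(String extracted) {
        // Assumption: each <br ... /> tag marks a line break.
        String s = extracted.replaceAll("<br\\b[^>]*>", "\n");
        // Drop all remaining tags, e.g. the <span class="t-word" ...> wrappers.
        s = s.replaceAll("<[^>]+>", "");
        // Assumed entity set; "&amp;" is handled last to avoid double unescaping.
        s = s.replace("&lt;", "<")
             .replace("&gt;", ">")
             .replace("&quot;", "\"")
             .replace("&amp;", "&");
        // Collapse runs of spaces and tabs left between words.
        s = s.replaceAll("[ \\t]+", " ");
        return s.trim();
    }
}
```

Applied to the marked-up sample above, this yields the single word Hello. Because it relies on literal angle brackets in the text being escaped by the extractor, it would be worth checking against a few real pages before depending on it.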
Thanks in advance! M.T.
P.S.: I know this change was made in another repository, but the original repository has not been updated for at least 3 years...