Open jr-991 opened 3 months ago
PDFBox supports reading PDF files, extracting their text, and rendering pages as JPEG images.
Tesseract can read text from images and scanned documents.
Although the two libraries were designed for different file types, they can be combined in one program to extend its functionality as needed.
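One way the two could be combined, as a rough sketch rather than a finished design: take the PDF's embedded text when there is any, and fall back to OCR only when the file turns out to be a scan with no text layer. The class and method names here (`PdfOcrFallback`, `extractOrOcr`) are made up for illustration, and the two suppliers are stand-ins so the snippet runs without either library on the classpath. In a real program the first supplier would presumably wrap PDFBox's `PDFTextStripper.getText(document)`, and the second would wrap an OCR call on a page image rendered by PDFBox, for example through Tess4J, which is an assumption since no Java binding for Tesseract is named here.

```java
import java.util.function.Supplier;

// Hypothetical helper: prefer the PDF's own text layer, use OCR as a fallback.
final class PdfOcrFallback {

    // embeddedText: stand-in for PDFBox text extraction.
    // ocrText: stand-in for OCR on a rendered page image (assumed: Tess4J).
    // The real calls throw checked exceptions (IOException, and Tess4J's
    // TesseractException), so they would need wrapping to fit a Supplier.
    static String extractOrOcr(Supplier<String> embeddedText, Supplier<String> ocrText) {
        String text = embeddedText.get();
        if (text != null && !text.trim().isEmpty()) {
            return text; // the PDF has a usable text layer
        }
        // Only now is the OCR supplier invoked, so the slow path is skipped
        // whenever the text layer is present.
        return ocrText.get();
    }

    public static void main(String[] args) {
        // Simulated digital PDF: the text layer wins.
        System.out.println(extractOrOcr(() -> "text layer", () -> "ocr result"));
        // Simulated scanned PDF: blank text layer, so OCR is used.
        System.out.println(extractOrOcr(() -> "  ", () -> "ocr result"));
    }
}
```

Passing suppliers instead of ready-made strings keeps the OCR step lazy, which matters because rendering and recognizing a page is far slower than reading text that is already in the file.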
/* ---------- PDFBox class to read PDF files ---------- */
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class pdfReader {
    public static void main(String[] args) throws IOException {
        // Loads a document from the chosen directory
        File file = new File("C:/Users/Juan/IdeaProjects/untitled1/pdf/testPDF.pdf");
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = Loader.loadPDF(file)) {
            // Extracts the document's text with the imported PDFTextStripper and prints it
            String text = new PDFTextStripper().getText(document);
            System.out.println(text);
        }
    }
}