Open abdullahbaa5 opened 1 month ago
@abdullahbaa5 You can implement this behavior to suit your particular use case with some modest pre-processing, something like this:
from pypdf import PdfReader, PdfWriter
max_pages = 6
input_pdf = PdfReader("document.pdf")
output_pdf = PdfWriter()
for p in input_pdf.pages[:max_pages]:
    output_pdf.add_page(p)
output_pdf.write("first_six_pages.pdf")
Yes, but a fixed page cap is problematic too: if the pages are sparse (say, only 100 characters each), our 20k-character limit could correspond to 12 pages, or sometimes a bit more.
My service allows only 20k characters, which is around 6 pages of a PDF file. But if someone uploads a PDF with 200+ pages, it takes 6 minutes to process, and only after that can I check how many characters are in the file.
Is there a feature in unstructured that stops processing automatically once a total of x characters has been reached (and indicates in the response that the file was cut off and not processed in full)?
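Whether unstructured has such an option is not settled in this thread. As a possible workaround, the page-based pre-processing above could be driven by a character budget instead of a fixed page count. The sketch below is only an illustration: the helper names `count_pages_within_budget` and `trim_pdf_to_char_budget` are hypothetical, the character count comes from pypdf's `extract_text()` and may differ from what unstructured would report, and the last page kept may overshoot the budget.

```python
from typing import Iterable, Tuple


def count_pages_within_budget(page_texts: Iterable[str], max_chars: int) -> int:
    """Count the leading pages needed to reach max_chars characters.

    Consumes page_texts lazily and stops as soon as the running total
    reaches the budget, so later pages are never extracted.
    The last counted page may push the total past max_chars.
    """
    total = 0
    kept = 0
    for text in page_texts:
        kept += 1
        total += len(text)
        if total >= max_chars:
            break
    return kept


def trim_pdf_to_char_budget(src: str, dst: str, max_chars: int = 20_000) -> Tuple[int, bool]:
    """Write the leading pages of src to dst; return (pages_kept, truncated).

    Hypothetical helper: assumes pypdf is installed and that its text
    extraction is a good enough proxy for the downstream character count.
    """
    from pypdf import PdfReader, PdfWriter  # same library as the snippet above

    reader = PdfReader(src)
    # Generator, so text is extracted one page at a time and only as needed.
    texts = ((page.extract_text() or "") for page in reader.pages)
    kept = count_pages_within_budget(texts, max_chars)

    writer = PdfWriter()
    for page in reader.pages[:kept]:
        writer.add_page(page)
    writer.write(dst)

    # truncated is True when some pages of the original were left out.
    return kept, kept < len(reader.pages)
```

Calling something like `kept, truncated = trim_pdf_to_char_budget("document.pdf", "trimmed.pdf")` and then passing `trimmed.pdf` to unstructured would keep the expensive step bounded, and the `truncated` flag can be included in the service's response to signal that the file was cut off.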