Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

feat/Allow max-pages/max-total-characters that should be parsed #3137

Open abdullahbaa5 opened 1 month ago

abdullahbaa5 commented 1 month ago

My service allows only 20k characters, which is around 6 pages of a PDF file. But if someone uploads a 200+ page PDF, it takes 6 minutes to process, and only after that can I check how many characters are in the file.

Is there a feature in unstructured that stops processing automatically once x total characters have been reached (and indicates in the response that the file was cut off and not fully processed)?

scanny commented 1 month ago

@abdullahbaa5 You can implement this behavior to suit your particular use case with some modest pre-processing, something like this:

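from pypdf import PdfReader, PdfWriter

# Maximum number of pages to keep from the uploaded file
max_pages = 6

input_pdf = PdfReader("document.pdf")
output_pdf = PdfWriter()

# Copy only the first max_pages pages into the new PDF
for p in input_pdf.pages[:max_pages]:
    output_pdf.add_page(p)

# Partition this smaller file instead of the original
output_pdf.write("first_six_pages.pdf")

abdullahbaa5 commented 1 month ago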

Yes, but that is problematic too: a fixed page limit does not map onto a character limit. If each page has only 100 characters while we allow a maximum of 20k characters, the limit could sometimes fall at 12 pages or a bit more.
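
One way to combine the two ideas is to pick the page cutoff from a running character count instead of a fixed page number. The following is only a sketch under some assumptions: `pages_within_budget` and `trim_pdf_to_char_budget` are hypothetical helper names (not part of unstructured or pypdf), the PDF is assumed to have a text layer so that pypdf's `extract_text()` returns something useful (scanned pages would come back empty), and pypdf's character count will not necessarily match what unstructured later extracts.

```python
from typing import Iterable, Tuple


def pages_within_budget(page_texts: Iterable[str], max_chars: int) -> Tuple[int, bool]:
    """Return (pages_to_keep, truncated) for a total character budget.

    Whole pages are kept while the running total stays within max_chars.
    The iterable is consumed lazily, so extraction stops at the cutoff.
    """
    total = 0
    kept = 0
    for text in page_texts:
        if total + len(text) > max_chars:
            return kept, True  # budget exceeded: remaining pages are dropped
        total += len(text)
        kept += 1
    return kept, False  # the whole document fits


def trim_pdf_to_char_budget(src: str, dst: str, max_chars: int) -> Tuple[int, bool]:
    """Hypothetical helper: write the leading pages of src that fit the budget to dst."""
    # Imported here so the pure helper above can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    # Generator: text is extracted one page at a time, only as far as needed.
    texts = (page.extract_text() or "" for page in reader.pages)
    kept, truncated = pages_within_budget(texts, max_chars)

    writer = PdfWriter()
    for page in reader.pages[:kept]:
        writer.add_page(page)
    writer.write(dst)
    return kept, truncated
```

Calling `trim_pdf_to_char_budget("document.pdf", "trimmed.pdf", 20_000)` would then return the number of pages kept together with a flag that the service can pass along to report that the file was cut off. Note that if the first page alone exceeds the budget, this sketch keeps zero pages; a real service would probably want to handle that case separately.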