Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

Add manual coordinate constraints to `partition_pdf()`. #3072

Open ChiNoel-osu opened 2 months ago

ChiNoel-osu commented 2 months ago

Is your feature request related to a problem? Please describe. I'm using the hi_res strategy for my PDF files because I need to extract all images, tables, etc. The PDF files have headers and footers that I want to remove. Normally that wouldn't be a problem, because I could just filter them out using element.metadata.coordinates.points. In my documents, however, the headers and footers contain logos, so they are detected as Image elements and partition_pdf() runs OCR on them, and OCR is slow. If start/end coordinates could be specified, elements that fall outside them would not go through OCR, which would save time.

Describe the solution you'd like: The ability to specify start/end coordinates in the partition_pdf() function. Maybe it could be added to other partition functions as well.

Describe alternatives you've considered: I came across https://github.com/Unstructured-IO/unstructured/pull/2455, but it's for the fast strategy and it doesn't allow manual constraints either. Right now I have to wait for partition_pdf() to finish and then filter the headers and footers out using element.metadata.coordinates.points:

cleaned_elements = [
    element
    for element in elements
    # Nuke the element even if 1 point is outside the cutcoord.
    if all(
        cutcoord_top < coord[1] < cutcoord_bottom
        for coord in element.metadata.coordinates.points
    )
    and (start_page_number < element.metadata.page_number < stop_page_number)
]

Additional context: Nope.

MthwRobinson commented 2 months ago

Hi @ChiNoel-osu ! Thanks for submitting this. Would filtering on the coordinates after partitioning solve your use case?

ChiNoel-osu commented 1 month ago

@MthwRobinson Yes, and that's what I'm doing right now. But I think filtering out the elements early, before OCR takes place, could significantly speed up the partitioning process.

huangpan2507 commented 2 weeks ago

Hi @ChiNoel-osu, where do "cutcoord_top" and "cutcoord_bottom" come from? And how do I get the value of stop_page_number? Could you help me?

ChiNoel-osu commented 2 weeks ago

Hi @huangpan2507, they're just custom variables; change them to get different results.

huangpan2507 commented 2 weeks ago

@ChiNoel-osu Thanks for your reply, but I'm still wondering how to obtain the values for these custom variables. I ran into a problem with partition_pdf: I want to filter out the header and footer, but it extracts something like

"Page2of 29
2024-1-1
Leave Policy xxx in China
Human Resources"

This text comes from the header, and it ends up in an element whose category is CompositeElement. That is, when I use partition_pdf on a PDF file (containing text, tables, and pictures) and print element.category, the result looks like this: element.category: CompositeElement, element.category: Table, element.category: Table, element.category: CompositeElement, element.category: Table, element.category: CompositeElement, ... The header text "Page2of 29 2024-1-1 Leave Policy xxx in China Human Resources" is inside one of the CompositeElement entries. How can I filter out the header and footer? Can you help me?

ChiNoel-osu commented 2 weeks ago

@huangpan2507 When you call partition_pdf(), each element has coordinates that you can find in element.metadata.coordinates.points. You can then check those numbers and filter out the elements you don't need. In my code, coord[1] is the y coordinate of a point; cutcoord_top and cutcoord_bottom depend on where the header and footer sit in your PDF.
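
One way to pick the cutoff values is to print the vertical extent of each element and read off where the header ends and the footer begins. The following is only a minimal sketch: the `Element`, `Metadata`, and `Coordinates` classes are hypothetical stand-ins that mimic the `element.metadata.coordinates.points` and `element.metadata.page_number` attributes mentioned above, so the example runs without a PDF. The helper names `y_range` and `inside_body`, the sample points, and the cutoffs 80 and 920 are all made up for illustration. It also assumes that y grows downward from the top of the page, as the comparison in the original snippet implies.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


# Hypothetical stand-ins for the objects returned by partition_pdf().
@dataclass
class Coordinates:
    points: Sequence[Point]


@dataclass
class Metadata:
    coordinates: Optional[Coordinates]
    page_number: int


@dataclass
class Element:
    text: str
    metadata: Metadata


def y_range(element: Element) -> Tuple[float, float]:
    """Return the smallest and largest y value among the element's points."""
    ys = [y for _, y in element.metadata.coordinates.points]
    return min(ys), max(ys)


def inside_body(element: Element, cutcoord_top: float, cutcoord_bottom: float) -> bool:
    """True when every point's y value lies strictly between the two cutoffs."""
    return all(
        cutcoord_top < y < cutcoord_bottom
        for _, y in element.metadata.coordinates.points
    )


# Made-up sample data: a header, a body paragraph, and a footer on one page.
elements = [
    Element("Page 2 of 29", Metadata(Coordinates([(50, 20), (50, 60), (400, 60), (400, 20)]), 2)),
    Element("Body paragraph", Metadata(Coordinates([(50, 200), (50, 400), (400, 400), (400, 200)]), 2)),
    Element("Footer text", Metadata(Coordinates([(50, 950), (50, 980), (400, 980), (400, 950)]), 2)),
]

# Step 1: inspect where each element sits vertically.
for el in elements:
    print(el.text, y_range(el))

# Step 2: choose cutoffs between the header/footer and the body, then filter.
cutcoord_top, cutcoord_bottom = 80, 920
cleaned = [el for el in elements if inside_body(el, cutcoord_top, cutcoord_bottom)]
print([el.text for el in cleaned])
```

With real output from partition_pdf(), the same two steps would apply to the returned elements directly. Elements without coordinates would need a separate check, which this sketch leaves out.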

huangpan2507 commented 1 week ago

Great! Thanks for your kind help, I will try it in my case.