Open ChiNoel-osu opened 2 months ago
Hi @ChiNoel-osu ! Thanks for submitting this. Would filtering on the coordinates after partitioning solve your use case?
@MthwRobinson Yes, and that's what I'm doing right now. But I think filtering out the elements early, before OCR takes place, could significantly improve the speed of the partitioning process.
Hi @ChiNoel-osu, where do "cutcoord_top" and "cutcoord_bottom" come from? And how do I get the value of stop_page_number? Could you help me?
Hi @huangpan2507, they're just custom variables; change them to get different results.
@ChiNoel-osu Thanks for your reply, but I'm still unsure about these custom variables: how do I get their values? I ran into a problem with partition_pdf. I want to filter out the header and footer, but it extracts something like
"Page2of 29
2024-1-1
Leave Policy xxx in China
Human Resources"
This text is the header, and it ends up in an element whose category is CompositeElement.
That is, when I use partition_pdf on a PDF file (with text, tables, and pictures inside) and print element.category, the result looks like this:
element .category :CompositeElement
element .category :Table
element .category :Table
element .category :CompositeElement
element .category :Table
element .category :CompositeElement
....
The header text quoted above sits inside one of the CompositeElement elements. How can I filter out the header and footer? Can you help me?
@huangpan2507 When you do partition_pdf(), each element will have a coordinate that you can find by going into element.metadata.coordinates.points. Then you can easily check those numbers and filter out elements you don't need. coord[1] in my code is its y coordinate; cutcoord_top and cutcoord_bottom will be based on the location of the header and footer in your PDF.
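For readers following along, here is a minimal sketch of the coordinate filter described above. It is not the original code from this thread. It assumes each point in element.metadata.coordinates.points is an (x, y) pair and that y grows downward from the top of the page, so header elements have small y values; check this against your own output. The helper names is_in_body, drop_header_footer and fake_element are hypothetical, and fake_element only imitates the attribute layout so the example runs without the library or a PDF.

```python
from types import SimpleNamespace


def is_in_body(element, cutcoord_top, cutcoord_bottom):
    """Return True if the element lies entirely between the two y cut-offs."""
    coordinates = getattr(element.metadata, "coordinates", None)
    points = getattr(coordinates, "points", None)
    if not points:
        # No layout information available: keep the element rather than guess.
        return True
    ys = [coord[1] for coord in points]  # coord[1] is the y coordinate
    # Assumption: y increases downward, so the header sits above cutcoord_top
    # (smaller y) and the footer sits below cutcoord_bottom (larger y).
    return min(ys) >= cutcoord_top and max(ys) <= cutcoord_bottom


def drop_header_footer(elements, cutcoord_top, cutcoord_bottom):
    """Keep only the elements located in the page body."""
    return [el for el in elements if is_in_body(el, cutcoord_top, cutcoord_bottom)]


def fake_element(text, y_top, y_bottom):
    """Hypothetical stand-in that mimics element.metadata.coordinates.points."""
    points = ((50, y_top), (50, y_bottom), (550, y_bottom), (550, y_top))
    return SimpleNamespace(
        text=text,
        metadata=SimpleNamespace(coordinates=SimpleNamespace(points=points)),
    )


# Illustrative values only; real cut-offs depend on your PDF's layout.
elements = [
    fake_element("Page2of 29", 20, 60),
    fake_element("Leave policy body text", 300, 340),
    fake_element("Human Resources", 1500, 1540),
]
body = drop_header_footer(elements, cutcoord_top=100, cutcoord_bottom=1400)
print([el.text for el in body])
```

To pick cutcoord_top and cutcoord_bottom, one option is to print the points of a few known header and footer elements from your own PDF and choose values just inside them. With real output you would pass the list returned by partition_pdf() to drop_header_footer instead of the fake elements.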
Great! Thanks for your kind help, I will try it in my case.
Is your feature request related to a problem? Please describe.
I'm using the hi_res strategy for my PDF files because I need to extract all images, tables, etc. The PDF files have headers and footers that I want to remove. Normally that wouldn't be a problem, because I would just use element.metadata.coordinates.points to filter them out. In my documents, however, the headers and footers contain logos, so they are detected as Image elements and partition_pdf() runs OCR on them, and OCR is slow. If start/end coordinates could be specified, elements considered "outside" would not go through OCR, which would save time.

Describe the solution you'd like
The ability to specify start/end coordinates in the partition_pdf() function. Maybe it could be added to other partition functions as well.

Describe alternatives you've considered
I came across https://github.com/Unstructured-IO/unstructured/pull/2455, but it's for the fast strategy and it doesn't allow manual constraints either. Right now I need to wait for partition_pdf() to finish and then filter the headers and footers out using element.metadata.coordinates.points.

Additional context
Nope.