Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.43k stars 692 forks

feat/group elements by parent_id #1489

Open ron-unstructured opened 11 months ago

ron-unstructured commented 11 months ago

Is your feature request related to a problem? Please describe.
Following up on the document hierarchy implementation, it would be helpful to have a built-in function that groups elements with the same parent_id.

Describe the solution you'd like
Something similar to chunk_by_title, except that the parent type is not always a Title.

Describe alternatives you've considered
Group elements that share a parent_id and, where parent_id is None, treat the previous element as the parent.
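The alternative described above could be sketched roughly as follows. This is only an illustration, not code from the library: `Item` is a hypothetical stand-in for a partitioned element (it assumes each element exposes an id and an optional parent_id), and `group_by_parent_id` and its `fallback_to_previous` flag are made-up names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for an element with an id and an optional parent_id."""

    id: str
    text: str
    parent_id: Optional[str] = None


def group_by_parent_id(
    items: List[Item], fallback_to_previous: bool = True
) -> Dict[Optional[str], List[Item]]:
    """Group items under their parent_id, keeping document order inside each group.

    Assumption: when parent_id is None and fallback_to_previous is True, the
    element immediately before it is treated as its parent. The first element
    has no predecessor, so it stays under the None key.
    """
    groups: Dict[Optional[str], List[Item]] = {}
    previous_id: Optional[str] = None
    for item in items:
        key = item.parent_id
        if key is None and fallback_to_previous:
            key = previous_id
        groups.setdefault(key, []).append(item)
        previous_id = item.id
    return groups


if __name__ == "__main__":
    elements = [
        Item("a", "Section title"),
        Item("b", "First paragraph", parent_id="a"),
        Item("c", "Second paragraph", parent_id="a"),
        Item("d", "Element without a parent"),
    ]
    for parent, children in group_by_parent_id(elements).items():
        print(parent, [child.id for child in children])
```

Note that with the fallback enabled, a top-level element such as a new title would be attached to whatever precedes it, so a caller might prefer `fallback_to_previous=False` and handle root elements separately.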

Additional context
n/a

dantes-ai commented 10 months ago

Is there any news on this, or a workaround to group chunks by parent_id?

qy2144 commented 7 months ago

The current chunk_by_title function does not retain the parent-child relationship. Often a parent is grouped into the previous chunk even though it is not a child of anything in that chunk. I would like to see a new method that respects the parent/child relationship.

weissenbacherpwc commented 4 months ago

I am also looking for such a function.

MthwRobinson commented 4 months ago

If anyone is interested in picking this up as a first issue, I think it would make sense in unstructured/utils.py

huangpan2507 commented 1 month ago

> If anyone is interested in picking this up as a first issue, I think it would make sense in unstructured/utils.py

Hi @MthwRobinson, thanks for the information. Do you mean something like this?

def is_parent_box(parent_target: Box, child_target: Box, add: float = 0.0) -> bool:
    '''True if the child_target bounding box is nested in the parent_target.

    Box format: [x_bottom_left, y_bottom_left, x_top_right, y_top_right].
    The parameter 'add' is the pixel error tolerance for extra pixels outside the parent region.
    '''
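For readers unfamiliar with this kind of check, the containment test that the docstring describes might look like the sketch below. It is not the library's implementation: `box_is_nested` is a hypothetical name, the `Box` alias is assumed to be a four-number sequence in the format given above, and applying the tolerance `add` equally to all four sides is also an assumption.

```python
from typing import Sequence

# Assumed alias: [x_bottom_left, y_bottom_left, x_top_right, y_top_right]
Box = Sequence[float]


def box_is_nested(parent: Box, child: Box, add: float = 0.0) -> bool:
    """Return True if child lies inside parent, allowing `add` pixels of overhang."""
    if len(parent) != 4 or len(child) != 4:
        return False
    px1, py1, px2, py2 = parent
    cx1, cy1, cx2, cy2 = child
    # Grow the parent by `add` on every side, then require full containment.
    return (
        cx1 >= px1 - add
        and cy1 >= py1 - add
        and cx2 <= px2 + add
        and cy2 <= py2 + add
    )
```

A geometric check like this is about layout nesting, so it would complement rather than replace grouping on parent_id.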