Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/dont-clean-bullets-in-partition-docx #3463

Open jgen1 opened 2 months ago

jgen1 commented 2 months ago

Problem

partition_docx removes bullet characters from the text of list items. The relevant code is in unstructured/unstructured/docx/partition.py, lines 474-484:

        # NOTE(scanny) - a list-item gets some special treatment, mutating the text to remove a
        # bullet-character if present.
        if self._is_list_item(paragraph):
            clean_text = clean_bullets(text).strip()
            if clean_text:
                yield ListItem(
                    text=clean_text,
                    metadata=metadata,
                    detection_origin=DETECTION_ORIGIN,
                )
            return

Solution

  1. Make this a configurable parameter
  2. Just remove this from the docx partitioning.

Personally, I am in favor of item 2. In my opinion, text cleaning should not happen inside the partitioning function. My use case requires all text, including bullets, to be extracted from Word documents. Unstructured already has separate cleaning steps, including one that removes bullets, so this code doesn't seem to belong in the partitioning.
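For illustration, option 1 could look roughly like the sketch below. It is not the library's code: `iter_list_item_text`, the `strip_bullets` parameter, and the `strip_leading_bullet` helper (a simplified stand-in for `clean_bullets`, whose real pattern may differ) are all hypothetical names.

```python
import re
from typing import Iterator

# Hypothetical stand-in for unstructured's clean_bullets: strips one leading
# bullet-like character. The pattern used by the real library may differ.
_LEADING_BULLET = re.compile(r"^\s*[\u2022\u25E6\u25AA\u2023\-\*]\s+")


def strip_leading_bullet(text: str) -> str:
    """Remove a single leading bullet character, if one is present."""
    return _LEADING_BULLET.sub("", text, count=1).strip()


def iter_list_item_text(text: str, strip_bullets: bool = True) -> Iterator[str]:
    """Yield the text of a list-item paragraph, optionally keeping its bullet.

    `strip_bullets` is the hypothetical configurable parameter: True mirrors
    the current behavior, False leaves the paragraph text untouched apart
    from surrounding whitespace.
    """
    item_text = strip_leading_bullet(text) if strip_bullets else text.strip()
    if item_text:  # skip paragraphs that end up empty
        yield item_text
```

With `strip_bullets=False`, a paragraph such as "• First item" would be yielded with its bullet intact, while the default keeps today's output unchanged for existing callers.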

@scanny I see you commented on this code chunk, do you have any thoughts?

scanny commented 2 months ago

@jgen1 what is the problem this produces? Or is it just a matter of principle like segregation of responsibilities?

In several formats, like HTML and DOCX off the top of my head, a list-item is indicated semantically, like by being a <li> HTML element or in DOCX by having a List Item paragraph style applied. So there is no bullet character present in the text in those cases.

Removing a "manual" bullet character makes ListItem elements consistent (text only, no leading bullet character) across the various document types, so that's a plus to the way it is at present.

jgen1 commented 2 months ago

Thank you for the quick response @scanny.

In my use case, I want to capture bullet points, or whatever the numbered list item actually is.

I can't speak to HTML, but I know for Word, if you go create a docx file, add in a bunch of bullets/numbered lists and run that through python-docx, it will not include the bullet or numbered list in the text of that "List Item".

  1. This is number 1
  2. This is number 2
     a. This is a

If a file with the content above is loaded into python-docx, it will show each item's text as "This is number 1", "This is number 2", "This is a" without the actual 1., 2., and a. in the text. The same is true for bullets. This is because the numbers or bullets of those list items aren't stored in the paragraph text. So clean_bullets has nothing to remove there: those List Items don't contain bullets in the text anyway.

Now for my use case, where I want those bullets to appear as text, I am able to run a docx macro that converts all the numbered lists to text, but when I then pass the result to partition_docx, it cleans the bullets away.
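To make the point about storage concrete, here is a small sketch using only the standard library. The XML fragment is a hand-written, simplified example of a numbered WordprocessingML paragraph (not taken from a real file): the run text holds only the words, while the numbering is a reference (`w:numPr`) in the paragraph properties that Word resolves when it renders the document.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Simplified, hand-written example of a numbered list paragraph in a .docx body.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:pPr>
    <w:pStyle w:val="ListParagraph"/>
    <w:numPr>
      <w:ilvl w:val="0"/>
      <w:numId w:val="1"/>
    </w:numPr>
  </w:pPr>
  <w:r><w:t>This is number 1</w:t></w:r>
</w:p>
"""

paragraph = ET.fromstring(PARAGRAPH_XML)

# The visible words live in the text runs (<w:r>/<w:t>).
text = "".join(t.text or "" for t in paragraph.iterfind(".//w:t", NS))

# The numbering is only a reference to a numbering definition, not text.
num_id = paragraph.find("./w:pPr/w:numPr/w:numId", NS)
has_numbering = num_id is not None

print(repr(text), has_numbering)
```

Nothing in the run text contains "1.", which is why a library that reads paragraph text from the runs returns "This is number 1" without the number.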

scanny commented 2 months ago

Removed the bug label since the current behavior is the expected behavior.

I think the enhancement idea is to capture bullet metadata, in particular for numbered list-items.

jgen1 commented 2 months ago

Quick distinction: python-docx does not currently capture the bullet metadata for lists, so that would be a feature they would have to implement. What I want here is: if a List Item does happen to contain a bullet string, don't remove that bullet. If the list item text starts with "1. " as part of a numbered list, that would not get removed. The same should hold for other bullet-type characters like "-" and "o". So to me it seems clean_bullets should not be called here, and this enhancement would just be to take the clean_bullets call out of the partition - see PR
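A minimal sketch of that separation, assuming partitioning leaves list-item text untouched and cleaning becomes an explicit second pass. `ListItemSketch`, `partition_list_paragraphs`, `clean_items`, and `strip_leading_bullet` are hypothetical stand-ins, not unstructured's API, and the bullet pattern is simplified.

```python
import re
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ListItemSketch:
    """Hypothetical stand-in for a ListItem element; carries only text."""
    text: str


# Simplified, hypothetical bullet pattern; the library's own may differ.
_LEADING_BULLET = re.compile(r"^\s*[\u2022\-\*]\s+")


def strip_leading_bullet(text: str) -> str:
    """Remove a single leading bullet character, if one is present."""
    return _LEADING_BULLET.sub("", text, count=1).strip()


def partition_list_paragraphs(paragraphs: Iterable[str]) -> List[ListItemSketch]:
    """Partition step: keep the paragraph text as-is, dropping only empty items."""
    return [ListItemSketch(p.strip()) for p in paragraphs if p.strip()]


def clean_items(
    items: Iterable[ListItemSketch], cleaner: Callable[[str], str]
) -> List[ListItemSketch]:
    """Cleaning step: applied only by callers who want bullets removed."""
    return [replace(item, text=cleaner(item.text)) for item in items]


raw = partition_list_paragraphs(["1. This is number 1", "- dash item", "o circle item"])
cleaned = clean_items(raw, strip_leading_bullet)
```

Callers who need the original text use `raw`; callers who want the bullet-free text that list items have today opt in with the second call.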