Open jgen1 opened 2 months ago
@jgen1 what is the problem this produces? Or is it just a matter of principle like segregation of responsibilities?
In several formats, like HTML and DOCX off the top of my head, a list-item is indicated semantically, like by being a <li> HTML element or in DOCX by having a List Item paragraph style applied. So there is no bullet character present in the text in those cases.
Removing a "manual" bullet character makes ListItem elements consistent (text only, no leading bullet character) across the various document types, so that's a plus to the way it is at present.
Thank you for the quick response @scanny.
In my use case, I want to capture bullet points, or whatever the numbered list item actually is.
I can't speak to HTML, but I know that for Word, if you create a docx file, add a bunch of bulleted/numbered lists, and run it through python-docx, the text of each "List Item" will not include the bullet or number.
If a file containing a numbered list like that (items 1., 2., and a.) is loaded into python-docx, it will show each item's text as "This is number 1", "This is number 2", "This is a" without the actual 1., 2., and a. in the text. The same is true for bullets. This is because the marker of a numbered or bulleted list item isn't stored directly in the text. So clean_bullets is not helpful for removing bullets there, because those List Items don't contain them in the text anyway.
Now for my use case, where I want those bullets to appear as text, I am able to run a docx macro to convert all the numbered lists to text, but when I then use that file with partition_docx, it cleans the bullets away.
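To illustrate why the marker never shows up in the text, here is a minimal sketch using only the standard library. The XML is a hand-written, simplified WordprocessingML fragment (not taken from a real file), and `paragraph_info` is a hypothetical helper. The point is that list membership lives in the paragraph properties (`w:numPr`), while the text runs hold only the words.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Simplified, hand-written fragment for illustration only. A real
# word/document.xml has more wrapping elements and attributes.
DOCUMENT_XML = f"""
<w:body xmlns:w="{W}">
  <w:p>
    <w:pPr>
      <w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
    </w:pPr>
    <w:r><w:t>This is number 1</w:t></w:r>
  </w:p>
  <w:p>
    <w:r><w:t>Plain paragraph</w:t></w:r>
  </w:p>
</w:body>
""".strip()


def paragraph_info(p):
    """Return (text, is_list_item) for one w:p element (hypothetical helper)."""
    # The visible text is the concatenation of the w:t runs.
    text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
    # List membership is a paragraph property, not part of the text.
    is_list_item = p.find("w:pPr/w:numPr", NS) is not None
    return text, is_list_item


body = ET.fromstring(DOCUMENT_XML)
paragraphs = [paragraph_info(p) for p in body.iterfind("w:p", NS)]
for text, is_list_item in paragraphs:
    print(repr(text), "list item" if is_list_item else "not a list item")
```

The rendered "1." is computed by Word from the numbering definition that `w:numId` points to, which is why it only appears in the text after a macro converts the list to plain text.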
Removed the bug label since the current behavior is the expected behavior.
I think the enhancement idea is to capture bullet metadata, in particular for numbered list-items.
Quick distinction: python-docx does not currently capture the bullet metadata for lists, so that would be a feature they would have to implement. What I want here is narrower: if a List Item does happen to contain a bullet string, don't remove that bullet. If the list item string contained a "1. " as a manual number, that would not get removed; the same should hold for other bullet-type characters like "-" and "o". So to me it seems clean_bullets should not be applied here, and this enhancement would just be to take the clean_bullets call out of the partitioning - see PR
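A minimal sketch of the requested behavior, under stated assumptions: `list_item_text`, its `keep_bullets` flag, and the bullet set below are all hypothetical names chosen for illustration. This is not the pattern or API that unstructured actually uses.

```python
import re

# Hypothetical bullet characters for illustration only.
BULLET_CHARS = "\u2022\u25cf\u25e6\u2023\u00b7-"
LEADING_BULLET_RE = re.compile(rf"^\s*[{re.escape(BULLET_CHARS)}]\s+")


def list_item_text(raw: str, keep_bullets: bool = True) -> str:
    """Return list-item text, optionally stripping one leading bullet.

    With keep_bullets=True the partitioning step leaves the text alone,
    so any cleaning can happen later as a separate, explicit step.
    """
    if keep_bullets:
        return raw.strip()
    return LEADING_BULLET_RE.sub("", raw, count=1).strip()


print(list_item_text("\u2022 First item"))                      # bullet kept
print(list_item_text("\u2022 First item", keep_bullets=False))  # bullet stripped
print(list_item_text("1. This is number 1", keep_bullets=False))
```

Under this sketch a manual number like "1. " survives in both modes, since the pattern only matches bullet-type characters, and callers who do want the bullet removed can still ask for it explicitly.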
Problem
Using partition_docx removes bullets from text. The relevant code is unstructured/unstructured/docx/partition.py, lines 474-484.
Solution
Personally, I am in favor of item 2. In my opinion, cleaning of text should not occur in the partitioning function. My use case requires all text, including bullets, to be pulled from Word documents. Unstructured has separate steps for cleaning, including removing bullets, so it seems that this code shouldn't be in the partitioning.
@scanny I see you have commented on this code chunk. Do you have any thoughts?