Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.68k stars 707 forks

feat/ extract style or font for Text elements. #2695

Open LunaticMaestro opened 6 months ago

LunaticMaestro commented 6 months ago

I was trying out the tutorial. However, when partitioning the PDF provided in the tutorial, I did not see the font style of the text stored in the element's metadata.

Is font-style extraction planned for the future?

scanny commented 6 months ago

@LunaticMaestro font style is stored in .metadata.emphasized_text_contents and .metadata.emphasized_text_tags. Did you look there?

LunaticMaestro commented 6 months ago

Hi scanny, thanks for the reply. Unfortunately, the suggested metadata fields do not contain the requested content.

See the attached screenshot.

I am using the PDF from the example docs: example-docs/layout-parser-paper.pdf

image

scanny commented 6 months ago

Hi @LunaticMaestro, yes, unfortunately it turns out that this metadata is not supported for PDF. Apologies for that.

It is supported for DOCX, however, if that helps.

LunaticMaestro commented 6 months ago

I beg to differ. Here's an example snippet that reads a DOCX file and fails to detect the font styling.

The DOCX file is attached so you can reproduce this: redacted.docx

image

scanny commented 6 months ago

@LunaticMaestro the file you referenced has character styling set using a character style, which is unfortunately not yet supported.

However, text that is made bold or italic directly, using the toolbar buttons, is properly detected.

I added the following paragraph to the document: "This is a paragraph that has some bold and some italic.", with the words "bold" and "italic" formatted with the toolbar buttons and it produces the following metadata:

{
    'category_depth': 0,
    'emphasized_text_contents': ['bold', 'italic'],
    'emphasized_text_tags': ['b', 'i'],
    'last_modified': '2024-03-27T22:03:51',
    'languages': ['eng'],
    'parent_id': 'ede9865e755cdea84eb99e51cb277a0e',
    'file_directory': '/Users/scanny/Desktop',
    'filename': 'redacted.docx',
    'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
}
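
For anyone consuming this metadata, here is a minimal sketch of pairing each emphasized fragment with its tag. It assumes the two lists are parallel, as they appear to be in the output above, and that you have the metadata as a plain dict (for an element, something like `element.metadata.to_dict()`). The helper name `group_emphasis` is made up for illustration.

```python
from collections import defaultdict


def group_emphasis(metadata: dict) -> dict[str, list[str]]:
    """Group emphasized text fragments by tag (e.g. 'b', 'i').

    Assumes 'emphasized_text_contents' and 'emphasized_text_tags' are
    parallel lists, as in the metadata shown above.
    """
    contents = metadata.get("emphasized_text_contents") or []
    tags = metadata.get("emphasized_text_tags") or []
    grouped: dict[str, list[str]] = defaultdict(list)
    for text, tag in zip(contents, tags):
        grouped[tag].append(text)
    return dict(grouped)


metadata = {
    "emphasized_text_contents": ["bold", "italic"],
    "emphasized_text_tags": ["b", "i"],
}
print(group_emphasis(metadata))  # {'b': ['bold'], 'i': ['italic']}
```

Elements with no directly applied emphasis would simply yield an empty dict here, which matches what you see for text styled through a character style.
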
LunaticMaestro commented 6 months ago

Since unstructured already uses pdfminer, I would expect it to rely on pdfminer's native facilities to get the character properties, for example: pdfminer character style.
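
To illustrate what I mean, here is a rough sketch using pdfminer.six directly, not anything unstructured does today. `extract_pages`, `LTTextContainer`, `LTTextLine` and `LTChar` (with its `fontname` and `size` attributes) are pdfminer.six names; `iter_char_styles` and `guess_tags` are hypothetical helpers, and guessing bold/italic from the font name is only a heuristic that will miss fonts that are not named that way.

```python
from typing import Iterator


def iter_char_styles(pdf_path: str) -> Iterator[tuple[str, str, float]]:
    """Yield (character, font name, font size) for characters in text lines."""
    # Imported lazily so guess_tags() below works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue  # skips figures, images, lines, etc.
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield obj.get_text(), obj.fontname, obj.size


def guess_tags(fontname: str) -> list[str]:
    """Heuristic (an assumption, not a pdfminer feature): infer tags from the font name."""
    name = fontname.lower()
    tags = []
    if "bold" in name:
        tags.append("b")
    if "italic" in name or "oblique" in name:
        tags.append("i")
    return tags


# Example usage against the PDF mentioned earlier in this thread:
# for char, font, size in iter_char_styles("example-docs/layout-parser-paper.pdf"):
#     print(repr(char), font, size, guess_tags(font))
```

Something along these lines could in principle feed the same `emphasized_text_contents` / `emphasized_text_tags` fields for PDFs, plus font name and size.
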

thusithaC commented 3 months ago

Bump. I would really like to have text details such as font, size, etc. as part of the metadata. It should not be too difficult to add, because the underlying PDF extractor usually has this info.