Open LunaticMaestro opened 6 months ago
@LunaticMaestro font style is stored in .metadata.emphasized_text_contents and .metadata.emphasized_text_tags. Did you look there?
Hi scanny, thanks for the reply. Unfortunately, the suggested metadata does not contain the requested content.
Find the screenshot attached.
I am using the PDF from example docs example-docs/layout-parser-paper.pdf
Hi @LunaticMaestro, yes, unfortunately it turns out that this metadata is not supported for PDF; apologies for that.
It is supported for DOCX, however, if that's any help.
I beg to differ. Here's an example snippet that reads a DOCX file and fails to pick up the font elements.
Find the DOCX file attached for the purpose of reproducing: redacted.docx
@LunaticMaestro the file you referenced has character styling set using a character style, which is unfortunately not yet supported.
However, text that is made bold or italic directly, using the toolbar buttons is properly detected.
I added the following paragraph to the document: "This is a paragraph that has some bold and some italic.", with the words "bold" and "italic" formatted with the toolbar buttons and it produces the following metadata:
{
    'category_depth': 0,
    'emphasized_text_contents': ['bold', 'italic'],
    'emphasized_text_tags': ['b', 'i'],
    'last_modified': '2024-03-27T22:03:51',
    'languages': ['eng'],
    'parent_id': 'ede9865e755cdea84eb99e51cb277a0e',
    'file_directory': '/Users/scanny/Desktop',
    'filename': 'redacted.docx',
    'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
}
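For anyone consuming this metadata: judging from the output above, the two lists appear to be parallel, with each entry in emphasized_text_contents matching the tag at the same index in emphasized_text_tags. A minimal sketch of pairing them up, working on a plain dict like the one shown (the helper name `emphasis_pairs` is hypothetical and not part of unstructured):

```python
def emphasis_pairs(metadata: dict) -> list[tuple[str, str]]:
    """Pair each emphasized text run with its tag ('b', 'i', ...).

    Assumes the two lists are index-aligned, as the output above suggests.
    Missing keys (e.g. PDF elements, or text with no emphasis) yield [].
    """
    contents = metadata.get("emphasized_text_contents") or []
    tags = metadata.get("emphasized_text_tags") or []
    return list(zip(contents, tags))


# Subset of the metadata dict shown above.
metadata = {
    "emphasized_text_contents": ["bold", "italic"],
    "emphasized_text_tags": ["b", "i"],
    "filename": "redacted.docx",
}

pairs = emphasis_pairs(metadata)
bold_runs = [text for text, tag in pairs if tag == "b"]
```

With a real element, the dict would presumably come from something like `element.metadata.to_dict()`; treat that call as an assumption and check it against the version of unstructured you have installed.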
Since unstructured re-uses pdfminer (reference), I would expect it to be able to use pdfminer's native implementation to get the character properties. Example: pdf miner character style.
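As a possible workaround until this is exposed in the metadata, here is a rough sketch of reading per-character font information directly with pdfminer.six. It bypasses unstructured entirely and is not how unstructured works internally. `CharStyle`, `iter_char_styles`, `looks_bold`, and `font_histogram` are hypothetical names made up for this example, and detecting bold from the font name is a heuristic assumption that will not hold for every PDF.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class CharStyle(NamedTuple):
    text: str
    fontname: str
    size: float


def iter_char_styles(pdf_path: str) -> Iterator[CharStyle]:
    """Yield one CharStyle per character found by pdfminer's layout analysis."""
    # Imported lazily so the helpers below work without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, curves, images, ...
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    # LTChar carries the font name and size of each glyph.
                    if isinstance(obj, LTChar):
                        yield CharStyle(obj.get_text(), obj.fontname, round(obj.size, 1))


def looks_bold(fontname: str) -> bool:
    # Heuristic assumption: many PDFs encode the weight in the font name,
    # e.g. "ABCDEF+Times-Bold". Fonts that do not follow this are missed.
    return "bold" in fontname.lower()


def font_histogram(chars: Iterable[CharStyle]) -> Counter:
    """Count non-whitespace characters per (fontname, size) combination."""
    return Counter((c.fontname, c.size) for c in chars if not c.text.isspace())
```

For the PDF mentioned above, something like `font_histogram(iter_char_styles("example-docs/layout-parser-paper.pdf")).most_common(5)` should give a quick view of which fonts and sizes dominate the document; the most common pair is usually body text, and larger or bold fonts tend to be headings.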
Bump. I would really have liked to have text details such as font, size, etc. as part of the metadata. It should not be too difficult to add, because the underlying PDF extractor usually has this info.
I was trying out the tutorial. However, when partitioning the PDF provided in the tutorial, I did not see the font style of the text stored in the metadata for the element.
Is font-style extraction planned for the future?