Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.26k stars 670 forks

Table Title and Table content separate chunks: Merge contents of parent_id and element.id #3012

Open weissenbacherpwc opened 3 months ago

weissenbacherpwc commented 3 months ago

Hi,

I am using partition and chunk_by_title to chunk my PDFs. It generally works, but when I inspected the chunks I saw that if one of my documents contains a Table, the title of the table always ends up in one chunk and the actual table content in a separate chunk, which I think is not optimal.

E.g. see this example with a pptx-file:

test = pptx_reader("my_file.pptx")
for i in test:
    if i.metadata.get("filetype") == "application/vnd.openxmlformats-officedocument.presentationml.presentation":
        print(i.page_content)
        print(i.metadata)
        print("+++++++++++++++++++++++++")

Prints:

+++++++++++++++++++++++++
RAG Evaluation: RAGAS
{'file_directory': '...', 'filename': '301123_genai_präsentation.pptx', 'filetype': '...', 'last_modified': '2023-11-30T10:26:30', 'page_number': 15, 'source': '301123_genai_präsentation.pptx', 'source_documents': '301123_genai_präsentation.pptx', 'page': 15}
+++++++++++++++++++++++++
Retrieval Generation Model Context Recall Context Precision Faithfulness Llama 2-Chat 0.86 0.58 0.91 LeoLM-Chat 0.86 0.58 0.81 LeoLM-Mistral-Chat 0.86 0.58 0.87 EM German Leo Mistral 0.86 0.58 0.82 Llama-German-Assistant 0.86 0.58 0.91
{'file_directory': '...', 'filename': '301123_genai_präsentation.pptx', 'last_modified': '2023-11-30T10:26:30', 'page_number': 15, 'parent_id': 'a9e22a24894f5c1dbe9b0b66251bbbc2', 'filetype': '...', 'source': '301123_genai_präsentation.pptx', 'source_documents': '301123_genai_präsentation.pptx', 'page': 15}

Question

I see a parent_id key in the second output. How can I merge the content of the first output (the table heading) with the second output, so that everything ends up in one chunk like this:

RAG Evaluation: RAGAS Retrieval Generation Model Context Recall Context Precision Faithfulness Llama 2-Chat 0.86 0.58 0.91 LeoLM-Chat 0.86 0.58 0.81 LeoLM-Mistral-Chat 0.86 0.58 0.87 EM German Leo Mistral 0.86 0.58 0.82 Llama-German-Assistant 0.86 0.58 0.91

Here is the full code:

import os
import yaml
import box
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.md import partition_md
from unstructured.partition.pptx import partition_pptx
from unstructured.partition.docx import partition_docx
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.xlsx import partition_xlsx
from unstructured.partition.html import partition_html
from langchain_core.documents import Document
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
import re

def filter_elements(elements):
    possible_titles = ["Inhaltsverzeichnis", "Inhalt", "Structure", "Agenda", "Abbildungsverzeichnis", "Gliederung", "Tabellenverzeichnis"] # Filter "Inhaltsverzeichnis"-Pages

    # Find all elements that match any of the possible titles and are categorized as "Title"
    reference_titles = [
        el for el in elements
        if el.text in possible_titles and el.category == "Title"
    ]
    # Get the IDs of the matched title elements
    reference_ids = [title.id for title in reference_titles]
    elements = [el for el in elements if el.metadata.parent_id not in reference_ids]
    # Pattern to detect many dots in a row, which indicates a table of contents. We want to remove these elements.
    elements = [el for el in elements if not re.search(r'\.{50}', el.text)]
    # Filter out small chunks below 60 chars, as they are mostly not meaningful (currently disabled)
    #html_elements = [el for el in html_elements if len(el.text) > 60]
    elements = [el for el in elements if el.category != "Header"]
    elements = [el for el in elements if el.category != "Footer"]
    return elements

def chunk_elements_by_title(elements):
    elements = chunk_by_title(elements,
                            combine_text_under_n_chars=cfg.UNSTRUCTURED_COMBINE_TEXT_UNDER_N_CHARS, #combine_text_under_n_chars=0,
                            max_characters=cfg.UNSTRUCTURED_MAX_CHARACTERS,
                            new_after_n_chars=cfg.UNSTRUCTURED_NEW_AFTER_N_CHARS
                            )
    return elements

def html_reader(filename):
    html_elements = partition_html(filename=filename, mode="elements")
    html_elements = filter_elements(html_elements)
    html_elements = chunk_elements_by_title(html_elements)
    return html_elements

def powerpoint_reader(filename):
    pptx_elements = partition_pptx(filename=filename)
    pptx_elements = chunk_by_title(pptx_elements)
    return pptx_elements

def markdown_reader(filename):
    md_elements = partition_md(filename=filename)
    md_elements = filter_elements(md_elements)
    md_elements = chunk_elements_by_title(md_elements)
    return md_elements

def excel_reader(filename):
    excel_elements = partition_xlsx(filename=filename)
    excel_elements = chunk_by_title(excel_elements)
    return excel_elements

def word_reader(filename):
    word_elements = partition_docx(filename=filename)
    word_elements = filter_elements(word_elements)
    word_elements = chunk_elements_by_title(word_elements)
    return word_elements

def pdf_reader(filename, llm):
    # Start with an empty list so the return below is defined even when
    # chunking is deactivated (assumed fallback; adjust to your needs).
    pdf_elements = []
    if cfg.UNSTRUCTURED_CHUNKING_ACTIVATED:
        pdf_elements = partition_pdf(filename=filename,
                                    strategy="hi_res",
                                    infer_table_structure=True,
                                    languages=["eng", "deu"],
                                    )
        pdf_elements = filter_elements(pdf_elements)
        pdf_elements = chunk_elements_by_title(pdf_elements)
        print(f'PDF Chunking of file {filename} Done')
    return pdf_elements

vs759 commented 1 month ago

Hi, did you find any solution to this? I am having the same problem and would like the table title and content to be in the same chunk to provide appropriate context to the content.

huangpan2507 commented 1 month ago

+1, good question!

scanny commented 1 month ago

If a Title element and whatever element follows it will both fit within max_characters, they will be combined in the same chunk. If not, the Title element will be in a chunk by itself.

So one approach is to increase max_characters, which will allow more titles to be combined with the element that follows them.
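
As a rough illustration of the rule described above, here is a simplified model of the size check, not the library's actual implementation. The function name fits_in_one_chunk and the two-character separator between the title and the following text are assumptions made for this sketch; only max_characters comes from the thread.

```python
def fits_in_one_chunk(title_text: str, next_text: str, max_characters: int,
                      separator: str = "\n\n") -> bool:
    """Return True if the title and the following text fit in one chunk together.

    Hypothetical helper: models the rule from the comment above, assuming the
    two texts are joined by `separator` inside the chunk.
    """
    combined_length = len(title_text) + len(separator) + len(next_text)
    return combined_length <= max_characters


title = "Quarterly results"  # 17 characters
body = "x" * 490             # stand-in for a long element, 490 characters

# 17 + 2 + 490 = 509 characters in total
small_limit = fits_in_one_chunk(title, body, max_characters=500)
large_limit = fits_in_one_chunk(title, body, max_characters=1000)
print(small_limit, large_limit)
```

Under this model the pair is split at the smaller limit and kept together at the larger one, which is the effect raising max_characters is meant to have.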

A chunker that did exactly what you're asking for would be a different chunker; that is, it would not just be a configuration of an existing chunker. I think the spec you're asking for is:

A more "pragmatic" approach might be to do partitioning and chunking in separate steps, and combine Title elements with the following element as a middle step, something like this in overall concept:

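from typing import Iterable, Iterator

from unstructured.chunking.basic import chunk_elements
from unstructured.documents.elements import Element, Title
from unstructured.partition.auto import partition

# -- `file` is the path to your document --
elements = partition(file)

def combine_title_elements(elements: Iterable[Element]) -> Iterator[Element]:
    title = None
    for e in elements:
        # -- element is a Title: emit any pending Title (Title directly after a Title), then hold this one --
        if isinstance(e, Title):
            if title:
                yield title
            title = e
        # -- case when prior element was a title --
        elif title:
            # -- the combining function is one you write yourself --
            yield combine_title_with_element_fn_you_wrote_yourself(title, e)
            title = None
        # -- "normal" case when prior element was not a title --
        else:
            yield e

    # -- handle case when last element is a Title --
    if title:
        yield title

chunks = chunk_elements(combine_title_elements(elements))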
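
To see how this generator behaves without installing anything, here is a dry run using small stand-in classes. The Element and Title dataclasses below are simplified assumptions that only carry text (they are not the real unstructured classes), and merge is a hypothetical placeholder for the combining function you would write yourself.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Element:
    """Stand-in for an unstructured element; only carries text."""
    text: str


@dataclass
class Title(Element):
    """Stand-in for a Title element."""


def merge(title: Title, element: Element) -> Element:
    # Hypothetical combining function: prefix the element text with the title text.
    return type(element)(text=f"{title.text} {element.text}".strip())


def combine_title_elements(elements: Iterable[Element]) -> Iterator[Element]:
    title = None
    for e in elements:
        if isinstance(e, Title):
            # A Title directly after a Title: emit the pending one unchanged.
            if title is not None:
                yield title
            title = e
        elif title is not None:
            # The prior element was a Title: merge it into this element.
            yield merge(title, e)
            title = None
        else:
            yield e
    # A trailing Title has nothing to merge with, so emit it as-is.
    if title is not None:
        yield title


sample = [
    Title("RAG Evaluation: RAGAS"),
    Element("Llama 2-Chat 0.86 0.58 0.91"),
    Element("Plain paragraph"),
    Title("Agenda"),
    Title("Results"),
    Element("Details"),
    Title("Appendix"),
]
combined = [e.text for e in combine_title_elements(sample)]
for text in combined:
    print(text)
```

The seven input elements collapse to five: each Title is folded into the element that follows it, while a Title followed by another Title, or a Title at the very end, passes through unchanged.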

huangpan2507 commented 1 month ago

Hi @scanny, I'm interested in your code. What is the combine_title_with_element_fn_you_wrote_yourself function? Can you provide the full code for it? Thanks.

scanny commented 1 month ago

That's the function you write yourself, to combine those elements in whatever way suits your purposes.

It could be as simple as:

def combine_title_with_element(title_element: Title, next_element: Element) -> Element:
    next_element.text = f"{title_element.text} {next_element.text}".strip()
    return next_element

but you may also want to make some adjustments to the metadata depending.
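
One possible shape for such a metadata adjustment is sketched below. The Metadata and Element dataclasses are simplified stand-ins, not the real unstructured classes, and the rule applied to parent_id (re-pointing it to the title's own parent once the title is folded in) is an assumption about what might suit a pipeline, not documented library behavior. Unlike the one-liner above, this version returns a copy instead of mutating next_element.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass
class Metadata:
    """Stand-in for element metadata; only the fields used here."""
    parent_id: Optional[str] = None
    page_number: Optional[int] = None


@dataclass
class Element:
    """Stand-in for an unstructured element with an id, text and metadata."""
    id: str
    text: str
    metadata: Metadata = field(default_factory=Metadata)


def combine_title_with_element(title_element: Element, next_element: Element) -> Element:
    # Copy the metadata so the original element is left untouched.
    merged_metadata = replace(next_element.metadata)
    # Assumed rule: the title no longer exists as a separate element, so a
    # parent_id pointing at it is replaced by the title's own parent_id.
    if merged_metadata.parent_id == title_element.id:
        merged_metadata.parent_id = title_element.metadata.parent_id
    merged_text = f"{title_element.text} {next_element.text}".strip()
    return replace(next_element, text=merged_text, metadata=merged_metadata)


title = Element("title-1", "RAG Evaluation: RAGAS", Metadata(page_number=15))
table = Element("table-1", "Llama 2-Chat 0.86 0.58 0.91",
                Metadata(parent_id="title-1", page_number=15))

merged = combine_title_with_element(title, table)
print(merged.text)
print(merged.metadata)
```

Returning a copy keeps the partitioned elements intact, which can help if the same element list is reused for a second chunking run with different settings.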

huangpan2507 commented 1 month ago

Thanks, @scanny. I guess the chunk_elements function comes from unstructured.chunking.basic (from unstructured.chunking.basic import chunk_elements), right? By the way, I load my files with UnstructuredPDFLoader (from langchain_community.document_loaders import UnstructuredPDFLoader), and I have a question about parent_id. I noticed that several elements with 'category': 'NarrativeText' share the same parent_id, even though in the PDF file some of them belong to different contexts. So what is the principle behind assigning parent_id, and why do these elements have the same parent_id? Can you help me?

scanny commented 1 month ago

@huangpan2507 Sounds like a different question related to PDFs. Best to ask that as a separate issue or on the Unstructured Community Slack channel.

huangpan2507 commented 4 weeks ago

Thanks for your response. OK, I will post an issue on that channel.