Open weissenbacherpwc opened 6 months ago
Hi, did you find any solution to this? I am having the same problem and would like the table title and content to be in the same chunk to provide appropriate context to the content.
+1, good question!
If a Title element and whatever element follows it will both fit within max_characters, they will be combined in the same chunk. If not, the Title element will be in a chunk by itself.
So one approach is to increase max_characters, which will allow more titles to be combined with the element that follows them.
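To make the effect of max_characters concrete, here is a toy model of the rule described above. It is only a sketch: El and toy_chunk are made-up stand-ins, not part of unstructured, and the real chunker does more (for example, it text-splits oversized elements). It just shows why a larger limit lets a title travel with the element that follows it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class El:
    """Hypothetical stand-in for a document element (not the unstructured class)."""
    kind: str  # "Title" or "Text"
    text: str


def toy_chunk(elements: List[El], max_characters: int) -> List[str]:
    """Toy rule: merge a Title with the next element only if both fit together."""
    chunks: List[str] = []
    i = 0
    while i < len(elements):
        e = elements[i]
        nxt = elements[i + 1] if i + 1 < len(elements) else None
        if e.kind == "Title" and nxt is not None and nxt.kind != "Title":
            combined = f"{e.text}\n\n{nxt.text}"
            if len(combined) <= max_characters:
                chunks.append(combined)
                i += 2  # both elements consumed by one chunk
                continue
        # Either not a title, or the pair does not fit: emit the element alone.
        chunks.append(e.text)
        i += 1
    return chunks


elements = [El("Title", "RAG Evaluation: RAGAS"), El("Text", "x" * 100)]
print(len(toy_chunk(elements, max_characters=50)))   # title and body end up separate
print(len(toy_chunk(elements, max_characters=500)))  # title and body share a chunk
```

With the small limit the title ends up alone, which matches the behavior reported in this issue; raising the limit keeps the pair together.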
A chunker that did exactly what you're asking for would be a different chunker; that is, it would not just be a configuration of an existing chunker. I think the spec you're asking for is:
- Combine a Title element with the immediately subsequent element, even if that causes the combined element to be divided using text-splitting.
A more "pragmatic" approach might be to do partitioning and chunking in separate steps, and combine Title elements with the following element as a middle step, something like this in overall concept:
from typing import Iterable, Iterator

from unstructured.chunking.basic import chunk_elements
from unstructured.documents.elements import Element, Title
from unstructured.partition.auto import partition

elements = partition(file)

def combine_title_elements(elements: Iterable[Element]) -> Iterator[Element]:
    title = None
    for e in elements:
        # -- case where Title immediately follows a Title --
        if isinstance(e, Title):
            if title is not None:
                yield title
            title = e
        # -- case when prior element was a title --
        elif title is not None:
            yield combine_title_with_element_fn_you_wrote_yourself(title, e)
            title = None
        # -- "normal" case when prior element was not a title --
        else:
            yield e
    # -- handle case when last element is a Title --
    if title is not None:
        yield title

chunks = chunk_elements(combine_title_elements(elements))
Hi @scanny, I'm interested in your code. What is the combine_title_with_element_fn_you_wrote_yourself function? Could you provide the full code for it? Thanks.
That's the function you write yourself, to combine those elements in whatever way suits your purposes.
It could be as simple as:
def combine_title_with_element(title_element: Title, next_element: Element) -> Element:
    next_element.text = f"{title_element.text} {next_element.text}".strip()
    return next_element
but you may also want to adjust the metadata, depending on your use case.
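As a rough end-to-end illustration of the merge step, the sketch below runs without unstructured installed. The Element and Title classes here are minimal stand-ins written for this example (the real classes live in unstructured.documents.elements and use an ElementMetadata object rather than a dict), and dropping parent_id is only one hypothetical metadata adjustment, not something the library requires.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Optional


@dataclass
class Element:
    """Stand-in for an unstructured element; metadata is a plain dict here."""
    text: str
    metadata: dict = field(default_factory=dict)


class Title(Element):
    """Stand-in for the Title element type."""


def combine_title_with_element(title: Title, nxt: Element) -> Element:
    # Prefix the following element's text with the title text.
    nxt.text = f"{title.text} {nxt.text}".strip()
    # Hypothetical adjustment: the title no longer exists as a separate
    # element, so a parent_id pointing at it is dropped.
    nxt.metadata.pop("parent_id", None)
    return nxt


def combine_title_elements(elements: Iterable[Element]) -> Iterator[Element]:
    title: Optional[Title] = None
    for e in elements:
        if isinstance(e, Title):
            if title is not None:
                yield title  # previous title had nothing to merge into
            title = e
        elif title is not None:
            yield combine_title_with_element(title, e)
            title = None
        else:
            yield e
    if title is not None:
        yield title  # document ended on a title


demo = [
    Title("RAG Evaluation: RAGAS"),
    Element("Llama 2-Chat 0.86 0.58 0.91", {"parent_id": "a9e2", "page_number": 15}),
]
for merged in combine_title_elements(demo):
    print(merged.text, merged.metadata)
```

With the real library, the merged elements would then be passed to chunk_elements as in the earlier snippet.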
Thanks, @scanny. I guess the chunk_elements function comes from "from unstructured.chunking.basic import chunk_elements", right?
By the way, I use "from langchain_community.document_loaders import UnstructuredPDFLoader", and I have a question about the parent_id field. I notice that some elements with 'category': 'NarrativeText' share the same parent_id, but in the PDF file some of those elements belong to different contexts, even though they have the same parent_id and the same 'category': 'NarrativeText'.
So, what is the principle for assigning parent_id, and why do these elements have the same parent_id? Can you help me?
@huangpan2507 Sounds like a different question related to PDFs. Best to ask that as a separate issue or on the Unstructured Community Slack channel.
Thanks for your response. OK, I will post an issue on that channel.
Hi,
I am using partition and chunk_by_title to chunk my PDFs. It generally works, but when I inspected the chunks I saw that if there is a Table in one of my documents, the title of the table is always one chunk and the actual content of the table is a separate chunk, which I think is not optimal.
E.g. see this example with a pptx-file:
Prints:
+++++++++++++++++++++++++
RAG Evaluation: RAGAS
{'file_directory': '...', 'filename': '301123_genai_präsentation.pptx', 'filetype': '...', 'last_modified': '2023-11-30T10:26:30', 'page_number': 15, 'source': '301123_genai_präsentation.pptx', 'source_documents': '301123_genai_präsentation.pptx', 'page': 15}
+++++++++++++++++++++++++
Retrieval Generation Model Context Recall Context Precision Faithfulness Llama 2-Chat 0.86 0.58 0.91 LeoLM-Chat 0.86 0.58 0.81 LeoLM-Mistral-Chat 0.86 0.58 0.87 EM German Leo Mistral 0.86 0.58 0.82 Llama-German-Assistant 0.86 0.58 0.91
{'file_directory': '...', 'filename': '301123_genai_präsentation.pptx', 'last_modified': '2023-11-30T10:26:30', 'page_number': 15, 'parent_id': 'a9e22a24894f5c1dbe9b0b66251bbbc2', 'filetype': '...', 'source': '301123_genai_präsentation.pptx', 'source_documents': '301123_genai_präsentation.pptx', 'page': 15}
Question
So I see a parent_id key in the second output. How can I merge the content of the first output (the table heading) with the second output, so that I would have everything in one chunk:
RAG Evaluation: RAGAS Retrieval Generation Model Context Recall Context Precision Faithfulness Llama 2-Chat 0.86 0.58 0.91 LeoLM-Chat 0.86 0.58 0.81 LeoLM-Mistral-Chat 0.86 0.58 0.87 EM German Leo Mistral 0.86 0.58 0.82 Llama-German-Assistant 0.86 0.58 0.91
Here is the full code: