Thanks, @vaclcer! For context, `TikaDocumentConverter` split documents by page in v1.x (see the v1.x `TikaConverter`), so it might make sense to see if that logic is still valid and port it to v2.x.
Hello, I'm new to contributing to open-source. Can I take a shot at this?
I'd like to take this up if no one is working on it
@AnushreeBannadabhavi feel free to work on this! :blue_heart:
(the user who commented earlier has since deleted their GitHub profile)
Hello @AnushreeBannadabhavi, I ran into exactly the same requirement today: Tika does not yield the page number. Currently I have no idea how to get it properly.
The Splitter component, for example, counts `\f` characters to determine the page number. Tika, however, does not provide them; it emits `\n\n` or `\n\n\n`, but these are not specific to the end of a page. They can also appear in the middle of a page, so they are not reliable markers.
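To make the `\f` convention concrete, here is a minimal sketch (the component names are the real Haystack 2.x ones; the document content is invented) of how `DocumentSplitter(split_by="page")` relies on form feeds:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

# split_by="page" splits on the form-feed character "\f"
splitter = DocumentSplitter(split_by="page", split_length=1)
doc = Document(content="Text of page one.\fText of page two.")
result = splitter.run(documents=[doc])
print(len(result["documents"]))  # 2: one Document per page
```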
An alternative I've seen comes from this Stack Overflow thread: https://stackoverflow.com/questions/5824867/is-it-possible-to-extract-text-by-page-for-word-pdf-files-using-apache-tika

> Actually Tika does handle pages (at least in pdf) by sending elements `<div><p>` before a page starts and `</p></div>` after the page ends. You can easily set up a page count in your handler using this (just counting pages using only `<p>`):

So if we extract the content in HTML format, we can count the `</p></div>` occurrences, just like the Splitter does.
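As a toy illustration (the XHTML string here is invented, but it has the shape Tika emits for PDFs), the page boundaries are visible as page `div`s:

```python
# Invented two-page XHTML in the shape Tika returns for a PDF.
xhtml = (
    '<html><body>'
    '<div class="page"><p>Page one</p></div>'
    '<div class="page"><p>Page two</p></div>'
    '</body></html>'
)
# Counting the opening page divs is sturdier than counting "</p></div>",
# since a real page usually contains several <p> elements.
print(xhtml.count('<div class="page">'))  # 2
```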
However, in the current Haystack implementation it is not possible to request the content in HTML format. Here: https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/tika.py#L88
we can see that this parameter is not used. We could add an `xmlContent` parameter (defaulting to `False`) to `__init__()`.
In the tika-python package, this is the name of the parameter used to get the data in HTML format: https://github.com/chrismattmann/tika-python/blob/master/tika/parser.py#L64C12-L64C22
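For reference, a minimal sketch of the tika-python call (assuming a Tika server running on the default endpoint; `sample.pdf` is a placeholder):

```python
from tika import parser

# xmlContent=True asks tika-python for XHTML instead of plain text,
# so page boundaries survive as <div class="page"> elements.
parsed = parser.from_file("sample.pdf", xmlContent=True)
xhtml = parsed["content"]
print(xhtml.count('<div class="page">'))  # number of pages
```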
(For reference, I've seen many modern RAG implementations that prefer to extract and chunk text in HTML format rather than plain text, presumably because LLMs don't mind HTML and you keep the table structure?)
@lambda-science I generally agree with your idea. This 1.x code can help: https://github.com/deepset-ai/haystack/blob/883cd466bd0108ff4f6af4c389f0e42fabc1282c/haystack/nodes/file_converter/tika.py#L158-L164 It seems that the Tika parser is aware of the pages...
@anakin87 This updated version, which restores the parsing method from the Haystack 1.x integration, seems to work well: it adds `\f` to the content so the Splitter can count the markers and determine page numbers.
However, the Cleaner component removes them in every case because it uses `.strip()`, so it's a bit annoying that you can't chain Tika + Cleaner + Splitter (a sketch of that interaction follows below).
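A minimal sketch of the reported Cleaner interaction (the document content is invented; the behavior shown is the one described above, where the marker line is stripped away as an empty line):

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

# In extracted text the "\f" marker often sits on a line of its own.
doc = Document(content="Page one.\n\f\nPage two.")
cleaned = DocumentCleaner(remove_empty_lines=True).run(documents=[doc])["documents"][0]
# Per the report above, the marker line strips to "" and is dropped, so a
# downstream DocumentSplitter(split_by="page") no longer finds any boundary.
print("\f" in cleaned.content)
```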
Also, yes, there is an old bug in it: the title of the document appears in the first extracted page.
# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0
import io
from html.parser import HTMLParser
from pathlib import Path
from typing import Any, Dict, List, Optional, Union

from haystack import Document, component, logging
from haystack.components.converters.utils import get_bytestream_from_source, normalize_metadata
from haystack.dataclasses import ByteStream
from haystack.lazy_imports import LazyImport

with LazyImport("Run 'pip install tika'") as tika_import:
    from tika import parser as tika_parser

logger = logging.getLogger(__name__)


class TikaXHTMLParser(HTMLParser):
    # Use the built-in HTML parser with minimum dependencies
    def __init__(self):
        tika_import.check()
        self.ingest = True
        self.page = ""
        self.pages: List[str] = []
        super().__init__()

    def handle_starttag(self, tag, attrs):
        # find the opening <div class="page"> of a page
        pagediv = [value for attr, value in attrs if attr == "class" and value == "page"]
        if tag == "div" and pagediv:
            self.ingest = True

    def handle_endtag(self, tag):
        # close the page div (or the body for a single page without a page div),
        # save the page, and open a new one
        if (tag == "div" or tag == "body") and self.ingest:
            self.ingest = False
            # restore words hyphenated across line breaks
            self.pages.append(self.page.replace("-\n", ""))
            self.page = ""

    def handle_data(self, data):
        if self.ingest:
            self.page += data


@component
class TikaDocumentConverter:
    """
    Converts files of different types to Documents.

    This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
    requires a running Tika server.
    For more options on running Tika,
    see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).

    Usage example:
    ```python
    from datetime import datetime

    from haystack.components.converters.tika import TikaDocumentConverter

    converter = TikaDocumentConverter()
    results = converter.run(
        sources=["sample.docx", "my_document.rtf", "archive.zip"],
        meta={"date_added": datetime.now().isoformat()}
    )
    documents = results["documents"]
    print(documents[0].content)
    # 'This is a text from the docx file.'
    ```
    """

    def __init__(self, tika_url: str = "http://localhost:9998/tika"):
        """
        Create a TikaDocumentConverter component.

        :param tika_url:
            Tika server URL.
        """
        tika_import.check()
        self.tika_url = tika_url

    @component.output_types(documents=List[Document])
    def run(
        self,
        sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    ):
        """
        Converts files to Documents.

        :param sources:
            List of file paths or ByteStream objects.
        :param meta:
            Optional metadata to attach to the Documents.
            This value can be either a list of dictionaries or a single dictionary.
            If it's a single dictionary, its content is added to the metadata of all produced Documents.
            If it's a list, the length of the list must match the number of sources, because the two lists will
            be zipped.
            If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
        :returns:
            A dictionary with the following keys:
            - `documents`: Created Documents
        """
        documents = []
        meta_list = normalize_metadata(meta=meta, sources_count=len(sources))
        for source, metadata in zip(sources, meta_list):
            try:
                bytestream = get_bytestream_from_source(source)
            except Exception as e:
                logger.warning("Could not read {source}. Skipping it. Error: {error}", source=source, error=e)
                continue
            try:
                # request XHTML output so page boundaries survive as <div class="page"> elements
                parsed = tika_parser.from_buffer(
                    io.BytesIO(bytestream.data), serverEndpoint=self.tika_url, xmlContent=True
                )
                parser = TikaXHTMLParser()
                parser.feed(parsed["content"])
            except Exception as conversion_e:
                logger.warning(
                    "Failed to extract text from {source}. Skipping it. Error: {error}",
                    source=source,
                    error=conversion_e,
                )
                continue

            # Old processing code from the Haystack 1.x Tika integration.
            # The inner loop is currently a pass-through; it is kept as the hook
            # where 1.x applied optional per-line clean-up.
            cleaned_pages = []
            # TODO investigate title of document appearing in the first extracted page
            for page in parser.pages:
                lines = page.splitlines()
                cleaned_lines = []
                for line in lines:
                    cleaned_lines.append(line)
                page = "\n".join(cleaned_lines)
                cleaned_pages.append(page)
            # join pages with form feeds so DocumentSplitter(split_by="page") can split on them
            text = "\f".join(cleaned_pages)

            merged_metadata = {**bytestream.meta, **metadata}
            document = Document(content=text, meta=merged_metadata)
            documents.append(document)
        return {"documents": documents}
Proposed fix: https://github.com/deepset-ai/haystack/pull/8082
The Documents generated by `TikaDocumentConverter` from PDF files do not contain `\f` page separators, so later in the pipeline the `DocumentSplitter` cannot split them by page and produces one big "page" with all the text.
Page separation works as expected with `PyPDFToDocument`.
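A minimal sketch of the reported failure mode (the file name is a placeholder; assumes a running Tika server):

```python
from haystack.components.converters import TikaDocumentConverter
from haystack.components.preprocessors import DocumentSplitter

docs = TikaDocumentConverter().run(sources=["multi_page.pdf"])["documents"]
splits = DocumentSplitter(split_by="page", split_length=1).run(documents=docs)["documents"]
# Without "\f" separators in the content there is nothing to split on,
# so everything lands in a single "page".
print(len(splits))  # 1, even though the PDF has several pages
```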