deepset-ai / haystack

:mag: LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

TikaDocumentConverter does not split content by page #7949

Closed: vaclcer closed this 1 month ago

vaclcer commented 2 months ago

The Documents generated by `TikaDocumentConverter` from PDF files do not contain `\f` page separators, so later in the pipeline `DocumentSplitter` cannot split them by page and produces one big "page" containing all the text.

The page separation works as expected when using `PyPDFToDocument`.
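A minimal reproduction sketch (names like `sample.pdf` are placeholders; this assumes a Tika server running at the default `http://localhost:9998/tika`):

```python
from haystack.components.converters import TikaDocumentConverter
from haystack.components.preprocessors import DocumentSplitter

converter = TikaDocumentConverter()  # talks to the local Tika server
docs = converter.run(sources=["sample.pdf"])["documents"]

splitter = DocumentSplitter(split_by="page", split_length=1)
split_docs = splitter.run(documents=docs)["documents"]

# Expected: one Document per PDF page.
# Actual: a single Document, because the Tika output has no "\f" separators.
print(len(split_docs))
```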

anakin87 commented 2 months ago

Thanks, @vaclcer!

For context, this converter split documents by page in v1.x (as `TikaConverter`), so it might make sense to check whether that logic is still valid and port it to v2.x.

ghost commented 1 month ago

Hello, I'm new to contributing to open-source. Can I take a shot at this?

AnushreeBannadabhavi commented 1 month ago

I'd like to take this up if no one is working on it

anakin87 commented 1 month ago

@AnushreeBannadabhavi feel free to work on this! :blue_heart:

(the user who commented earlier has since removed their GitHub profile)

lambda-science commented 1 month ago

> I'd like to take this up if no one is working on it

Hello @AnushreeBannadabhavi, I ran into the exact same requirement today: Tika does not yield the page number. Currently I see no clean way to get it. The Splitter component, for example, counts `\f` characters to determine page numbers.

Tika, however, does not provide `\f`. It emits `\n\n` or `\n\n\n`, but those are not specific to page breaks; they can also appear in the middle of a page, so they are not reliable.
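To illustrate the dependency on `\f`, a quick sketch with a hand-built Document (no Tika involved):

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

splitter = DocumentSplitter(split_by="page", split_length=1)
result = splitter.run(documents=[Document(content="First page\fSecond page")])

# Two Documents, because "page" splitting keys on "\f";
# Tika output without "\f" stays as one big "page".
print(len(result["documents"]))  # 2
```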

An alternative I've seen comes from this Stack Overflow answer: https://stackoverflow.com/questions/5824867/is-it-possible-to-extract-text-by-page-for-word-pdf-files-using-apache-tika. Tika actually is aware of pages (at least for PDFs): in its XHTML output it emits `<div><p>` before a page starts and `</p></div>` after it ends. You can easily keep a page count in your handler using this. So if we extract the content as HTML, we can count the `</p></div>` closings, just as the Splitter counts `\f`.

However, the current Haystack implementation offers no way to request the content as HTML. Here: https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/tika.py#L88 we can see that this parameter is not used. Maybe we should add an `xmlContent=False` (by default) parameter to `__init__()`. In the Tika Python package, `xmlContent` is the name of the parameter used to get the data in HTML format: https://github.com/chrismattmann/tika-python/blob/master/tika/parser.py#L64C12-L64C22
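A rough sketch of that idea with the `tika` package (assuming a local Tika server; `sample.pdf` is a placeholder):

```python
from tika import parser

# xmlContent=True returns Tika's XHTML output instead of plain text
parsed = parser.from_file("sample.pdf", xmlContent=True)
xhtml = parsed["content"]

# For PDFs, Tika wraps each page in <div class="page">...</div>,
# so the page count can be recovered from the markup
print(xhtml.count('<div class="page">'))
```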

(For reference, I've seen many modern RAG implementations that prefer to extract and chunk text as HTML rather than plain text, presumably because LLMs don't mind HTML and you keep table structure?)

anakin87 commented 1 month ago

@lambda-science I generally agree with your idea. This 1.x code can help: https://github.com/deepset-ai/haystack/blob/883cd466bd0108ff4f6af4c389f0e42fabc1282c/haystack/nodes/file_converter/tika.py#L158-L164

It seems that the Tika parser is aware of the pages...

lambda-science commented 1 month ago

@anakin87 This updated version, using the parsing method from Haystack 1.x, seems to work well: it adds `\f` to the content so the Splitter can count pages. However, the Cleaner component removes the `\f` characters in every configuration because it uses `.strip()`, which is a bit annoying: you can't chain Tika + Cleaner + Splitter.
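The root cause is that Python treats `\f` as whitespace, so any `strip()`-based cleaning silently drops it:

```python
print("\f".isspace())          # True: form feed counts as whitespace
print("\fFirst page".strip())  # 'First page': the leading "\f" is gone
```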

Also, yes, the old bug is still in it: the document title appears in the first extracted page.

````python
# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0

import io
from pathlib import Path
from typing import Any, Dict, List, Optional, Union
from html.parser import HTMLParser

from haystack import Document, component, logging
from haystack.components.converters.utils import get_bytestream_from_source, normalize_metadata
from haystack.dataclasses import ByteStream
from haystack.lazy_imports import LazyImport

with LazyImport("Run 'pip install tika'") as tika_import:
    from tika import parser as tika_parser

logger = logging.getLogger(__name__)

class TikaXHTMLParser(HTMLParser):
    # Use the built-in HTML parser with minimum dependencies
    def __init__(self):
        tika_import.check()
        self.ingest = True
        self.page = ""
        self.pages: List[str] = []
        super(TikaXHTMLParser, self).__init__()

    def handle_starttag(self, tag, attrs):
        # find page div
        pagediv = [value for attr, value in attrs if attr == "class" and value == "page"]
        if tag == "div" and pagediv:
            self.ingest = True

    def handle_endtag(self, tag):
        # close page div, or a single page without page div, save page and open a new page
        if (tag == "div" or tag == "body") and self.ingest:
            self.ingest = False
            # restore words hyphened to the next line
            self.pages.append(self.page.replace("-\n", ""))
            self.page = ""

    def handle_data(self, data):
        if self.ingest:
            self.page += data

@component
class TikaDocumentConverter:
    """
    Converts files of different types to Documents.

    This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
    requires a running Tika server.
    For more options on running Tika,
    see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).

    Usage example:
    ```python
    from datetime import datetime

    from haystack.components.converters.tika import TikaDocumentConverter

    converter = TikaDocumentConverter()
    results = converter.run(
        sources=["sample.docx", "my_document.rtf", "archive.zip"],
        meta={"date_added": datetime.now().isoformat()}
    )
    documents = results["documents"]
    print(documents[0].content)
    # 'This is a text from the docx file.'
"""

    def __init__(self, tika_url: str = "http://localhost:9998/tika"):
        """
        Create a TikaDocumentConverter component.

        :param tika_url:
            Tika server URL.
        """
        tika_import.check()
        self.tika_url = tika_url

    @component.output_types(documents=List[Document])
    def run(
        self,
        sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    ):
        """
        Converts files to Documents.

        :param sources:
            List of file paths or ByteStream objects.
        :param meta:
            Optional metadata to attach to the Documents.
            This value can be either a list of dictionaries or a single dictionary.
            If it's a single dictionary, its content is added to the metadata of all produced Documents.
            If it's a list, the length of the list must match the number of sources, because the two lists will
            be zipped.
            If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

        :returns:
            A dictionary with the following keys:
            - `documents`: Created Documents
        """
        documents = []
        meta_list = normalize_metadata(meta=meta, sources_count=len(sources))

        for source, metadata in zip(sources, meta_list):
            try:
                bytestream = get_bytestream_from_source(source)
            except Exception as e:
                logger.warning("Could not read {source}. Skipping it. Error: {error}", source=source, error=e)
                continue
            try:
                # Request XHTML output so page boundaries survive as <div class="page"> elements
                parsed = tika_parser.from_buffer(
                    io.BytesIO(bytestream.data), serverEndpoint=self.tika_url, xmlContent=True
                )
                parser = TikaXHTMLParser()
                parser.feed(parsed["content"])
            except Exception as conversion_e:
                logger.warning(
                    "Failed to extract text from {source}. Skipping it. Error: {error}",
                    source=source,
                    error=conversion_e,
                )
                continue

            # Old processing code from the Haystack 1.x Tika integration
            cleaned_pages = []
            # TODO investigate title of document appearing in the first extracted page
            for page in parser.pages:
                lines = page.splitlines()
                cleaned_lines = []
                for line in lines:
                    # placeholder for the 1.x per-line cleaning rules (dropped in this port)
                    cleaned_lines.append(line)

                page = "\n".join(cleaned_lines)
                cleaned_pages.append(page)
            # Join pages with "\f" so DocumentSplitter(split_by="page") can split on them
            text = "\f".join(cleaned_pages)
            merged_metadata = {**bytestream.meta, **metadata}
            document = Document(content=text, meta=merged_metadata)
            documents.append(document)
        return {"documents": documents}
````

anakin87 commented 1 month ago

related: https://github.com/deepset-ai/haystack/issues/8053

lambda-science commented 1 month ago

Proposed fix: https://github.com/deepset-ai/haystack/pull/8082