devstein / langchain


A tool that can extract and divide sections from one or more .tex and pdf files #14

Open devstein opened 1 year ago

devstein commented 1 year ago

Feature request

I propose a tool that can extract the content of each section from one .tex file or a LaTeX project with multiple .tex files. Moreover, the tool should be able to filter out unneeded content like figure blocks, labels, and comments, and output the resulting content in the form of a Python dict such as {section_name: section_content}. With this tool, we can extract only the "introduction", "related works", and "conclusion" parts of a paper and shorten the content by filtering, which is beneficial for effective summarization.
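
For illustration, here is a minimal sketch of how such a per-section dict could be produced from a single .tex string; the helper name and the simple regex over \section commands are hypothetical, not an existing langchain API:

    import re
    from typing import Dict

    def split_tex_into_sections(tex: str) -> Dict[str, str]:
        # Hypothetical helper: map each \section title to the text that follows it.
        matches = list(re.finditer(r"\\section\{([^}]*)\}", tex))
        sections: Dict[str, str] = {}
        for i, match in enumerate(matches):
            start = match.end()
            end = matches[i + 1].start() if i + 1 < len(matches) else len(tex)
            sections[match.group(1)] = tex[start:end].strip()
        return sections

    # e.g. {"Introduction": "...", "Related Works": "...", "Conclusion": "..."}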

We can do the same for PDF files with no bookmarks based on science-parse, which can be set up as a Docker server so that we only need an API call to use it. It takes a PDF as input and outputs the metadata and the division of sections in JSON form. So I propose an API wrapper for it in order to make use of this powerful tool.

Motivation

The original langchain.text_splitter.LatexTextSplitter cannot handle multiple .tex files, nor can it filter out content that is not required for text analysis, like comments or figure blocks. Many source files downloaded from arxiv.org are compressed projects containing multiple .tex files linked together by a main.tex, so we need a way to deal with them. Moreover, some LaTeX blocks, like figures and comments, are not necessary for text analysis; filtering them out shortens the content and reduces the work for LLMs.

Moreover, when loading a PDF with no bookmarks, we cannot separate its sections and are forced to use all of the content at once. This may not be efficient in scenarios like summarization, so we need a tool that can divide a PDF file without prior input like bookmarks.

Your contribution

I want to create a PR for document_loaders so there is a way to load a LaTeX project downloaded from arxiv.org in the form of tar.gz or zip. Then I want to create a PR for text_splitter to implement the filtering and extraction for the LaTeX file(s) obtained from the document_loaders. I also want to create an API wrapper for science-parse in the same file, as another splitting function in the text_splitter that takes PDF files as input directly via pathlib.Path.

dosubot[bot] commented 1 year ago

File: langchain/text_splitter.py

To update the langchain/text_splitter.py file to complete the request, I would do the following:

  1. Add the necessary imports at the beginning of the file:

    import os
    import tarfile
    import zipfile
    from pathlib import Path
    import json
    import requests
  2. Create a new function load_latex_project to handle loading a LaTeX project from a compressed file (tar.gz or zip):

    def load_latex_project(file_path: str) -> str:
        extracted_files = []
        file_path = Path(file_path)
        # Path.suffix only returns the last suffix (".gz" for "x.tar.gz"), so check the full name
        if file_path.name.endswith(".tar.gz"):
            with tarfile.open(file_path, "r:gz") as tar:
                tar.extractall()
                extracted_files = tar.getnames()
        elif file_path.suffix == ".zip":
            with zipfile.ZipFile(file_path, "r") as zip_ref:
                zip_ref.extractall()
                extracted_files = zip_ref.namelist()
        else:
            raise ValueError("Unsupported file format. Please provide a .tar.gz or .zip file.")

        # Read the main.tex that links the project's .tex files together
        main_tex = ""
        for file in extracted_files:
            if "main.tex" in file:
                with open(file, "r") as f:
                    main_tex = f.read()
                break

        if not main_tex:
            raise ValueError("main.tex not found in the provided LaTeX project.")

        return main_tex
  3. Update the LatexTextSplitter class to include the new functionality for filtering and extraction:

    class LatexTextSplitter(RecursiveCharacterTextSplitter):
        ...
        def split_text(self, text: str) -> List[str]:
            # Load main.tex from the compressed LaTeX project (text is the archive path here)
            main_tex = load_latex_project(text)
            # Filter out unnecessary content (e.g., comments, figure blocks)
            filtered_tex = filter_latex_content(main_tex)
            # Call the parent class's split_text method with the filtered content
            return super().split_text(filtered_tex)
  4. Create a new function filter_latex_content to filter out unnecessary content from the LaTeX file (a sketch of one possible implementation is given after this list):

    def filter_latex_content(text: str) -> str:
        # Add code to filter out comments, figure blocks, and other unnecessary content
        ...
        return filtered_text
  5. Create a new class ScienceParseAPIWrapper to wrap the science-parse API:

    class ScienceParseAPIWrapper:
        def __init__(self, api_url: str):
            self.api_url = api_url

        def parse_pdf(self, pdf_path: str) -> dict:
            # POST the PDF to the science-parse server and return the parsed JSON
            with open(pdf_path, "rb") as pdf_file:
                response = requests.post(self.api_url, files={"file": pdf_file})
            if response.status_code == 200:
                return json.loads(response.text)
            else:
                raise ValueError(f"Error in parsing PDF: {response.text}")
  6. Update the TextSplitter class to include a new method split_pdf that uses the ScienceParseAPIWrapper (the section-extraction step is sketched after this list):

    class TextSplitter(BaseDocumentTransformer, ABC):
        ...
        def split_pdf(self, pdf_path: str, api_url: str) -> List[str]:
            science_parse_api = ScienceParseAPIWrapper(api_url)
            parsed_data = science_parse_api.parse_pdf(pdf_path)
            # Extract sections from the parsed_data and return them as a list
            ...
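
As noted in step 4, here is a minimal sketch of what filter_latex_content might look like, assuming regex-based stripping of comments, figure environments, and labels covers the filtering described above; the exact patterns are illustrative only:

    import re

    def filter_latex_content(text: str) -> str:
        # Drop unescaped % comments up to the end of the line.
        filtered_text = re.sub(r"(?<!\\)%.*", "", text)
        # Drop figure environments (including figure*) entirely.
        filtered_text = re.sub(
            r"\\begin\{figure\*?\}.*?\\end\{figure\*?\}",
            "",
            filtered_text,
            flags=re.DOTALL,
        )
        # Drop \label{...} commands.
        filtered_text = re.sub(r"\\label\{[^}]*\}", "", filtered_text)
        return filtered_text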
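
And for the section-extraction step left open in split_pdf (step 6), a hedged sketch that assumes the science-parse response contains a sections list with heading and text fields; the field names should be verified against the actual science-parse JSON output:

    from typing import List

    def extract_sections(parsed_data: dict) -> List[str]:
        # Assumed response shape: {"sections": [{"heading": ..., "text": ...}, ...]}
        sections = []
        for section in parsed_data.get("sections", []):
            heading = section.get("heading") or ""
            body = section.get("text") or ""
            sections.append(f"{heading}\n{body}".strip())
        return sections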

With these changes, the langchain/text_splitter.py file will be able to handle LaTeX projects with multiple .tex files, filter out unnecessary content, and split PDF files using the science-parse API.