Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

Unstructured not recognizing detectron installation? #718

Closed eRuaro closed 1 year ago

eRuaro commented 1 year ago

To Reproduce: I was able to use it two months ago to parse scanned PDFs, but since I rebuilt my Docker container it keeps using pdfminer instead. Now, whenever I try to parse a scanned PDF, loader.load_and_split() returns an empty list.

Here's my Dockerfile:

FROM python:3.9-slim-buster

# Update package lists and install system dependencies
RUN apt-get update && apt-get install ffmpeg libsm6 libxext6 gcc g++ git build-essential libpoppler-cpp-dev libmagic-dev pkg-config poppler-utils tesseract-ocr libtesseract-dev -y

# Make working directories
RUN mkdir -p /app
WORKDIR /app

# Copy the requirements.txt file to the container
COPY requirements.txt .

# Install dependencies
RUN pip install --upgrade pip

RUN pip install torch torchvision torchaudio

RUN pip install unstructured-inference

RUN pip install -r requirements.txt

RUN pip install 'git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2'

# Copy the .env file to the container
COPY .env .

# Copy every file in the source folder to the created working directory
COPY . .

# Expose the port that the application will run on
EXPOSE 8080

# Start the application
CMD ["python3.9", "-m", "uvicorn", "main:app", "--proxy-headers", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]

Here's the code segment that uses Unstructured:

@app.post("/document/index/scanned")
async def index_scanned_document(document: Document):
    try:
        loader = OnlinePDFLoader(document.pdf_url)
        recursive_text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
            chunk_size=2000,
            chunk_overlap=100,
        )
        data = loader.load_and_split(text_splitter=recursive_text_splitter)

        if (len(data) == 0):
            raise Exception("No texts found")

        embeddings = OpenAIEmbeddings()

        text_data = [d.page_content for d in data]
        text_metadata = [{"source": f"{i}-pl"} for i in range(len(data))]

        db = PGVector.from_texts(
            texts=text_data,
            embedding=embeddings,
            collection_name=document.user_id + "/scanned/" + document.pdf_title,
            connection_string=connection_string,
            distance_strategy=DistanceStrategy.COSINE,
            metadatas=text_metadata,
            pre_delete_collection=False
        )
    except Exception as e:
        raise HTTPException(status_code=404, detail={
            "message": "Failed to index document",
            "error": str(e)
        })
    else:
        return {
            "message": "Document indexed successfully",
        }

Expected behavior: Unstructured uses detectron2 rather than pdfminer.

Additional context: I'm using LangChain, which uses unstructured under the hood: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/pdf.html#using-unstructured

MthwRobinson commented 1 year ago

@eRuaro - As of 0.7.0, detectron2 is installed using the ONNX runtime to eliminate the need to install detectron2 from source. If you're using a version more recent than that, you shouldn't need the detectron2 installation step in your Dockerfile any longer. cc @qued @benjats07

Check out this comment on the other issue you posted; I think that's likely the root of the pdfminer behavior you're seeing.
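
One way to confirm which side of the 0.7.0 cutoff a rebuilt container landed on is to inspect the installed packages from inside it. The snippet below is a minimal sketch, not part of unstructured itself: the helper names (`parse_version`, `detectron2_source_install_needed`, `report`) are hypothetical, the version parsing is deliberately simplistic, and the list of modules it probes is an assumption about what is relevant here.

```python
import re
from importlib import metadata
from importlib.util import find_spec

# Per the comment above, 0.7.0 is the release from which detectron2 no
# longer needs to be built from source.
ONNX_CUTOFF = (0, 7, 0)


def parse_version(version: str) -> tuple:
    """Turn a string like '0.7.12' into (0, 7, 12); suffixes such as 'dev1' are ignored."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    # Pad short versions such as '0.10' so tuple comparison behaves.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def detectron2_source_install_needed(version: str) -> bool:
    """Hypothetical helper: True when the version predates the 0.7.0 cutoff."""
    return parse_version(version) < ONNX_CUTOFF


def report() -> None:
    """Print the installed unstructured version and which related modules are importable."""
    try:
        installed = metadata.version("unstructured")
    except metadata.PackageNotFoundError:
        print("unstructured is not installed")
        return
    print(f"unstructured {installed}")
    print("source-built detectron2 needed:", detectron2_source_install_needed(installed))
    # Assumed module names; find_spec only checks availability, it does not import.
    for module in ("detectron2", "onnxruntime", "unstructured_inference"):
        print(f"{module} importable:", find_spec(module) is not None)


if __name__ == "__main__":
    report()
```

Running this inside the container (for example with `docker run` and `python3.9` on the script) shows whether the `pip install 'git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2'` step is still relevant for the version that `requirements.txt` resolved to.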