deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.86k stars 592 forks

Textract extracts different text according to the OS it operates on #423

Open bennnym opened 2 years ago

bennnym commented 2 years ago

Describe the bug I work locally on a Mac, where a simple test against a sample PDF passes, but the same test fails in a Docker container.

To Reproduce Steps to reproduce the behaviour:

  1. Download the PDF from here to the root of your project folder.
  2. Name the PDF "sample.pdf".
  3. Add the following dependencies to a requirements.txt file (in the root) and install them with `pip install -r requirements.txt`:
    textract==1.6.5
    pytest==6.2.4
  4. Create a test in a file named test_textract.py, also placed in the root of your project:
    
    import textract


    class TestTextract:
        def test_occorunces_of_string(self):
            path = "sample_sectionals.pdf"
            extracted_text = textract.process(path, method="pdfminer").decode("utf-8")
            while "\n" in extracted_text or "\n" in extracted_text:
                extracted_text = extracted_text.replace("\n", " ")
                extracted_text = extracted_text.replace("\n", " ")

  5. Run the test with the command: `pytest -s -vv`
  6. Add a Dockerfile in the root with the following content:
```docker
FROM        python:3.8.13-slim-buster

RUN         pip install --upgrade pip

COPY        . /opt/code

RUN         pip install textract==1.6.5 pytest==6.2.4

WORKDIR     /opt/code

ENTRYPOINT  [ "/bin/sh" ]
```
  7. Build the image: `docker build . -t test`
  8. Run the test in the container: `docker run test -c "pytest -s -vv test_textract.py"`
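
To compare the two environments directly, it may help to print a small fingerprint of the extracted text in each one and diff the results. The following is only a sketch: `describe_text` is a hypothetical helper (not part of textract), and it assumes the text has already been obtained and decoded via `textract.process(...)` as in the test above. The focus on whitespace and non-ASCII characters is a guess about where the outputs diverge, not something this report confirms.

```python
import hashlib
import platform
import sys


def describe_text(text):
    """Summarise extracted text so two environments can be compared.

    Hypothetical helper for illustration only; not part of textract.
    """
    return {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
        "length": len(text),
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        # Assumption: differences are most likely in whitespace or non-ASCII characters.
        "whitespace": sorted({repr(ch) for ch in text if ch.isspace()}),
        "non_ascii": sorted({ch for ch in text if ord(ch) > 127}),
    }


if __name__ == "__main__":
    # Made-up stand-in string; in practice pass the decoded textract output.
    sample = "Section 1\r\nSection\u00a02\x0c"
    for key, value in describe_text(sample).items():
        print(f"{key}: {value}")
```

Running the same script on the Mac and inside the container, then comparing the `sha256`, `whitespace`, and `non_ascii` entries, should show whether the two extractions differ at all and, if so, roughly where.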

Expected behaviour The test should pass in both environments. Currently it passes locally and fails in the Docker container (with zero occurrences).
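
Since the failure shows up as zero occurrences, one thing worth trying is a count that ignores whitespace differences entirely. This is a minimal sketch under the assumption that the two environments differ only in line breaks or spacing, which this report does not confirm. `normalize_whitespace` and `count_occurrences` are hypothetical names, and the sample strings are made up.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, newlines, tabs, form feeds) to one space."""
    # str.split() with no arguments splits on any whitespace run and drops empty pieces.
    return " ".join(text.split())


def count_occurrences(text, phrase):
    """Count non-overlapping occurrences of phrase, ignoring whitespace differences."""
    return normalize_whitespace(text).count(normalize_whitespace(phrase))


if __name__ == "__main__":
    # Two made-up variants of the same content with different whitespace.
    variant_a = "Total  amount\ndue"
    variant_b = "Total amount\r\n\r\ndue\x0c"
    print(count_occurrences(variant_a, "Total amount due"))
    print(count_occurrences(variant_b, "Total amount due"))
```

If the count still differs between environments after this normalization, the discrepancy is probably in the extracted characters themselves rather than in spacing.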


Desktop (please complete the following information):