Stuck in processing stage

lipsa7 commented 1 year ago

I'm stuck at the pre-processing stage. Can someone please help?

from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor

converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_pdf = converter.convert(file_path="/content/SPI_Electrification_15.pdf", meta=None)[0]

This is the error I'm getting: MissingOptionalDependency: Optional dependency 'langdetect' was used but it isn't installed.

anakin87 commented 1 year ago

Hello @lipsa7!

The library langdetect is necessary to determine the language of the documents.

You can install it in two alternative ways:

pip install farm-haystack[preprocessing]
or
pip install langdetect

Does this solve your problem?
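
One way to confirm that the install landed in the environment that actually runs the code is to ask the interpreter itself. The sketch below is only illustrative: `environment_report` is a hypothetical helper name, and it assumes you run it with the same interpreter (Colab kernel, venv, or system Python) that raises the error.

```python
import importlib.util
import sys


def environment_report(module_name: str) -> dict:
    """Describe the running interpreter and whether it can locate module_name."""
    return {
        "interpreter": sys.executable,
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "found": importlib.util.find_spec(module_name) is not None,
    }


for key, value in environment_report("langdetect").items():
    print(f"{key}: {value}")
```

If `found` is False, installing through that same interpreter (for example `python -m pip install langdetect`) targets the environment the code runs in.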

lipsa7 commented 1 year ago

Hi, thanks for your answer. I installed langdetect, but that didn't solve it. I read somewhere that the issue is with Colab, so I switched to VS Code. I'm facing different issues now :D

s-m-arafat commented 1 year ago

Hi, I am also facing the same issue. A few days ago it was working fine, but the issue started after I recently created a venv. After installing langdetect, it asks me to install docx, then azure, and so on. I am using VS Code and a Flask server.
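
When the errors arrive one package at a time like this, it can help to list everything that is missing from the environment in one pass. This is a rough sketch using the standard library's `importlib.metadata`; `missing_distributions` is a hypothetical helper, and the distribution names in `candidates` are assumptions based on the modules mentioned above, so adjust them to whatever the error messages ask for.

```python
from importlib import metadata


def missing_distributions(names):
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Assumed distribution names, not taken from the Haystack docs.
candidates = ["langdetect", "python-docx", "azure-ai-formrecognizer"]
print("Missing:", missing_distributions(candidates))
```

Run inside the freshly created venv, this shows which packages did not carry over from the previous setup.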

lipsa7 commented 1 year ago

https://github.com/deepset-ai/haystack/discussions/4930#discussioncomment-5928273

I have posted the code I used, and it works. You can check it there.

s-m-arafat commented 1 year ago

Thanks for your reply. I ran pip install farm-haystack, then started a server using Flask, but it crashes saying langdetect isn't installed. Here is my code:

from flask import Flask, request, jsonify, send_file
from flask_cors import CORS
from haystack.utils import convert_files_to_docs
from haystack.nodes import PreProcessor, BM25Retriever, FARMReader
from multiprocessing import freeze_support
from haystack.document_stores import InMemoryDocumentStore
from haystack import Pipeline
from haystack.pipelines import ExtractiveQAPipeline
from haystack.nodes import SentenceTransformersRanker
import pickle
# from pdf2image import convert_from_path
import os

app = Flask(__name__)
CORS(app)

@app.route("/trainModel")
def trainModel():
    freeze_support()
    # ... (code omitted) ...
        # process docs
        processed_docs = PreProcessor(
            clean_empty_lines=True,
            clean_whitespace=True,
            split_by="sentence",
            split_length=5,
            add_page_number=True,
            split_respect_sentence_boundary=False,  # NotImplementedError: 'split_respect_sentence_boundary=True' is only compatible with split_by='word'.
        ).process(all_docs)

@app.route("/ref")
def send_ref():
    # ... (code omitted) ...

if __name__ == "__main__":
    app.run(debug=True, port=8080)

Terminal output:

 * Serving Flask app 'app'
 * Debug mode: on
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:8080
Press CTRL+C to quit
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 493-912-585
Traceback (most recent call last):
  File "D:\Works\OfficialProjects\DocsML\server\app.py", line 147, in <module>
    app.run(debug=True, port=8080)
  File "D:\Installed\Py 3.10\lib\site-packages\flask\app.py", line 889, in run
    run_simple(t.cast(str, host), port, self, **options)
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\serving.py", line 1097, in run_simple
    run_with_reloader(
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\_reloader.py", line 452, in run_with_reloader
    with reloader:
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\_reloader.py", line 292, in __enter__
    return super().__enter__()
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\_reloader.py", line 243, in __enter__
    self.run_step()
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\_reloader.py", line 295, in run_step
    for name in _find_stat_paths(self.extra_files, self.exclude_patterns):
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\_reloader.py", line 114, in _find_stat_paths
    paths.update(_iter_module_paths())
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\_reloader.py", line 46, in _iter_module_paths
    if name is None or name.startswith(_ignore_always):
  File "D:\Installed\Py 3.10\lib\site-packages\generalimport\fake_module.py", line 19, in error_func
    raise MissingOptionalDependency(f"Optional dependency {name} was used but it isn't installed.")
generalimport.exception.MissingOptionalDependency: Optional dependency 'langdetect' was used but it isn't installed.
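
Reading the traceback, the crash seems to happen in Werkzeug's reloader rather than in the route code: while scanning loaded modules for files to watch, it touches the placeholder that `generalimport` registered for the missing `langdetect`. The sketch below imitates that mechanism with stand-ins; `FakeModule`, `MissingDependencyError`, and `iter_module_files` are hypothetical names and a simplification, not the real `generalimport` or Werkzeug code.

```python
import types


class MissingDependencyError(Exception):
    """Stand-in for generalimport's MissingOptionalDependency."""


class FakeModule(types.ModuleType):
    """Placeholder module: looking up any attribute it lacks raises an error."""

    def __getattr__(self, item):
        raise MissingDependencyError(
            f"Optional dependency '{self.__name__}' was used but it isn't installed."
        )


def iter_module_files(modules):
    """Yield each module's __file__, roughly as a file-watching reloader would."""
    for module in modules.values():
        # getattr's default only covers AttributeError, so the placeholder's
        # custom exception propagates out of this loop.
        name = getattr(module, "__file__", None)
        if name is not None:
            yield name


real = types.ModuleType("real_module")
real.__file__ = "/tmp/real_module.py"
modules = {"real_module": real, "langdetect": FakeModule("langdetect")}

try:
    list(iter_module_files(modules))
    message = None
except MissingDependencyError as exc:
    message = str(exc)

print(message)
```

If that reading is right, installing the missing package into the interpreter that runs the server should remove the placeholder, and passing `use_reloader=False` to `app.run` may sidestep the module scan in the meantime.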

anakin87 commented 1 year ago

@s-m-arafat have you tried these solutions?

https://github.com/deepset-ai/haystack/issues/4911#issuecomment-1546918282
