deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Stuck in processing stage #4911

Closed lipsa7 closed 1 year ago

lipsa7 commented 1 year ago

I'm stuck in pre-processing stage. Can someone please help?

from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor

converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_pdf = converter.convert(file_path="/content/SPI_Electrification_15.pdf", meta=None)[0]

This is the error I'm getting: MissingOptionalDependency: Optional dependency 'langdetect' was used but it isn't installed.

anakin87 commented 1 year ago

Hello @lipsa7!

The library langdetect is necessary to determine the language of the documents.

You can install it in two alternative ways:

Does this solve your problem?
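
One way to confirm that an install reached the right environment is to ask the interpreter that runs Haystack which distributions it can see. The sketch below is only an illustration: the helper name `missing_distributions` is hypothetical and not part of Haystack, and it relies solely on the standard library's `importlib.metadata`, which reads installed package metadata instead of importing the module.

```python
import sys
from importlib import metadata


def missing_distributions(names):
    """Return the distribution names that are not installed for this interpreter.

    Hypothetical helper for illustration; not part of Haystack.
    """
    missing = []
    for name in names:
        try:
            # Looks up installed package metadata; does not import the package.
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Shows which Python executable (and therefore which environment) is in use.
    print("Interpreter:", sys.executable)
    print("Missing:", missing_distributions(["langdetect"]))
```

If `langdetect` is still reported as missing after installing it, the package was probably installed for a different interpreter or virtual environment than the one printed on the first line.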

lipsa7 commented 1 year ago

Hi, thanks for your answer. I installed langdetect, but that didn't solve it. I read somewhere that the issue is with Colab, so I switched to VS Code. Facing different issues now :D

s-m-arafat commented 1 year ago

Hi, I am also facing the same issue. A few days ago it was working fine, but the issue started after I created a new venv. After installing langdetect, it asks me to install docx, then azure, and so on. I am using VS Code and a Flask server.

lipsa7 commented 1 year ago

https://github.com/deepset-ai/haystack/discussions/4930#discussioncomment-5928273

I have posted the code I used there, and it works. You can check it.

s-m-arafat commented 1 year ago

> #4930 (reply in thread)
>
> I have posted the code I used and this works. You can check this.

Thanks for your reply. I ran pip install farm-haystack and then started a server using Flask, but it crashes saying langdetect isn't installed. Here is my code:

from flask import Flask, request, jsonify, send_file
from flask_cors import CORS
from haystack.utils import convert_files_to_docs
from haystack.nodes import PreProcessor, BM25Retriever, FARMReader
from multiprocessing import freeze_support
from haystack.document_stores import InMemoryDocumentStore
from haystack import Pipeline
from haystack.pipelines import ExtractiveQAPipeline
from haystack.nodes import SentenceTransformersRanker
import pickle
# from pdf2image import convert_from_path
import os

app = Flask(__name__)
CORS(app)

@app.route("/trainModel")
def trainModel():
    freeze_support()
    ...  # code omitted by the poster (presumably builds all_docs)
    # process docs
    processed_docs = PreProcessor(
        clean_empty_lines=True,
        clean_whitespace=True,
        split_by="sentence",
        split_length=5,
        add_page_number=True,
        split_respect_sentence_boundary=False,  # NotImplementedError: 'split_respect_sentence_boundary=True' is only compatible with split_by='word'.
    ).process(all_docs)

@app.route("/ref")
def send_ref():
    ...  # code omitted by the poster

if __name__ == "__main__":
    app.run(debug=True, port=8080)

Output in the terminal:

 * Serving Flask app 'app'
 * Debug mode: on
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:8080
Press CTRL+C to quit
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 493-912-585
Traceback (most recent call last):
  File "D:\Works\OfficialProjects\DocsML\server\app.py", line 147, in <module>
    app.run(debug=True, port=8080)
  File "D:\Installed\Py 3.10\lib\site-packages\flask\app.py", line 889, in run
    run_simple(t.cast(str, host), port, self, **options)
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\serving.py", line 1097, in run_simple
    run_with_reloader(
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\_reloader.py", line 452, in run_with_reloader
    with reloader:
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\_reloader.py", line 292, in __enter__
    return super().__enter__()
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\_reloader.py", line 243, in __enter__
    self.run_step()
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\_reloader.py", line 295, in run_step
    for name in _find_stat_paths(self.extra_files, self.exclude_patterns):
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\_reloader.py", line 114, in _find_stat_paths
    paths.update(_iter_module_paths())
  File "D:\Installed\Py 3.10\lib\site-packages\werkzeug\_reloader.py", line 46, in _iter_module_paths
    if name is None or name.startswith(_ignore_always):
  File "D:\Installed\Py 3.10\lib\site-packages\generalimport\fake_module.py", line 19, in error_func
    raise MissingOptionalDependency(f"Optional dependency {name} was used but it isn't installed.")
generalimport.exception.MissingOptionalDependency: Optional dependency 'langdetect' was used but it isn't installed.
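
The traceback above is raised inside werkzeug's reloader (`_reloader.py`, `_iter_module_paths`) and not in the application code, so one thing that may be worth trying is starting the development server without the reloader. This is a minimal sketch under that assumption, not a confirmed fix: the `/health` route is hypothetical, and `use_reloader=False` is the standard option that `app.run` forwards to werkzeug's `run_simple`.

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/health")  # hypothetical route, for illustration only
def health():
    return jsonify(status="ok")


if __name__ == "__main__":
    # use_reloader=False keeps the debugger but skips werkzeug's stat reloader,
    # the component that appears in the traceback.
    app.run(debug=True, port=8080, use_reloader=False)
```

The debugger stays active with this setting, but code changes no longer restart the server automatically. Installing the missing optional dependencies into the active environment is still the more direct remedy.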

anakin87 commented 1 year ago

@s-m-arafat have you tried these solutions?

https://github.com/deepset-ai/haystack/issues/4911#issuecomment-1546918282