deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Haystack and FastAPI in Colab #1725

Closed shainaraza closed 2 years ago

shainaraza commented 3 years ago

Question: How to use FastAPI and Haystack with Colab?


I have this piece of code, and I am unable to run Haystack on Colab. There is no syntax error, but FastAPI does not pick up the data from the pipeline. Any advice?

```python
!pip install fastapi nest-asyncio pyngrok uvicorn
!pip install git+https://github.com/deepset-ai/haystack.git

# In Colab / no-Docker environments: start Elasticsearch from source
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ['elasticsearch-7.9.2/bin/elasticsearch'],
    stdout=PIPE, stderr=STDOUT,
    preexec_fn=lambda: os.setuid(1)  # run as the daemon user
)

# Wait until Elasticsearch has started
!sleep 15

import nest_asyncio
import uvicorn
from fastapi import FastAPI
# On pre-1.0 Haystack the import path is haystack.document_store.elasticsearch
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.reader.farm import FARMReader
from haystack.pipeline import ExtractiveQAPipeline

# Initialize doc store, retriever and reader components
DOC_STORE = ElasticsearchDocumentStore(
    host='localhost', username='', password='', index='aurelius'
)
RETRIEVER = ElasticsearchRetriever(DOC_STORE)
READER = FARMReader(
    model_name_or_path='deepset/bert-base-cased-squad2',
    context_window_size=1500,
    use_gpu=True
)

# Initialize pipeline
PIPELINE = ExtractiveQAPipeline(reader=READER, retriever=RETRIEVER)

# Initialize API
APP = FastAPI()

@APP.get('/query')
async def get_query(q: str, retriever_limit: int = 10, reader_limit: int = 3):
    """Makes query to doc store via Haystack pipeline.

    :param q: Query string representing the question being asked.
    :type q: str
    """
    # get answers
    return PIPELINE.run(query=q,
                        top_k_retriever=retriever_limit,
                        top_k_reader=reader_limit)

from pyngrok import ngrok

# Terminate open tunnels if any exist
ngrok.kill()

# Setting the authtoken (optional);
# get your authtoken from https://dashboard.ngrok.com/auth
ngrok.set_auth_token(NGROK_AUTH_TOKEN)

# Tunnel the port uvicorn serves on (8000 by default);
# tunnelling 9200 would expose Elasticsearch instead of the API
ngrok_tunnel = ngrok.connect(8000)
print('Public URL:', ngrok_tunnel.public_url)
nest_asyncio.apply()
uvicorn.run(APP)
```

Link to Colab notebook https://colab.research.google.com/drive/191cyC5eXajgekBwJKs4hKmAiC_WmHuQ_?usp=sharing

brandenchan commented 3 years ago

Hi @shainaraza, in the Colab notebook that you provide, I don't see any line that handles writing documents into the document store. Our recommendation is that you index your documents using Haystack via the document_store.write_documents(docs) method. If you have an existing Elasticsearch database that you would like to use with Haystack, you will have to ensure that the fields in ES are named in a specific way.
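As a side note, the shape write_documents consumes is a list of plain dicts: the searchable text under a fixed key plus optional metadata. A minimal sketch with made-up example documents (the key name depends on the Haystack version, "text" before 1.0 and "content" from 1.0 on, and the write call itself needs a live document store, so it is shown commented out):

```python
# Sketch of the document shape write_documents() expects.
# Pre-1.0 Haystack reads the searchable text from the "text" key;
# Haystack 1.x renamed the field to "content".
docs = [
    {"text": "Marcus Aurelius was a Roman emperor.",
     "meta": {"name": "aurelius.txt"}},
    {"text": "Haystack pipelines connect retrievers and readers.",
     "meta": {"name": "haystack.txt"}},
]

# Every document must carry the field the retriever searches over.
assert all("text" in d for d in docs)

# With a document store available, indexing is a single call:
# DOC_STORE.write_documents(docs)
```

The retriever then searches over that field, which is why an externally created Elasticsearch index must use the same field names Haystack expects.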

ZanSara commented 2 years ago

Hello @shainaraza, did you find a solution to your problem in the end? If so, please let us know :slightly_smiling_face:

shainaraza commented 2 years ago

Yes @ZanSara, I found a solution; I will update this thread.

shainaraza commented 2 years ago

Colab was blocking the API address, so I used ngrok to get a public address from Colab. Below is the code (it's a little mixed, apologies for that, but it worked): file.txt


```python
!pip install flask-ngrok

from flask_ngrok import run_with_ngrok
from flask import Flask, request
from pyngrok import ngrok

from haystack.reader.farm import FARMReader
from haystack.pipeline import ExtractiveQAPipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.utils import fetch_archive_from_http, convert_files_to_dicts, clean_wiki_text

DOC_STORE = InMemoryDocumentStore()

# Download and index the example documents
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
print(dicts[:3])

DOC_STORE.write_documents(dicts)

from haystack.nodes import TfidfRetriever
RETRIEVER = TfidfRetriever(DOC_STORE)
READER = FARMReader(model_name_or_path='deepset/bert-base-cased-squad2',
                    context_window_size=1500,
                    use_gpu=True)

# Initialize pipeline
PIPELINE = ExtractiveQAPipeline(reader=READER, retriever=RETRIEVER)

# Initialize API
app = Flask(__name__)
run_with_ngrok(app)  # starts ngrok when the app is run

@app.route('/')
def get_query():
    """Makes query to doc store via Haystack pipeline.

    :param q: Query string representing the question being asked.
    :type q: str
    """
    q = "covid-19?"
    # get answers
    return PIPELINE.run(query=q, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})

try:
    # Block until CTRL-C or some other terminating event
    app.run()
except KeyboardInterrupt:
    print("Shutting down server.")
    ngrok.kill()
```

ZanSara commented 2 years ago

Thank you very much! I'll close this thread now, but this solution will be a good reference for the future :slightly_smiling_face: