danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://docs.danswer.dev/

Danswer not starting: None of PyTorch, TensorFlow >= 2.0, or Flax have been found #1384

Closed. robinwo closed this issue 3 months ago.

robinwo commented 3 months ago

I followed the quickstart tutorial step-by-step, tried this on both EC2 and DigitalOcean, but the same issue persists.

While the stack keeps reporting Nginx is not ready yet, retrying in 5 seconds..., the Docker logs of danswer-stack-api_server-1 show the following: None of PyTorch, TensorFlow >= 2.0, or Flax have been found.

I manually installed PyTorch on the droplet, but that changed nothing (it isn't available inside the container anyway, is it?). Which makes me wonder: am I doing something wrong? On my Mac it runs like a charm locally, but spinning it up in the cloud does not work for me.

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
INFO:     Started server process [13]
INFO:     Waiting for application startup.
04/25/2024 06:16:20 PM variable_functionality.py  27 : Fetching versioned implementation for danswer.auth.users.verify_auth_setting
04/25/2024 06:16:20 PM             users.py  69 : Using Auth Type: basic
04/25/2024 06:16:21 PM              main.py 152 : Danswer API Key: dn_5zI2FIsbLXhjvRIi5UdvEiu8ftGOXJXPjhVOqkUe
04/25/2024 06:16:21 PM              main.py 160 : Using LLM Provider: openai
04/25/2024 06:16:21 PM              main.py 162 : Using LLM Model Version: gpt-4
04/25/2024 06:16:21 PM              main.py 164 : Using Fast LLM Model Version: gpt-3.5-turbo-16k-0613
04/25/2024 06:16:21 PM          chat_llm.py  49 : LLM Model Class: ChatLiteLLM, Model Config: {'model': 'openai/gpt-4', 'request_timeout': 60.0, 'model_kwargs': {'frequency_penalty': 0, 'presence_penalty': 0}, 'n': 1, 'max_tokens': 1024}
04/25/2024 06:16:21 PM              main.py 200 : Using Embedding model: "intfloat/e5-base-v2"
04/25/2024 06:16:21 PM              main.py 202 : Query embedding prefix: "query: "
04/25/2024 06:16:21 PM              main.py 203 : Passage embedding prefix: "passage: "
04/25/2024 06:16:21 PM              main.py 210 : Verifying query preprocessing (NLTK) data is downloaded
04/25/2024 06:16:21 PM     search_runner.py  45 : stopwords is already downloaded.
04/25/2024 06:16:21 PM     search_runner.py  48 : Downloading wordnet...
04/25/2024 06:16:21 PM     search_runner.py  50 : wordnet downloaded successfully.
04/25/2024 06:16:21 PM     search_runner.py  45 : punkt is already downloaded.
04/25/2024 06:16:21 PM              main.py 213 : Verifying default connector/credential exist.
04/25/2024 06:16:21 PM              main.py 218 : Loading default Prompts and Personas
04/25/2024 06:16:21 PM              main.py 222 : Verifying Document Index(s) is/are available.
04/25/2024 06:16:21 PM              main.py 244 : Model Server: http://inference_model_server:9000
04/25/2024 06:16:22 PM search_nlp_models.py 199 : Failed to run test embedding, retrying in 5 seconds...
04/25/2024 06:16:27 PM search_nlp_models.py 199 : Failed to run test embedding, retrying in 5 seconds...
04/25/2024 06:16:32 PM search_nlp_models.py 199 : Failed to run test embedding, retrying in 5 seconds...
04/25/2024 06:16:37 PM search_nlp_models.py 199 : Failed to run test embedding, retrying in 5 seconds...
04/25/2024 06:16:42 PM search_nlp_models.py 199 : Failed to run test embedding, retrying in 5 seconds...
04/25/2024 06:16:47 PM search_nlp_models.py 199 : Failed to run test embedding, retrying in 5 seconds...
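The api_server's behaviour in these logs (probe the model server, sleep five seconds between attempts, give up after a fixed number of tries and fail startup) can be sketched in shell roughly as follows. The function name warm_up, the default attempt count, and the health-check URL in the usage comment are illustrative, not Danswer's actual code:

```shell
# Hypothetical sketch of the warm-up/retry loop the logs show.
warm_up() {
    probe=$1             # command that succeeds once the model server is ready
    max_attempts=${2:-6} # illustrative default, not Danswer's real limit
    sleep_secs=${3:-5}
    attempt=1
    until eval "$probe"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Failed to run test embedding." >&2
            return 1
        fi
        echo "Failed to run test embedding, retrying in ${sleep_secs} seconds..."
        attempt=$((attempt + 1))
        sleep "$sleep_secs"
    done
    echo "Embedding warm-up succeeded."
}

# On a live host the probe might be an HTTP check against the model server
# port shown later in the logs (URL is an assumption):
#   warm_up 'curl -sf http://localhost:9000'
```

The key point is that the loop never installs or falls back to a local model: if the model server container isn't reachable, the api_server can only retry and then abort, which matches the "Application startup failed. Exiting." traceback below.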
dstrzelec commented 3 months ago

I'm getting the same...

After the log messages posted above, it eventually times out with this exception:

ERROR:    Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 734, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/usr/local/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/main.py", line 245, in lifespan
    warm_up_encoders(
  File "/app/danswer/search/search_nlp_models.py", line 203, in warm_up_encoders
    raise Exception("Failed to run test embedding.")
Exception: Failed to run test embedding.

ERROR:    Application startup failed. Exiting.
Weves commented 3 months ago

Hey @dstrzelec / @robinwo! Could you post the output of:

docker logs danswer-stack-inference_model_server-1

robinwo commented 3 months ago

Interesting, that container does not exist: Error response from daemon: No such container: danswer-stack-inference_model_server-1

Containers running:

[ec2-user@ip-172-31-35-5 ~]$ sudo docker ps
CONTAINER ID   IMAGE                               COMMAND                  CREATED              STATUS              PORTS                                                                                      NAMES
682f52d784d2   nginx:1.23.4-alpine                 "/docker-entrypoint.…"   About a minute ago   Up About a minute   0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp                   danswer-stack-nginx-1
da3313ef372a   danswer/danswer-web-server:latest   "docker-entrypoint.s…"   About a minute ago   Up About a minute                                                                                              danswer-stack-web_server-1
a71dc28f4843   danswer/danswer-backend:latest      "/bin/sh -c 'alembic…"   About a minute ago   Up About a minute                                                                                              danswer-stack-api_server-1
79d4a1811198   vespaengine/vespa:8.277.17          "/usr/local/bin/star…"   About a minute ago   Up About a minute   0.0.0.0:8081->8081/tcp, :::8081->8081/tcp, 0.0.0.0:19071->19071/tcp, :::19071->19071/tcp   danswer-stack-index-1
0b3a2559b023   postgres:15.2-alpine                "docker-entrypoint.s…"   About a minute ago   Up About a minute   5432/tcp                                                                                   danswer-stack-relational_db-1
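The docker ps output above is missing exactly the container the api_server keeps polling: the inference model server. A throwaway helper like the one below makes that kind of gap obvious; the function name check_stack and the expected-service list (taken from this thread's output plus the missing model server) are my own, not part of Danswer:

```shell
# Hypothetical diagnostic: report which expected Danswer services have no
# matching running container. Pass in newline-separated container names.
check_stack() {
    running=$1
    for svc in nginx web_server api_server inference_model_server index relational_db; do
        if printf '%s\n' "$running" | grep -q "$svc"; then
            echo "OK      $svc"
        else
            echo "MISSING $svc"
        fi
    done
}

# Usage on a live host (illustrative):
#   check_stack "$(sudo docker ps --format '{{.Names}}')"
```

Run against the container names listed above, this would flag inference_model_server as missing while everything else reports OK.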
Weves commented 3 months ago

Could you try pulling the latest version of Danswer (git pull) and bringing things up again?

robinwo commented 3 months ago

Yup, I'm on the latest main branch.

After letting Danswer run for a while, I get the following final error:

ERROR:    Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 734, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/usr/local/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/main.py", line 245, in lifespan
    warm_up_encoders(
  File "/app/danswer/search/search_nlp_models.py", line 203, in warm_up_encoders
    raise Exception("Failed to run test embedding.")
Exception: Failed to run test embedding.

ERROR:    Application startup failed. Exiting.

This is a brand-new EC2 instance with no configuration other than following the Danswer getting-started guide: SSH into the machine and start copy/pasting commands.

robinwo commented 3 months ago

Solved with this workaround:

Only then run the letsencrypt script to enable HTTPS for your instance. This rebuilds some of the containers and keeps the inference model server running.

Looks like the quickstart needs an update; the letsencrypt script skips some installation steps.

Weves commented 3 months ago

Thanks for pointing this out @robinwo! Yea, I think it's an issue with the letsencrypt script; a fix will be out soon.

Weves commented 3 months ago

Fixed in https://github.com/danswer-ai/danswer/pull/1395! Thanks for the heads-up, all!