Dicklesworthstone / swiss_army_llama

A FastAPI service for semantic text search using precomputed embeddings and advanced similarity measures, with built-in support for various file types through textract.

failed to load model #3

Closed: samuelsoc closed this issue 1 year ago

samuelsoc commented 1 year ago

Hello @Dicklesworthstone, thank you for the incredible work; this is exactly what I've been trying to do as well. I'm running it following the instructions, but I'm hitting an error saying the models can't be found, even though they were downloaded successfully.

```
(env_llama) samuelrg@mlserverdsturing:~/TEXT_ANALYSIS/API_LLAMA_EMBEDDING/llama_embeddings_fastapi_service$ python llama_2_embeddings_fastapi_server.py
/home/samuelrg/.conda/envs/env_llama/lib/python3.9/site-packages/pydantic/_internal/fields.py:127: UserWarning: Field "model_name" has conflict with protected namespace "model".
You may be able to resolve this warning by setting model_config['protected_namespaces'] = ().
  warnings.warn(
2023-08-31 03:48:09,907 - INFO - USE_RAMDISK is set to: False
INFO: Started server process [3494]
INFO: Waiting for application startup.
2023-08-31 03:48:09,966 - INFO - Initializing database, creating tables, and setting SQLite PRAGMAs...
2023-08-31 03:48:09,972 - INFO - Executed SQLite PRAGMA: PRAGMA journal_mode=WAL;
2023-08-31 03:48:09,972 - INFO - Justification: Set SQLite to use Write-Ahead Logging (WAL) mode (from default DELETE mode) so that reads and writes can occur simultaneously
2023-08-31 03:48:09,973 - INFO - Executed SQLite PRAGMA: PRAGMA synchronous = NORMAL;
2023-08-31 03:48:09,973 - INFO - Justification: Set synchronous mode to NORMAL (from FULL) so that writes are not blocked by reads
2023-08-31 03:48:09,974 - INFO - Executed SQLite PRAGMA: PRAGMA cache_size = -1048576;
2023-08-31 03:48:09,974 - INFO - Justification: Set cache size to 1GB (from default 2MB) so that more data can be cached in memory and not read from disk; to make this 256MB, set it to -262144 instead
2023-08-31 03:48:09,975 - INFO - Executed SQLite PRAGMA: PRAGMA busy_timeout = 2000;
2023-08-31 03:48:09,976 - INFO - Justification: Increase the busy timeout to 2 seconds so that the database waits
2023-08-31 03:48:09,977 - INFO - Executed SQLite PRAGMA: PRAGMA wal_autocheckpoint = 100;
2023-08-31 03:48:09,977 - INFO - Justification: Set the WAL autocheckpoint to 100 (from default 1000) so that the WAL file is checkpointed more frequently
2023-08-31 03:48:09,984 - INFO - Database initialization completed.
2023-08-31 03:48:09,984 - INFO - Initializing process of creating set of input hash/model_name combinations that are either currently being processed or have already been processed...
2023-08-31 03:48:10,025 - INFO - Checking models directory...
2023-08-31 03:48:10,025 - INFO - Models directory exists: /home/samuelrg/TEXT_ANALYSIS/API_LLAMA_EMBEDDING/llama_embeddings_fastapi_service/models
2023-08-31 03:48:10,025 - INFO - File already exists: /home/samuelrg/TEXT_ANALYSIS/API_LLAMA_EMBEDDING/llama_embeddings_fastapi_service/models/llama2_7b_chat_uncensored.ggmlv3.q3_K_L.bin
2023-08-31 03:48:10,025 - INFO - File already exists: /home/samuelrg/TEXT_ANALYSIS/API_LLAMA_EMBEDDING/llama_embeddings_fastapi_service/models/wizardlm-1.0-uncensored-llama2-13b.ggmlv3.q3_K_L.bin
2023-08-31 03:48:10,026 - INFO - File already exists: /home/samuelrg/TEXT_ANALYSIS/API_LLAMA_EMBEDDING/llama_embeddings_fastapi_service/models/ggml-model-f32.bin
2023-08-31 03:48:10,026 - INFO - Model downloads completed.
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from /home/samuelrg/TEXT_ANALYSIS/API_LLAMA_EMBEDDING/llama_embeddings_fastapi_service/models/llama2_7b_chat_uncensored.ggmlv3.q3_K_L.bin
llama_load_model_from_file: failed to load model
2023-08-31 03:48:10,027 - ERROR - Exception occurred while loading the model: 1 validation error for LlamaCppEmbeddings
root
  Could not load Llama model from path: /home/samuelrg/TEXT_ANALYSIS/API_LLAMA_EMBEDDING/llama_embeddings_fastapi_service/models/llama2_7b_chat_uncensored.ggmlv3.q3_K_L.bin. Received error (type=value_error)
2023-08-31 03:48:10,028 - ERROR - No model file found matching: llama2_7b_chat_uncensored.ggmlv3.q3_K_L.bin
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from /home/samuelrg/TEXT_ANALYSIS/API_LLAMA_EMBEDDING/llama_embeddings_fastapi_service/models/wizardlm-1.0-uncensored-llama2-13b.ggmlv3.q3_K_L.bin
llama_load_model_from_file: failed to load model
2023-08-31 03:48:10,029 - ERROR - Exception occurred while loading the model: 1 validation error for LlamaCppEmbeddings
root
  Could not load Llama model from path: /home/samuelrg/TEXT_ANALYSIS/API_LLAMA_EMBEDDING/llama_embeddings_fastapi_service/models/wizardlm-1.0-uncensored-llama2-13b.ggmlv3.q3_K_L.bin. Received error (type=value_error)
2023-08-31 03:48:10,029 - ERROR - No model file found matching: wizardlm-1.0-uncensored-llama2-13b.ggmlv3.q3_K_L.bin
gguf_init_from_file: invalid magic number 67676d6c
error loading model: llama_model_loader: failed to load model from /home/samuelrg/TEXT_ANALYSIS/API_LLAMA_EMBEDDING/llama_embeddings_fastapi_service/models/ggml-model-f32.bin
llama_load_model_from_file: failed to load model
2023-08-31 03:48:10,030 - ERROR - Exception occurred while loading the model: 1 validation error for LlamaCppEmbeddings
root
  Could not load Llama model from path: /home/samuelrg/TEXT_ANALYSIS/API_LLAMA_EMBEDDING/llama_embeddings_fastapi_service/models/ggml-model-f32.bin. Received error (type=value_error)
2023-08-31 03:48:10,030 - ERROR - No model file found matching: ggml-model-f32.bin
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8089/ (Press CTRL+C to quit)
```
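A note on the magic numbers in the log above: 0x67676a74 and 0x67676d6c are the legacy "ggjt" and "ggml" GGML container magics, while GGUF-era llama.cpp expects the magic "GGUF". A minimal sketch (a hypothetical standalone helper, not part of this repo) that classifies a model file by reading those same four bytes:

```python
# check_model_magic.py -- decode the "invalid magic number" values from the log.
# Hypothetical helper for diagnosis; not part of the repository.
import struct
import sys

LEGACY_GGML_MAGICS = {
    0x67676D6C: "ggml (legacy, unversioned)",
    0x67676D66: "ggmf (legacy)",
    0x67676A74: "ggjt (legacy, used by ggmlv3 files)",
}
GGUF_MAGIC = 0x46554747  # the bytes "GGUF" read as a little-endian uint32

for path in sys.argv[1:]:
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))  # first 4 bytes, little-endian
    if magic in LEGACY_GGML_MAGICS:
        print(f"{path}: {LEGACY_GGML_MAGICS[magic]} -- rejected by GGUF-era llama.cpp")
    elif magic == GGUF_MAGIC:
        print(f"{path}: GGUF -- loadable by current llama.cpp")
    else:
        print(f"{path}: unknown magic 0x{magic:08x}")
```

Run against the three files above, this would report them all as legacy containers, which is exactly why llama_model_loader rejects them.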

wenkph commented 1 year ago

https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/discussions/14

This might be related, since the code depends on llama-cpp-python and the requirements file specifies no versions, so the latest release gets pulled. I will experiment a bit with downgrading and update here if I find something.

Alternatively, we could convert the downloaded models to the newer format during setup; see https://github.com/abetlen/llama-cpp-python.
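For the conversion route, llama.cpp bundles a migration script for legacy ggmlv3 checkpoints. A sketch, with the caveat that the script name and flags have varied across llama.cpp versions, so verify against the checkout you have:

```bash
# Convert a legacy ggmlv3 .bin into a .gguf using the script shipped with
# llama.cpp (script name/flags vary by version; verify before running).
git clone https://github.com/ggerganov/llama.cpp
python llama.cpp/convert-llama-ggmlv3-to-gguf.py \
    --input models/llama2_7b_chat_uncensored.ggmlv3.q3_K_L.bin \
    --output models/llama2_7b_chat_uncensored.q3_K_L.gguf
```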

wenkph commented 1 year ago

Short update:

The problem comes from ggml being discontinued and replaced by gguf. I am able to calculate embeddings by simply replacing the downloaded models in llama_2_embeddings_fastapi_server.py -> download_models() with a gguf model, e.g. https://huggingface.co/TheBloke/Yarn-Llama-2-13B-64K-GGUF/resolve/main/yarn-llama-2-13b-64k.Q4_K_M.gguf
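Concretely, the change amounts to pointing the download list at .gguf artifacts; a hypothetical sketch of what that looks like inside download_models() (the actual variable name in the repo may differ):

```python
# Inside llama_2_embeddings_fastapi_server.py -> download_models().
# Hypothetical variable name; the substantive change is that every URL
# now points at a .gguf file instead of a ggmlv3 .bin.
model_urls = [
    "https://huggingface.co/TheBloke/Yarn-Llama-2-13B-64K-GGUF/resolve/main/yarn-llama-2-13b-64k.Q4_K_M.gguf",
]
```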

However, this breaks the server's functionality in places. An obvious one for me was "get_list_of_available_model_names", which explicitly searches for .bin files (the model files are now .gguf rather than .bin). The above "update" may also lead to other unexpected behavior; I have not tested it extensively.
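For reference, a sketch of the kind of fix needed there, assuming the function globs the models directory (the real implementation in the repo differs; only the extension filter matters):

```python
import glob
import os

# Sketch of get_list_of_available_model_names with the extension filter updated.
# Keeping "*.bin" as a fallback eases the transition for anyone who still has
# legacy files in the models directory.
def get_list_of_available_model_names(models_dir: str) -> list[str]:
    model_files = glob.glob(os.path.join(models_dir, "*.gguf"))
    model_files += glob.glob(os.path.join(models_dir, "*.bin"))  # legacy fallback
    return [os.path.splitext(os.path.basename(p))[0] for p in model_files]
```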

The error might also be resolvable by pinning the llama-cpp-python library to a version from before Aug. 21, when the update to gguf was rolled out. I have not looked into that, but it might be the preferable solution if you want a quick fix that preserves all of the originally intended functionality.
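If pinning is the route taken, the relevant boundary is the llama-cpp-python release that made the GGUF switch; to my understanding, 0.1.78 was the last GGML-compatible release before 0.1.79 moved to GGUF in late August 2023, so the pin in requirements.txt would look something like:

```
# requirements.txt -- pin llama-cpp-python to a pre-GGUF release.
# 0.1.78 is believed to be the last GGML-compatible version; verify on PyPI.
llama-cpp-python==0.1.78
```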

Dicklesworthstone commented 1 year ago

Great, thanks for the good investigative work. I hate how often the ggml project changes and deprecates its file format! Hopefully this time will be the last.

You’re right, I’ll need to update to make my library work. I’ll poke around; it might be as simple as just looking for the new file extension.

Dicklesworthstone commented 1 year ago

OK, I updated it to work with the gguf model files and changed the default models to the new Yarn models with 128k context. I also removed the bge base model, which I don't think ever worked properly, and disabled the RAM disk in the default .env file.
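For anyone updating an existing checkout, the RAM-disk switch lives in the .env file; a sketch of the relevant entry (the USE_RAMDISK name comes from the startup log earlier in this thread; any other settings in the real file are omitted here):

```
# .env (excerpt) -- RAM disk now disabled by default
USE_RAMDISK=False
```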

samuelsoc commented 1 year ago

It works perfectly!