Open BBC-Esq opened 7 months ago
Hello!
The issue originates in https://github.com/xlang-ai/instructor-embedding, which was created explicitly for Sentence Transformers 2.2.2. They haven't kept their code up to date with the recent Sentence Transformers updates, hence the failures. This is why HuggingFaceInstructEmbeddings fails while HuggingFaceEmbeddings and HuggingFaceBgeEmbeddings work.
A good solution would be to try this PR: https://github.com/xlang-ai/instructor-embedding/pull/112 with:
pip install git+https://github.com/SilasMarvin/instructor-embedding.git@silas-update-for-newer-sentence-transformers
and the most recent sentence-transformers. That combination should work correctly.
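For readers who want to see the failure mode in isolation, here is a minimal sketch using toy classes (NewBase and OldInstructor are hypothetical stand-ins, not the real library code). It assumes the newer base class passes a token keyword to the loader while the subclass still overrides the loader with an older, single-argument signature:

```python
class NewBase:
    """Toy stand-in for a newer base class that forwards `token` to the loader."""

    def __init__(self, model_name_or_path, token=None):
        # The base class calls the (possibly overridden) loader with a keyword
        # argument that older overrides do not know about.
        self.modules = self._load_sbert_model(model_name_or_path, token=token)

    def _load_sbert_model(self, model_name_or_path, token=None):
        return [f"loaded:{model_name_or_path}"]


class OldInstructor(NewBase):
    """Toy stand-in for a subclass written against the older loader signature."""

    def _load_sbert_model(self, model_path):  # no `token` parameter
        return [f"instructor:{model_path}"]


def try_load(cls, name):
    """Return 'ok' if construction succeeds, else the TypeError message."""
    try:
        cls(name)
        return "ok"
    except TypeError as exc:
        return str(exc)


base_result = try_load(NewBase, "some-model")
override_result = try_load(OldInstructor, "some-model")
print(base_result)
print(override_result)
```

The second call fails with the same kind of "unexpected keyword argument 'token'" message reported in this thread, because the override never declared that parameter.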
Thanks, I checked it out. Now I'm getting the error below. My program downloads the instructor models into a specific directory. It does not use the default "cache" location. I do this for various reasons. As such, I specify the path to the model rather than the Huggingface repo ID when instantiating the model...I'm guessing this is the reason why I'm getting this error...Any clue?
Traceback (most recent call last):
File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\gui_tabs_databases.py", line 23, in run
create_vector_db.run() # calls database_interactions.py
^^^^^^^^^^^^^^^^^^^^^^
File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\database_interactions.py", line 198, in run
embeddings, encode_kwargs = self.initialize_vector_model(EMBEDDING_MODEL_NAME, config_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\database_interactions.py", line 70, in initialize_vector_model
model = HuggingFaceInstructEmbeddings(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\langchain_community\embeddings\huggingface.py", line 158, in __init__
self.client = INSTRUCTOR(
^^^^^^^^^^^
File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\sentence_transformers\SentenceTransformer.py", line 191, in __init__
modules = self._load_sbert_model(
^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\InstructorEmbedding\instructor.py", line 455, in _load_sbert_model
model_path = snapshot_download(**download_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\huggingface_hub\utils\_validators.py", line 111, in _inner_fn
validate_repo_id(arg_value)
File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\huggingface_hub\utils\_validators.py", line 159, in validate_repo_id
raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'D:/Scripts/ChromaDB-Plugin-for-LM-Studio/v4_3 - working/Embedding_Models/hkunlp--instructor-base'. Use `repo_type` argument if needed.
I resolved this error by using the huggingface repo id instead:
hkunlp/instructor-base
I'm guessing this IS NOT a true fix, however, since I notice that the "_load_sbert_model" method within sentence-transformers has a parameter named model_name_or_path...implying that it'll accept a repo id or path. Here's my code snippet:
if "instructor" in embedding_model_name:
    encode_kwargs['show_progress_bar'] = True
    model = HuggingFaceInstructEmbeddings(
        model_name="hkunlp/instructor-base",
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
    )
To temporarily obviate the issue I simply tried "hkunlp/instructor-base" instead of "embedding_model_name"...I did this to get to the next troubleshooting step for the time being...
IT WORKED! The database was successfully created. MOREOVER, I was able to successfully search it!
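The HFValidationError above appears when a local directory path reaches a function that only accepts repo ids. A minimal sketch of the usual guard follows, assuming the loader should download only when the argument is not an existing directory. resolve_model_dir and fake_download are hypothetical names; fake_download stands in for a real download call such as snapshot_download:

```python
import os
import tempfile


def resolve_model_dir(model_name_or_path, download_fn):
    """Return a local model directory, downloading only when needed.

    `download_fn` is a stand-in for a hub download function; it is only
    called when `model_name_or_path` is not an existing directory.
    """
    if os.path.isdir(model_name_or_path):
        return model_name_or_path  # already local: skip repo id validation
    return download_fn(model_name_or_path)


download_calls = []


def fake_download(repo_id):
    """Record the request and pretend the repo was saved under /cache."""
    download_calls.append(repo_id)
    return "/cache/" + repo_id


with tempfile.TemporaryDirectory() as local_dir:
    local_result = resolve_model_dir(local_dir, fake_download)
    local_ok = local_result == local_dir

hub_result = resolve_model_dir("hkunlp/instructor-base", fake_download)
```

With a guard like this, a path such as the Embedding_Models directory in the traceback would never be passed to repo id validation.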
SUMMARY:
The script provided at https://github.com/SilasMarvin/instructor-embedding/tree/silas-update-for-newer-sentence-transformers fixes the error TypeError: INSTRUCTOR._load_sbert_model() got an unexpected keyword argument 'token'
Question: Are you willing to modify SentenceTransformer's _load_sbert_model method such that it works with the original instructorembedding library? That would make it unnecessary to rely on the modification by SilasMarvin. I only ask because instructorembedding is obviously not being updated even though it's their responsibility to do so...
It seems to me (as a layperson) that you'd simply need an intermediary function between how the instructorembedding library expects to load a model and how sentence-transformers does it now. This would also make HuggingFaceInstructEmbeddings from Langchain work as-is. In the interest of full disclosure, I reviewed the HuggingFaceInstructEmbeddings class within Langchain's source code and, just like instructorembedding, it hasn't been updated in eons so...
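One way to picture the "intermediary function" idea is a small adapter that forwards only the keyword arguments the target function declares. This is a sketch under that assumption, not something either library provides; call_with_supported_kwargs and old_style_loader are hypothetical names:

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call `func`, dropping keyword arguments its signature does not declare."""
    params = inspect.signature(func).parameters
    # If the function takes **kwargs, it can accept everything as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return func(*args, **kwargs)
    accepted = {k: v for k, v in kwargs.items() if k in params}
    return func(*args, **accepted)


def old_style_loader(model_path):
    """Toy loader with an older, single-argument signature."""
    return f"loaded {model_path}"


result = call_with_supported_kwargs(
    old_style_loader,
    "hkunlp/instructor-base",
    token=None,          # unknown to the old loader, so it is dropped
    cache_folder=None,   # likewise dropped
)
print(result)
```

The trade-off is that dropped arguments are ignored silently, so a token or cache folder supplied by the caller would have no effect on the old loader.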
Basically, even though it's the responsibility of the instructorembedding and/or Langchain people to update their code in compliance with sentence-transformers, I'm asking if sentence-transformers would accommodate them and provide a fix in its source code instead?
The benefit would be that Instructor models would work with newer versions of the sentence-transformers library out of the box and people like me could still use pip install instructorembedding instead of relying on a specific branch of an unofficial fork of the instructorembedding repo. Doesn't hurt to ask, right?
Thanks again. Please let me know if there's a way I can contribute.
FINALLY, regarding the error of not being able to load a model locally, I solved this issue by using the cache_folder parameter from langchain specified here:
I assume that this connects with the cache_folder parameter within sentence-transformers here:
https://www.sbert.net/docs/package_reference/SentenceTransformer.html
So this narrow issue, at least, seems solved. Just thought others might want to know.
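For anyone landing here later, a rough sketch of what that looks like: keep the repo id as the model name and point cache_folder at the custom download directory. instructor_kwargs is a hypothetical helper, and the "Embedding_Models" directory name is taken from the traceback earlier in this thread:

```python
from pathlib import Path


def instructor_kwargs(repo_id, models_dir, device="cpu"):
    """Build keyword arguments for the embeddings class (hypothetical helper)."""
    return {
        "model_name": repo_id,                  # repo id, not a filesystem path
        "cache_folder": str(Path(models_dir)),  # where the model files are stored
        "model_kwargs": {"device": device},
    }


kwargs = instructor_kwargs("hkunlp/instructor-base", "Embedding_Models")

# With langchain_community installed, the dict would be used like this:
#   from langchain_community.embeddings import HuggingFaceInstructEmbeddings
#   model = HuggingFaceInstructEmbeddings(**kwargs)
```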
The fix for this just got merged into Instructor Embedding: https://github.com/xlang-ai/instructor-embedding/commit/5cca65eb0ed78ab354b086a5386fb2c528809caa
This is a challenging issue that I've been working on...First, here is my entire script:
SCRIPT
```
import shutil
import yaml
import gc
from langchain_community.docstore.document import Document
from langchain_community.embeddings import HuggingFaceInstructEmbeddings, HuggingFaceEmbeddings, HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import TileDB
from document_processor import load_documents, split_documents
from loader_images import specify_image_loader
import torch
from utilities import validate_symbolic_links, my_cprint
from pathlib import Path
import os
import logging
from PySide6.QtCore import QDir
import time
import pickle

logging.basicConfig(
    level=logging.INFO,
    format='%(name)s - %(pathname)s:%(lineno)s - %(funcName)s'
)
logging.getLogger('chromadb.db.duckdb').setLevel(logging.WARNING)
logging.getLogger('sentence_transformers').setLevel(logging.WARNING)

class CreateVectorDB:
    def __init__(self, database_name):
        self.ROOT_DIRECTORY = Path(__file__).resolve().parent
        self.SOURCE_DIRECTORY = self.ROOT_DIRECTORY / "Docs_for_DB"
        self.PERSIST_DIRECTORY = self.ROOT_DIRECTORY / "Vector_DB" / database_name
        self.SAVE_JSON_DIRECTORY = self.ROOT_DIRECTORY / "Docs_for_DB" / database_name

    def load_config(self, root_directory):
        with open(root_directory / "config.yaml", 'r', encoding='utf-8') as stream:
            return yaml.safe_load(stream)

    def initialize_vector_model(self, embedding_model_name, config_data):
        EMBEDDING_MODEL_NAME = config_data.get("EMBEDDING_MODEL_NAME")
        compute_device = config_data['Compute_Device']['database_creation']
        model_kwargs = {"device": compute_device}
        encode_kwargs = {'normalize_embeddings': False, 'batch_size': 8}

        if compute_device.lower() == 'cpu':
            encode_kwargs['batch_size'] = 2
        else:
            batch_size_mapping = {
                'sentence-t5-xxl': 1,
                ('instructor-xl', 'sentence-t5-xl'): 2,
                'instructor-large': 3,
                ('jina-embedding-l', 'bge-large', 'gte-large', 'roberta-large', 'mxbai-embed-large-v1'): 4,
                'jina-embedding-s': 9,
                ('bge-small', 'gte-small'): 10,
                ('MiniLM',): 20,
            }

            for key, value in batch_size_mapping.items():
                if isinstance(key, tuple):
                    if any(model_name_part in EMBEDDING_MODEL_NAME for model_name_part in key):
                        encode_kwargs['batch_size'] = value
                        break
                else:
                    if key in EMBEDDING_MODEL_NAME:
                        encode_kwargs['batch_size'] = value
                        break

        my_cprint(f"Vector model initialized with a batch size of {encode_kwargs['batch_size']}", "blue")

        if "instructor" in embedding_model_name:
            embed_instruction = config_data['embedding-models']['instructor'].get('embed_instruction')
            query_instruction = config_data['embedding-models']['instructor'].get('query_instruction')
            encode_kwargs['show_progress_bar'] = True

            model = HuggingFaceInstructEmbeddings(
                model_name=embedding_model_name,
                model_kwargs=model_kwargs,
                embed_instruction=embed_instruction,
                query_instruction=query_instruction,
                encode_kwargs=encode_kwargs
            )

        elif "bge" in embedding_model_name:
            query_instruction = config_data['embedding-models']['bge'].get('query_instruction')
            encode_kwargs['show_progress_bar'] = True

            model = HuggingFaceBgeEmbeddings(
                model_name=embedding_model_name,
                model_kwargs=model_kwargs,
                query_instruction=query_instruction,
                encode_kwargs=encode_kwargs
            )

        else:
            model = HuggingFaceEmbeddings(
                model_name=embedding_model_name,
                show_progress=True,
                model_kwargs=model_kwargs,
                encode_kwargs=encode_kwargs
            )

        return model, encode_kwargs

    def create_database(self, texts, embeddings):
        my_cprint("Creating vectors and database...\n\n NOTE:\n\nNOTE: The progress bar only relates to computing vectors, not inserting them into the database. Rest assured, after it reaches 100% it is still working unless you get an error message.\n", "yellow")

        start_time = time.time()

        if not self.PERSIST_DIRECTORY.exists():
            self.PERSIST_DIRECTORY.mkdir(parents=True, exist_ok=True)

        db = TileDB.from_documents(
            documents=texts,
            embedding=embeddings,
            index_uri=str(self.PERSIST_DIRECTORY),
            allow_dangerous_deserialization=True,
            metric="euclidean",
            index_type="FLAT",
        )

        print("Database created.")

        end_time = time.time()
        elapsed_time = end_time - start_time

        my_cprint("Database saved.", "cyan")
        print(f"Creation of vectors and inserting into the database took {elapsed_time:.2f} seconds.")

    def save_documents_to_json(self, json_docs_to_save):
        self.SAVE_JSON_DIRECTORY.mkdir(parents=True, exist_ok=True)

        for document in json_docs_to_save:
            document_hash = document.metadata.get('hash', None)
            if document_hash:
                json_filename = f"{document_hash}.json"
                json_file_path = self.SAVE_JSON_DIRECTORY / json_filename

                actual_file_path = document.metadata.get('file_path')
                if os.path.islink(actual_file_path):
                    resolved_path = os.path.realpath(actual_file_path)
                    document.metadata['file_path'] = resolved_path

                document_json = document.json(indent=4)

                with open(json_file_path, 'w', encoding='utf-8') as json_file:
                    json_file.write(document_json)
            else:
                print("Warning: Document missing 'hash' in metadata. Skipping JSON creation.")

    def load_audio_documents(self, source_dir: Path = None) -> list:
        if source_dir is None:
            source_dir = self.SOURCE_DIRECTORY
        json_paths = [f for f in source_dir.iterdir() if f.suffix.lower() == '.json']
        docs = []

        for json_path in json_paths:
            try:
                with open(json_path, 'r', encoding='utf-8') as json_file:
                    json_str = json_file.read()
                    doc = Document.parse_raw(json_str)
                    docs.append(doc)
            except Exception as e:
                my_cprint(f"Error loading {json_path}: {e}", "red")

        return docs

    def clear_docs_for_db_folder(self):
        for item in self.SOURCE_DIRECTORY.iterdir():
            if item.is_file() or item.is_symlink():
                try:
                    item.unlink()
                except Exception as e:
                    print(f"Failed to delete {item}: {e}")

    def run(self):
        config_data = self.load_config(self.ROOT_DIRECTORY)
        EMBEDDING_MODEL_NAME = config_data.get("EMBEDDING_MODEL_NAME")

        # load non-image/non-audio documents
        documents = load_documents(self.SOURCE_DIRECTORY)

        # load image documents
        image_documents = specify_image_loader()
        documents.extend(image_documents)

        json_docs_to_save = documents

        # load audio documents
        audio_documents = self.load_audio_documents()  # Now calling the method internally
        documents.extend(audio_documents)
        if len(audio_documents) > 0:
            print(f"Loaded {len(audio_documents)} audio transcription(s)...")

        # split each document in the list of documents
        texts = split_documents(documents)

        # initialize vector model
        embeddings, encode_kwargs = self.initialize_vector_model(EMBEDDING_MODEL_NAME, config_data)

        # create database
        self.create_database(texts, embeddings)

        self.save_documents_to_json(json_docs_to_save)

        del embeddings.client
        del embeddings
        torch.cuda.empty_cache()
        gc.collect()
        my_cprint("Embedding model removed from memory.", "red")

        # clear ingest folder
        self.clear_docs_for_db_folder()
        print("Cleared all files and symlinks in Docs_for_DB folder.")
```
This works fine when using sentence-transformers==2.2.2. However, when I upgrade to sentence-transformers==2.6.1 I get this error:
ERROR
```
Traceback (most recent call last):
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\gui_tabs_databases.py", line 23, in run
    create_vector_db.run() # calls database_interactions.py
    ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\database_interactions.py", line 193, in run
    embeddings, encode_kwargs = self.initialize_vector_model(EMBEDDING_MODEL_NAME, config_data)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\database_interactions.py", line 72, in initialize_vector_model
    model = HuggingFaceInstructEmbeddings(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\langchain_community\embeddings\huggingface.py", line 153, in __init__
    self.client = INSTRUCTOR(
                  ^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\sentence_transformers\SentenceTransformer.py", line 191, in __init__
    modules = self._load_sbert_model(
              ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: INSTRUCTOR._load_sbert_model() got an unexpected keyword argument 'token'
```
I've verified that when using a BGE model (via HuggingFaceBgeEmbeddings), GTE model (via HuggingFaceEmbeddings) and all-mpnet-base-v2 (via HuggingFaceEmbeddings) everything works fine. I've tried every which way to get it to work...

Since I really like the "instructor" models in my program, this forces me to stay at sentence-transformers==2.2.2 or, alternatively, abandon them in order to upgrade so I can use newer models (e.g. mxbai-embed-large-v1). I wouldn't normally ask, but I've spent dozens of hours trying to solve this...ranging from using SentenceTransformers directly pursuant to the API on your website to custom wrappers, etc.

Can anyone help me, @tomaarsen in particular if he has time? I don't know if this is an issue for sentence-transformers itself, its integration with HuggingFaceInstructEmbeddings from Langchain, or just my code...Thanks in advance!

[EDIT] I am aware that Instructor models are unique in that the prompt is not included in pooling, as stated on your website's instructions/examples, and I DID examine SentenceTransformers itself and see where you took that into account: (taken from version 2.6.0)
I just simply can't figure out why HuggingFaceInstructEmbeddings isn't working while HuggingFaceEmbeddings and HuggingFaceBgeEmbeddings work fine when I pip install sentence-transformers above 2.2.2...This is literally the only issue that has stymied my program from upgrading the crucial dependency that is sentence-transformers...Thanks again and love the repo!