UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

INSTRUCTOR models not working with sentence-transformers via langchain #2567

Open BBC-Esq opened 8 months ago

BBC-Esq commented 8 months ago

This is a challenging issue that I've been working on...First, here is my entire script:

SCRIPT

```
import shutil
import yaml
import gc
from langchain_community.docstore.document import Document
from langchain_community.embeddings import HuggingFaceInstructEmbeddings, HuggingFaceEmbeddings, HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import TileDB
from document_processor import load_documents, split_documents
from loader_images import specify_image_loader
import torch
from utilities import validate_symbolic_links, my_cprint
from pathlib import Path
import os
import logging
from PySide6.QtCore import QDir
import time
import pickle

logging.basicConfig(
    level=logging.INFO,
    format='%(name)s - %(pathname)s:%(lineno)s - %(funcName)s'
)
logging.getLogger('chromadb.db.duckdb').setLevel(logging.WARNING)
logging.getLogger('sentence_transformers').setLevel(logging.WARNING)

class CreateVectorDB:
    def __init__(self, database_name):
        self.ROOT_DIRECTORY = Path(__file__).resolve().parent
        self.SOURCE_DIRECTORY = self.ROOT_DIRECTORY / "Docs_for_DB"
        self.PERSIST_DIRECTORY = self.ROOT_DIRECTORY / "Vector_DB" / database_name
        self.SAVE_JSON_DIRECTORY = self.ROOT_DIRECTORY / "Docs_for_DB" / database_name

    def load_config(self, root_directory):
        with open(root_directory / "config.yaml", 'r', encoding='utf-8') as stream:
            return yaml.safe_load(stream)

    def initialize_vector_model(self, embedding_model_name, config_data):
        EMBEDDING_MODEL_NAME = config_data.get("EMBEDDING_MODEL_NAME")
        compute_device = config_data['Compute_Device']['database_creation']
        model_kwargs = {"device": compute_device}
        encode_kwargs = {'normalize_embeddings': False, 'batch_size': 8}

        if compute_device.lower() == 'cpu':
            encode_kwargs['batch_size'] = 2
        else:
            batch_size_mapping = {
                'sentence-t5-xxl': 1,
                ('instructor-xl', 'sentence-t5-xl'): 2,
                'instructor-large': 3,
                ('jina-embedding-l', 'bge-large', 'gte-large', 'roberta-large', 'mxbai-embed-large-v1'): 4,
                'jina-embedding-s': 9,
                ('bge-small', 'gte-small'): 10,
                ('MiniLM',): 20,
            }

            for key, value in batch_size_mapping.items():
                if isinstance(key, tuple):
                    if any(model_name_part in EMBEDDING_MODEL_NAME for model_name_part in key):
                        encode_kwargs['batch_size'] = value
                        break
                else:
                    if key in EMBEDDING_MODEL_NAME:
                        encode_kwargs['batch_size'] = value
                        break

        my_cprint(f"Vector model initialized with a batch size of {encode_kwargs['batch_size']}", "blue")

        if "instructor" in embedding_model_name:
            embed_instruction = config_data['embedding-models']['instructor'].get('embed_instruction')
            query_instruction = config_data['embedding-models']['instructor'].get('query_instruction')
            encode_kwargs['show_progress_bar'] = True

            model = HuggingFaceInstructEmbeddings(
                model_name=embedding_model_name,
                model_kwargs=model_kwargs,
                embed_instruction=embed_instruction,
                query_instruction=query_instruction,
                encode_kwargs=encode_kwargs
            )

        elif "bge" in embedding_model_name:
            query_instruction = config_data['embedding-models']['bge'].get('query_instruction')
            encode_kwargs['show_progress_bar'] = True

            model = HuggingFaceBgeEmbeddings(
                model_name=embedding_model_name,
                model_kwargs=model_kwargs,
                query_instruction=query_instruction,
                encode_kwargs=encode_kwargs
            )

        else:
            model = HuggingFaceEmbeddings(
                model_name=embedding_model_name,
                show_progress=True,
                model_kwargs=model_kwargs,
                encode_kwargs=encode_kwargs
            )

        return model, encode_kwargs

    def create_database(self, texts, embeddings):
        my_cprint("Creating vectors and database...\n\n NOTE:\n\nNOTE: The progress bar only relates to computing vectors, not inserting them into the database. Rest assured, after it reaches 100% it is still working unless you get an error message.\n", "yellow")

        start_time = time.time()

        if not self.PERSIST_DIRECTORY.exists():
            self.PERSIST_DIRECTORY.mkdir(parents=True, exist_ok=True)

        db = TileDB.from_documents(
            documents=texts,
            embedding=embeddings,
            index_uri=str(self.PERSIST_DIRECTORY),
            allow_dangerous_deserialization=True,
            metric="euclidean",
            index_type="FLAT",
        )

        print("Database created.")

        end_time = time.time()
        elapsed_time = end_time - start_time

        my_cprint("Database saved.", "cyan")
        print(f"Creation of vectors and inserting into the database took {elapsed_time:.2f} seconds.")

    def save_documents_to_json(self, json_docs_to_save):
        self.SAVE_JSON_DIRECTORY.mkdir(parents=True, exist_ok=True)

        for document in json_docs_to_save:
            document_hash = document.metadata.get('hash', None)
            if document_hash:
                json_filename = f"{document_hash}.json"
                json_file_path = self.SAVE_JSON_DIRECTORY / json_filename

                actual_file_path = document.metadata.get('file_path')
                if os.path.islink(actual_file_path):
                    resolved_path = os.path.realpath(actual_file_path)
                    document.metadata['file_path'] = resolved_path

                document_json = document.json(indent=4)

                with open(json_file_path, 'w', encoding='utf-8') as json_file:
                    json_file.write(document_json)
            else:
                print("Warning: Document missing 'hash' in metadata. Skipping JSON creation.")

    def load_audio_documents(self, source_dir: Path = None) -> list:
        if source_dir is None:
            source_dir = self.SOURCE_DIRECTORY
        json_paths = [f for f in source_dir.iterdir() if f.suffix.lower() == '.json']
        docs = []

        for json_path in json_paths:
            try:
                with open(json_path, 'r', encoding='utf-8') as json_file:
                    json_str = json_file.read()
                    doc = Document.parse_raw(json_str)
                    docs.append(doc)
            except Exception as e:
                my_cprint(f"Error loading {json_path}: {e}", "red")

        return docs

    def clear_docs_for_db_folder(self):
        for item in self.SOURCE_DIRECTORY.iterdir():
            if item.is_file() or item.is_symlink():
                try:
                    item.unlink()
                except Exception as e:
                    print(f"Failed to delete {item}: {e}")

    def run(self):
        config_data = self.load_config(self.ROOT_DIRECTORY)
        EMBEDDING_MODEL_NAME = config_data.get("EMBEDDING_MODEL_NAME")

        # load non-image/non-audio documents
        documents = load_documents(self.SOURCE_DIRECTORY)

        # load image documents
        image_documents = specify_image_loader()
        documents.extend(image_documents)

        json_docs_to_save = documents

        # load audio documents
        audio_documents = self.load_audio_documents()  # Now calling the method internally
        documents.extend(audio_documents)
        if len(audio_documents) > 0:
            print(f"Loaded {len(audio_documents)} audio transcription(s)...")

        # split each document in the list of documents
        texts = split_documents(documents)

        # initialize vector model
        embeddings, encode_kwargs = self.initialize_vector_model(EMBEDDING_MODEL_NAME, config_data)

        # create database
        self.create_database(texts, embeddings)

        self.save_documents_to_json(json_docs_to_save)

        del embeddings.client
        del embeddings
        torch.cuda.empty_cache()
        gc.collect()
        my_cprint("Embedding model removed from memory.", "red")

        # clear ingest folder
        self.clear_docs_for_db_folder()
        print("Cleared all files and symlinks in Docs_for_DB folder.")
```

This works fine when using sentence-transformers==2.2.2. However, when I upgrade to sentence-transformers==2.6.1 I get this error:

ERROR

```
Traceback (most recent call last):
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\gui_tabs_databases.py", line 23, in run
    create_vector_db.run() # calls database_interactions.py
    ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\database_interactions.py", line 193, in run
    embeddings, encode_kwargs = self.initialize_vector_model(EMBEDDING_MODEL_NAME, config_data)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\database_interactions.py", line 72, in initialize_vector_model
    model = HuggingFaceInstructEmbeddings(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\langchain_community\embeddings\huggingface.py", line 153, in __init__
    self.client = INSTRUCTOR(
                  ^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\sentence_transformers\SentenceTransformer.py", line 191, in __init__
    modules = self._load_sbert_model(
              ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: INSTRUCTOR._load_sbert_model() got an unexpected keyword argument 'token'
```

I've verified that when using a BGE model (via HuggingFaceBgeEmbeddings), GTE model (via HuggingFaceEmbeddings) and all-mpnet-base-v2 (via HuggingFaceEmbeddings) everything works fine. I've tried every which way to get it to work...

Since I really like the "instructor" models in my program, this forces me to stay at sentence-transformers==2.2.2 or, alternatively, abandon them in order to upgrade so I can use newer models (e.g. mxbai-embed-large-v1). I wouldn't normally ask, but I've spent dozens of hours trying to solve this...ranging from using SentenceTransformers directly pursuant to the API on your website to custom wrappers, etc.

Can anyone help, @tomaarsen in particular if he has time? I don't know whether this is an issue with sentence-transformers itself, its integration with HuggingFaceInstructEmbeddings from Langchain, or just my code...Thanks in advance!

[EDIT] I am aware that Instructor models are unique in that the prompt is not included in pooling, as stated on your website's instructions/examples, and I DID examine SentenceTransformers itself and see where you took that into account:

        if model_name_or_path in ("hkunlp/instructor-base", "hkunlp/instructor-large", "hkunlp/instructor-xl"):
            self.set_pooling_include_prompt(include_prompt=False)
        elif (
            model_name_or_path
            and "/" in model_name_or_path
            and "instructor" in model_name_or_path.split("/")[1].lower()
        ):
            if any([module.include_prompt for module in self if isinstance(module, Pooling)]):
                logger.warning(
                    "Instructor models require `include_prompt=False` in the pooling configuration. "
                    "Either update the model configuration or call `model.set_pooling_include_prompt(False)` after loading the model."
                )

(taken from version 2.6.0)
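
For anyone unfamiliar with what "the prompt is not included in pooling" means, here is a toy sketch in plain Python. It is not the sentence-transformers implementation: `mean_pool`, `prompt_length`, and the token vectors are made up for illustration, and only the `include_prompt` name mirrors the snippet above.

```python
def mean_pool(token_vectors, prompt_length=0, include_prompt=True):
    """Average token vectors, optionally skipping the leading prompt tokens.

    Hypothetical helper for illustration only; real models pool tensors
    with attention masks rather than plain lists.
    """
    start = 0 if include_prompt else prompt_length
    kept = token_vectors[start:]
    if not kept:
        raise ValueError("no tokens left to pool")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two made-up "instruction" tokens followed by two made-up "text" tokens.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 4.0]]

with_prompt = mean_pool(tokens, prompt_length=2, include_prompt=True)
without_prompt = mean_pool(tokens, prompt_length=2, include_prompt=False)

print(with_prompt)     # instruction tokens pull the average toward themselves
print(without_prompt)  # only the text tokens contribute
```

With `include_prompt=False` the instruction still influences the text tokens inside a real transformer through attention; it just does not contribute its own vectors to the final average.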

I just simply can't figure out why HuggingFaceInstructEmbeddings isn't working while HuggingFaceEmbeddings and HuggingFaceBgeEmbeddings work fine when I pip install sentence-transformers above 2.2.2...

This is literally the only issue that has stymied my program from upgrading the crucial dependency that is sentence-transformers...Thanks again and love the repo!

tomaarsen commented 8 months ago

Hello!

The issue originates in https://github.com/xlang-ai/instructor-embedding, which was created explicitly for Sentence Transformers 2.2.2. They haven't kept their code up to date with the recent Sentence Transformer updates, hence the failures. This is why HuggingFaceInstructEmbeddings fails while HuggingFaceEmbeddings and HuggingFaceBgeEmbeddings work.
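
The failure mode can be reproduced without either library. The sketch below uses stand-in classes (`Base`, `LegacySubclass`, `PatchedSubclass`, and `_load_model` are all hypothetical names, not the real sentence-transformers or InstructorEmbedding code) to show how a subclass that overrides a loader hook with an older signature breaks once the parent starts passing a new keyword such as `token`.

```python
class Base:
    """Stand-in for a newer parent class whose loader hook gained `token`."""

    def __init__(self, model_name_or_path, token=None):
        # The newer parent forwards `token` to the overridable hook.
        self.modules = self._load_model(model_name_or_path, token=token)

    def _load_model(self, model_name_or_path, token=None):
        return [f"loaded:{model_name_or_path}"]


class LegacySubclass(Base):
    """Stand-in for a subclass written against the older hook signature."""

    def _load_model(self, model_name_or_path):  # no `token` parameter
        return [f"legacy-loaded:{model_name_or_path}"]


class PatchedSubclass(Base):
    """Same override, updated to accept the new keyword (and future ones)."""

    def _load_model(self, model_name_or_path, token=None, **kwargs):
        return [f"patched-loaded:{model_name_or_path}"]


def try_load(cls, name):
    """Instantiate `cls` and return its modules, or the TypeError message."""
    try:
        return cls(name).modules
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_load(LegacySubclass, "some-model"))   # fails on the `token` keyword
print(try_load(PatchedSubclass, "some-model"))  # loads normally
```

The other LangChain wrappers avoid this because they use the parent class directly rather than a subclass with an overridden loader.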

A good solution would be to try this PR: https://github.com/xlang-ai/instructor-embedding/pull/112 with:

pip install git+https://github.com/SilasMarvin/instructor-embedding.git@silas-update-for-newer-sentence-transformers

and the most recent sentence-transformers. That combination should work correctly.

BBC-Esq commented 8 months ago

Thanks, I checked it out. Now I'm getting the error below. My program downloads the instructor models into a specific directory. It does not use the default "cache" location. I do this for various reasons. As such, I specify the path to the model rather than the Huggingface repo ID when instantiating the model...I'm guessing this is the reason why I'm getting this error...Any clue?

Traceback (most recent call last):
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\gui_tabs_databases.py", line 23, in run
    create_vector_db.run() # calls database_interactions.py
    ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\database_interactions.py", line 198, in run
    embeddings, encode_kwargs = self.initialize_vector_model(EMBEDDING_MODEL_NAME, config_data)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\database_interactions.py", line 70, in initialize_vector_model
    model = HuggingFaceInstructEmbeddings(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\langchain_community\embeddings\huggingface.py", line 158, in __init__
    self.client = INSTRUCTOR(
                  ^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\sentence_transformers\SentenceTransformer.py", line 191, in __init__
    modules = self._load_sbert_model(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\InstructorEmbedding\instructor.py", line 455, in _load_sbert_model
    model_path = snapshot_download(**download_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\huggingface_hub\utils\_validators.py", line 111, in _inner_fn
    validate_repo_id(arg_value)
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\huggingface_hub\utils\_validators.py", line 159, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'D:/Scripts/ChromaDB-Plugin-for-LM-Studio/v4_3 - working/Embedding_Models/hkunlp--instructor-base'. Use `repo_type` argument if needed.

BBC-Esq commented 8 months ago

I resolved this error by using the huggingface repo id instead:

hkunlp/instructor-base

I'm guessing this IS NOT a true fix, however, since I notice that the "_load_sbert_model" method within sentence-transformers has a parameter named model_name_or_path...implying that it'll accept a repo id or path. Here's my code snippet:

        if "instructor" in embedding_model_name:
            encode_kwargs['show_progress_bar'] = True

            model = HuggingFaceInstructEmbeddings(
                model_name="hkunlp/instructor-base",
                model_kwargs=model_kwargs,
                encode_kwargs=encode_kwargs,
            )

To temporarily sidestep the issue, I simply hardcoded "hkunlp/instructor-base" in place of embedding_model_name...I did this to get to the next troubleshooting step for the time being...

IT WORKED! The database was successfully created. MOREOVER, I was able to successfully search it!
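
For context on the HFValidationError above: the traceback shows the local directory path being handed to snapshot_download, which validates its argument as a Hub repo id. A loader that wants to accept both forms needs some check along the lines of the sketch below. This is only an illustration, and `resolve_model_source` is a hypothetical helper, not a function from sentence-transformers, InstructorEmbedding, or huggingface_hub.

```python
import os
import tempfile


def resolve_model_source(model_name_or_path):
    """Classify the input as a local model directory or a Hub repo id.

    Hypothetical helper: returns ("local", absolute_path) when the string
    names an existing directory, otherwise ("hub", repo_id) unchanged.
    """
    if os.path.isdir(model_name_or_path):
        return ("local", os.path.abspath(model_name_or_path))
    return ("hub", model_name_or_path)


# An existing directory is treated as a local model and never sent to the Hub.
with tempfile.TemporaryDirectory() as local_dir:
    print(resolve_model_source(local_dir)[0])

# Anything else is passed through as a repo id for download.
print(resolve_model_source("hkunlp/instructor-base"))
```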

SUMMARY:

The branch at https://github.com/SilasMarvin/instructor-embedding/tree/silas-update-for-newer-sentence-transformers fixes the error TypeError: INSTRUCTOR._load_sbert_model() got an unexpected keyword argument 'token'.

Question: Are you willing to modify SentenceTransformer's _load_sbert_model method such that it works with the original instructorembedding library? That would make it unnecessary to rely on the modification by SilasMarvin. I only ask because instructorembedding is obviously not being updated even though it's their responsibility to do so...

It seems to me (as a lay person) that you'd simply need an intermediary function between how the instructorembedding library expects to load a model and how sentence-transformers does it now. This would also make huggingfaceinstructembeddings from langchain work as-is. In the interest of full disclosure, I reviewed the huggingfaceinstructembeddings class within Langchain's source code and, just like instructorembedding, it hasn't been updated in eons so...

Basically, even though it's the responsibility of the instructorembedding and/or Langchain maintainers to keep their code compatible with sentence-transformers, I'm asking if sentence-transformers would accommodate them and provide a fix in its source code instead?

The benefit would be that Instructor models would work with newer versions of the sentence-transformers library out of the box and people like me could still use pip install instructorembedding instead of relying on a specific branch of an unofficial fork of the instructorembedding repo. Doesn't hurt to ask, right?

Thanks again. Please let me know if there's a way I can contribute.

BBC-Esq commented 8 months ago

Regarding the error of not being able to load a model locally, I finally solved it by using the cache_folder parameter from langchain specified here:

https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.huggingface.HuggingFaceInstructEmbeddings.html#langchain_community.embeddings.huggingface.HuggingFaceInstructEmbeddings

I assume that this connects with the cache_folder parameter within sentence-transformers here:

https://www.sbert.net/docs/package_reference/SentenceTransformer.html

So this narrow issue, at least, seems solved. Just thought others might want to know.
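
A rough sketch of that setup, for anyone who wants a starting point. It passes the Hub repo id as `model_name` and points `cache_folder` at a custom directory. The `Embedding_Models` directory name, the helper names `instructor_kwargs` and `build_embeddings`, and the batch size are assumptions for illustration; whether `cache_folder` is forwarded to sentence-transformers exactly as expected is the assumption stated above, not something verified here.

```python
from pathlib import Path


def instructor_kwargs(models_dir, device="cpu"):
    """Build constructor arguments for HuggingFaceInstructEmbeddings.

    Hypothetical helper: uses the Hub repo id for `model_name` and a custom
    download/cache directory instead of a raw filesystem path to the model.
    """
    return {
        "model_name": "hkunlp/instructor-base",
        "cache_folder": str(Path(models_dir)),
        "model_kwargs": {"device": device},
        "encode_kwargs": {"normalize_embeddings": False, "batch_size": 2},
    }


def build_embeddings(models_dir, device="cpu"):
    """Create the embeddings object (needs langchain-community,
    InstructorEmbedding, and sentence-transformers to be installed)."""
    # Imported lazily so the kwargs helper can be used without the packages.
    from langchain_community.embeddings import HuggingFaceInstructEmbeddings

    return HuggingFaceInstructEmbeddings(**instructor_kwargs(models_dir, device))


print(instructor_kwargs("Embedding_Models"))
```

Calling `build_embeddings("Embedding_Models")` would then download the model into that directory on first use, if the assumption about `cache_folder` holds.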

SilasMarvin commented 7 months ago

The fix for this just got merged into Instructor Embedding: https://github.com/xlang-ai/instructor-embedding/commit/5cca65eb0ed78ab354b086a5386fb2c528809caa