deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Confusing behavior of document store update_embeddings with EmbeddingRetriever #1701

Closed · lambdaofgod closed this issue 3 years ago

lambdaofgod commented 3 years ago

I'm trying to use EmbeddingRetriever to compute embeddings for my documents.

I supply docs as a list of dicts with the following fields: dict_keys(['repo', 'tasks', 'Unnamed: 0', 'repo_name', 'path', 'function_name', 'text', 'content'])

When I run the code

from haystack.document_stores import InMemoryDocumentStore
from haystack.retriever.dense import EmbeddingRetriever 
document_store = InMemoryDocumentStore()

document_store.write_documents(docs)
retriever = EmbeddingRetriever(embedding_model="microsoft/codebert-base", document_store=document_store)
document_store.update_embeddings(retriever)

It outputs (here I used fewer documents because it takes some time)

Duplicate Documents: Document with id 'eafc79d9ec2b51963d47475c73c84fc9' already exists in index 'document'
Currently no support in Processor for returning problematic ids
ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
Updating Embedding:   0%|          | 0/1 [00:00<?, ? docs/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 91.63 Batches/s]
Documents Processed: 10000 docs [00:00, 641262.25 docs/s]

But when I then check the embeddings, it looks like nothing was added: document_store.get_embedding_count() gives

0

Why is that? Am I using EmbeddingRetriever the wrong way?

If so, what is the correct way to use it?

Also, why does it seem like update_embeddings computes the embeddings itself? The progress bar clearly depends on the number of supplied documents.

julian-risch commented 3 years ago

Hi @lambdaofgod I see the message Duplicate Documents: Document with id 'eafc79d9ec2b51963d47475c73c84fc9' already exists in index 'document', so there is definitely at least one document stored in your document store. Could you please post one document as an example? You can abbreviate the content in the different fields. I'm just wondering why there are both text and content fields: with one of the most recent changes on the master branch we now use content, whereas before we used text. I would like to try to reproduce your error. Which version of haystack are you using, the latest master branch or one of the releases?
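(For context on the duplicate warning: document ids like the 32-hex-character one above are derived by hashing the document, so writing the same content twice produces the same id. A minimal sketch of that idea, not the actual haystack implementation:)

```python
import hashlib

def doc_id(content: str) -> str:
    # Derive a deterministic id from the document content. This mirrors
    # the idea (not the exact haystack code) behind content-hashed ids:
    # writing identical content twice yields the same id, which is what
    # triggers the "Duplicate Documents" warning.
    return hashlib.md5(content.encode("utf-8")).hexdigest()

print(doc_id("def f(): pass") == doc_id("def f(): pass"))  # True
print(doc_id("def f(): pass") == doc_id("def g(): pass"))  # False
```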

lambdaofgod commented 3 years ago

I've deleted the text field and it still doesn't work. My hypothesis is that something is wrong with the model: it's a RoBERTa model trained for language modeling, not for NLI or anything that explicitly makes it output a single vector per sentence.
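(Side note on that hypothesis: a model like codebert-base that was trained only for masked language modeling emits one vector per token; sentence-level retrieval typically reduces these to a single fixed-size vector by mean pooling. An illustrative sketch with plain Python lists, not the haystack internals:)

```python
def mean_pool(token_vectors):
    # Average the per-token vectors into one fixed-size sentence vector.
    # This is a common way to get sentence embeddings from models that
    # were not trained to output one directly.
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 tokens, 2-dim each
print(mean_pool(tokens))  # [3.0, 4.0]
```

Whether the pooled vector is useful for retrieval still depends on what the model was trained for, which is the crux of the hypothesis above.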

Documents (note that I used pprint, so the content field got split across lines):

[
{'content': 'class ImportGraph():\n'
            "    '  Importing and running isolated TF graph '\n"
            '\n'
            '    def __init__(self, loc, model_path=None):\n'
            '        self.graph = tf.Graph()\n'
            '        self.sess = tf.Session(graph=self.graph)\n'
            '        with self.graph.as_default():\n'
            '            ckpt = tf.train.get_checkpoint_state(loc)\n'
            '            if (ckpt and ckpt.model_checkpoint_path):\n'
            '                if (model_path is None):\n'
            '                    ckpt_name = ckpt.model_checkpoint_path\n'
            '                else:\n'
            "                    ckpt_name = ((loc + '/') + model_path)\n"
            '                self.saver = '
            "tf.train.import_meta_graph((ckpt_name + '.meta'))\n"
            '                self.saver.restore(self.sess, ckpt_name)\n'
            '\n'
            '    def get_variable_value(self, var_name):\n'
            '        with self.graph.as_default():\n'
            '            vars = [v for v in tf.trainable_variables() if '
            '(v.name == var_name)][0]\n'
            '            values = self.sess.run(vars)\n'
            '        return values\n'
            '\n'
            '    def close_session(self):\n'
            '        self.sess.close()',
 'function_name': 'ImportGraph',
 'path': 'translate/import_graph.py',
 'repo_name': 'trangvu/ape-npi'},
{'content': 'def load_checkpoint(sess, checkpoint_dir, filename, variables):\n'
            '    if (filename is not None):\n'
            "        ckpt_file = ((checkpoint_dir + '/') + filename)\n"
            "        utils.log('reading model parameters from "
            "{}'.format(ckpt_file))\n"
            '        tf.train.Saver(variables).restore(sess, ckpt_file)\n'
            "        utils.debug('retrieved parameters "
            "({})'.format(len(variables)))\n"
            '        for var in sorted(variables, key=(lambda var: '
            'var.name)):\n'
            "            utils.debug('  {} {}'.format(var.name, "
            'var.get_shape()))',
 'function_name': 'load_checkpoint',
 'path': 'translate/import_graph.py',
 'repo_name': 'trangvu/ape-npi'}
]

Raw documents for reproduction:

[{'repo_name': 'trangvu/ape-npi', 'path': 'translate/import_graph.py', 'function_name': 'ImportGraph', 'content': "class ImportGraph():\n    '  Importing and running isolated TF graph '\n\n    def __init__(self, loc, model_path=None):\n        self.graph = tf.Graph()\n        self.sess = tf.Session(graph=self.graph)\n        with self.graph.as_default():\n            ckpt = tf.train.get_checkpoint_state(loc)\n            if (ckpt and ckpt.model_checkpoint_path):\n                if (model_path is None):\n                    ckpt_name = ckpt.model_checkpoint_path\n                else:\n                    ckpt_name = ((loc + '/') + model_path)\n                self.saver = tf.train.import_meta_graph((ckpt_name + '.meta'))\n                self.saver.restore(self.sess, ckpt_name)\n\n    def get_variable_value(self, var_name):\n        with self.graph.as_default():\n            vars = [v for v in tf.trainable_variables() if (v.name == var_name)][0]\n            values = self.sess.run(vars)\n        return values\n\n    def close_session(self):\n        self.sess.close()"}, {'repo_name': 'trangvu/ape-npi', 'path': 'translate/import_graph.py', 'function_name': 'load_checkpoint', 'content': "def load_checkpoint(sess, checkpoint_dir, filename, variables):\n    if (filename is not None):\n        ckpt_file = ((checkpoint_dir + '/') + filename)\n        utils.log('reading model parameters from {}'.format(ckpt_file))\n        tf.train.Saver(variables).restore(sess, ckpt_file)\n        utils.debug('retrieved parameters ({})'.format(len(variables)))\n        for var in sorted(variables, key=(lambda var: var.name)):\n            utils.debug('  {} {}'.format(var.name, var.get_shape()))"}]
brandenchan commented 3 years ago

Hi @lambdaofgod , I am using the latest master, and running this code.

docs = [{'repo_name': 'trangvu/ape-npi', 'path': 'translate/import_graph.py', 'function_name': 'ImportGraph', 'content': "class ImportGraph():\n    '  Importing and running isolated TF graph '\n\n    def __init__(self, loc, model_path=None):\n        self.graph = tf.Graph()\n        self.sess = tf.Session(graph=self.graph)\n        with self.graph.as_default():\n            ckpt = tf.train.get_checkpoint_state(loc)\n            if (ckpt and ckpt.model_checkpoint_path):\n                if (model_path is None):\n                    ckpt_name = ckpt.model_checkpoint_path\n                else:\n                    ckpt_name = ((loc + '/') + model_path)\n                self.saver = tf.train.import_meta_graph((ckpt_name + '.meta'))\n                self.saver.restore(self.sess, ckpt_name)\n\n    def get_variable_value(self, var_name):\n        with self.graph.as_default():\n            vars = [v for v in tf.trainable_variables() if (v.name == var_name)][0]\n            values = self.sess.run(vars)\n        return values\n\n    def close_session(self):\n        self.sess.close()"}, {'repo_name': 'trangvu/ape-npi', 'path': 'translate/import_graph.py', 'function_name': 'load_checkpoint', 'content': "def load_checkpoint(sess, checkpoint_dir, filename, variables):\n    if (filename is not None):\n        ckpt_file = ((checkpoint_dir + '/') + filename)\n        utils.log('reading model parameters from {}'.format(ckpt_file))\n        tf.train.Saver(variables).restore(sess, ckpt_file)\n        utils.debug('retrieved parameters ({})'.format(len(variables)))\n        for var in sorted(variables, key=(lambda var: var.name)):\n            utils.debug('  {} {}'.format(var.name, var.get_shape()))"}]

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever
document_store = InMemoryDocumentStore()

document_store.write_documents(docs)
retriever = EmbeddingRetriever(embedding_model="microsoft/codebert-base", document_store=document_store)
document_store.update_embeddings(retriever)

My console output looks like this:

Currently no support in Processor for returning problematic ids
ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
Using devices: CUDA
Number of GPUs: 1
Updating embeddings for 2 docs ...
Updating Embedding:   0%|          | 0/2 [00:00<?, ? docs/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 13.88 Batches/s]
Documents Processed: 10000 docs [00:00, 111137.42 docs/s]    

And when I call document_store.get_all_documents(return_embeddings=True) I see 2 documents.

Could you try this and see if you get the same results?

lambdaofgod commented 3 years ago

I tried both installing from master (pip install git+https://github.com/deepset-ai/haystack) and the package from pip, and for both of them

document_store.get_all_documents(return_embeddings=True) gives

TypeError: get_all_documents() got an unexpected keyword argument 'return_embeddings'

julian-risch commented 3 years ago

Hi @lambdaofgod I think that should be return_embedding without the s: https://github.com/deepset-ai/haystack/blob/8082549663f741d966791625ed7b78f0e2113c3b/haystack/document_stores/memory.py#L326
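(The TypeError above is just Python's ordinary keyword-argument checking; a tiny stand-in with a hypothetical, simplified signature shows the behavior:)

```python
def get_all_documents(index=None, return_embedding=False):
    # Hypothetical, simplified stand-in for the document store method.
    # The real parameter is singular (return_embedding), so the plural
    # spelling is rejected before the method body even runs.
    return []

try:
    get_all_documents(return_embeddings=True)
except TypeError as err:
    print(err)  # unexpected keyword argument 'return_embeddings'

get_all_documents(return_embedding=True)  # correct keyword, no error
```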