Hi @lambdaofgod I see the message Duplicate Documents: Document with id 'eafc79d9ec2b51963d47475c73c84fc9' already exists in index 'document', so there is definitely at least one document stored in your document store. Could you please post one document as an example? You can abbreviate the content in the different fields. I'm just wondering why there is both 'text' and 'content'. With one of the most recent changes in the master branch, we are now using 'content', but before that we were using 'text'. I would like to try to reproduce your error. Which version of haystack are you using? Latest master branch or one of the releases?
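In the meantime, if the duplicate message is blocking you, one possible workaround is to clear the index and rewrite the documents with the old 'text' field mapped to 'content'. A minimal sketch, assuming a recent master where write_documents accepts a duplicate_documents option:

# Hedged sketch: map the legacy 'text' field to 'content' and rewrite.
cleaned = []
for d in docs:
    d = dict(d)                                   # copy so the original dicts stay intact
    d.setdefault('content', d.pop('text', None))  # keep existing 'content', drop 'text'
    cleaned.append(d)
document_store.delete_documents()                 # empty the 'document' index first
document_store.write_documents(cleaned, duplicate_documents='overwrite')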
I've deleted the 'text' field and it still doesn't work.
My hypothesis is that something is wrong with the model - it's a RoBERTa model for language modeling; it wasn't trained for NLI or anything else that would explicitly make it output a single vector per sentence.
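One way I could sanity-check that outside of haystack is to pool the token embeddings myself; a minimal sketch with the transformers library (mean pooling is just my assumption here, not necessarily what EmbeddingRetriever does with this model):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

inputs = tokenizer("def foo(): pass", return_tensors="pt", truncation=True)
with torch.no_grad():
    token_states = model(**inputs).last_hidden_state  # (1, seq_len, 768)
sentence_vec = token_states.mean(dim=1).squeeze(0)    # naive mean pooling
print(sentence_vec.shape)                             # torch.Size([768])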
Documents (note that I used pprint, so the 'content' field got split across lines):
[
{'content': 'class ImportGraph():\n'
" ' Importing and running isolated TF graph '\n"
'\n'
' def __init__(self, loc, model_path=None):\n'
' self.graph = tf.Graph()\n'
' self.sess = tf.Session(graph=self.graph)\n'
' with self.graph.as_default():\n'
' ckpt = tf.train.get_checkpoint_state(loc)\n'
' if (ckpt and ckpt.model_checkpoint_path):\n'
' if (model_path is None):\n'
' ckpt_name = ckpt.model_checkpoint_path\n'
' else:\n'
" ckpt_name = ((loc + '/') + model_path)\n"
' self.saver = '
"tf.train.import_meta_graph((ckpt_name + '.meta'))\n"
' self.saver.restore(self.sess, ckpt_name)\n'
'\n'
' def get_variable_value(self, var_name):\n'
' with self.graph.as_default():\n'
' vars = [v for v in tf.trainable_variables() if '
'(v.name == var_name)][0]\n'
' values = self.sess.run(vars)\n'
' return values\n'
'\n'
' def close_session(self):\n'
' self.sess.close()',
'function_name': 'ImportGraph',
'path': 'translate/import_graph.py',
'repo_name': 'trangvu/ape-npi'},
{'content': 'def load_checkpoint(sess, checkpoint_dir, filename, variables):\n'
' if (filename is not None):\n'
" ckpt_file = ((checkpoint_dir + '/') + filename)\n"
" utils.log('reading model parameters from "
"{}'.format(ckpt_file))\n"
' tf.train.Saver(variables).restore(sess, ckpt_file)\n'
" utils.debug('retrieved parameters "
"({})'.format(len(variables)))\n"
' for var in sorted(variables, key=(lambda var: '
'var.name)):\n'
" utils.debug(' {} {}'.format(var.name, "
'var.get_shape()))',
'function_name': 'load_checkpoint',
'path': 'translate/import_graph.py',
'repo_name': 'trangvu/ape-npi'}
]
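A side note on these fields: if I read the current master correctly, dict keys other than 'content' and 'id' end up in the document's meta when written, so a hypothetical check like this should show them:

doc = document_store.get_all_documents()[0]
print(doc.meta.get('function_name'), doc.meta.get('path'))  # extra keys land in meta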
Raw documents for reproduction:
[{'repo_name': 'trangvu/ape-npi', 'path': 'translate/import_graph.py', 'function_name': 'ImportGraph', 'content': "class ImportGraph():\n ' Importing and running isolated TF graph '\n\n def __init__(self, loc, model_path=None):\n self.graph = tf.Graph()\n self.sess = tf.Session(graph=self.graph)\n with self.graph.as_default():\n ckpt = tf.train.get_checkpoint_state(loc)\n if (ckpt and ckpt.model_checkpoint_path):\n if (model_path is None):\n ckpt_name = ckpt.model_checkpoint_path\n else:\n ckpt_name = ((loc + '/') + model_path)\n self.saver = tf.train.import_meta_graph((ckpt_name + '.meta'))\n self.saver.restore(self.sess, ckpt_name)\n\n def get_variable_value(self, var_name):\n with self.graph.as_default():\n vars = [v for v in tf.trainable_variables() if (v.name == var_name)][0]\n values = self.sess.run(vars)\n return values\n\n def close_session(self):\n self.sess.close()"}, {'repo_name': 'trangvu/ape-npi', 'path': 'translate/import_graph.py', 'function_name': 'load_checkpoint', 'content': "def load_checkpoint(sess, checkpoint_dir, filename, variables):\n if (filename is not None):\n ckpt_file = ((checkpoint_dir + '/') + filename)\n utils.log('reading model parameters from {}'.format(ckpt_file))\n tf.train.Saver(variables).restore(sess, ckpt_file)\n utils.debug('retrieved parameters ({})'.format(len(variables)))\n for var in sorted(variables, key=(lambda var: var.name)):\n utils.debug(' {} {}'.format(var.name, var.get_shape()))"}]
Hi @lambdaofgod, I am using the latest master and running this code:
docs = [{'repo_name': 'trangvu/ape-npi', 'path': 'translate/import_graph.py', 'function_name': 'ImportGraph', 'content': "class ImportGraph():\n ' Importing and running isolated TF graph '\n\n def __init__(self, loc, model_path=None):\n self.graph = tf.Graph()\n self.sess = tf.Session(graph=self.graph)\n with self.graph.as_default():\n ckpt = tf.train.get_checkpoint_state(loc)\n if (ckpt and ckpt.model_checkpoint_path):\n if (model_path is None):\n ckpt_name = ckpt.model_checkpoint_path\n else:\n ckpt_name = ((loc + '/') + model_path)\n self.saver = tf.train.import_meta_graph((ckpt_name + '.meta'))\n self.saver.restore(self.sess, ckpt_name)\n\n def get_variable_value(self, var_name):\n with self.graph.as_default():\n vars = [v for v in tf.trainable_variables() if (v.name == var_name)][0]\n values = self.sess.run(vars)\n return values\n\n def close_session(self):\n self.sess.close()"}, {'repo_name': 'trangvu/ape-npi', 'path': 'translate/import_graph.py', 'function_name': 'load_checkpoint', 'content': "def load_checkpoint(sess, checkpoint_dir, filename, variables):\n if (filename is not None):\n ckpt_file = ((checkpoint_dir + '/') + filename)\n utils.log('reading model parameters from {}'.format(ckpt_file))\n tf.train.Saver(variables).restore(sess, ckpt_file)\n utils.debug('retrieved parameters ({})'.format(len(variables)))\n for var in sorted(variables, key=(lambda var: var.name)):\n utils.debug(' {} {}'.format(var.name, var.get_shape()))"}]
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)
retriever = EmbeddingRetriever(embedding_model="microsoft/codebert-base", document_store=document_store)
document_store.update_embeddings(retriever)
My console output looks like this:
Currently no support in Processor for returning problematic ids
ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
Using devices: CUDA
Number of GPUs: 1
Updating embeddings for 2 docs ...
Updating Embedding: 0%| | 0/2 [00:00<?, ? docs/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 13.88 Batches/s]
Documents Processed: 10000 docs [00:00, 111137.42 docs/s]
And when I call document_store.get_all_documents(return_embeddings=True) I see 2 documents.
Could you try this and see if you get the same results?
I tried both installing from master (pip install git+https://github.com/deepset-ai/haystack) and the package from pip, and for both of them document_store.get_all_documents(return_embeddings=True) gives
TypeError: get_all_documents() got an unexpected keyword argument 'return_embeddings'
Hi @lambdaofgod I think that should be return_embedding without the 's': https://github.com/deepset-ai/haystack/blob/8082549663f741d966791625ed7b78f0e2113c3b/haystack/document_stores/memory.py#L326
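With the corrected keyword, the check would look roughly like this (a sketch; Document.embedding should be populated once update_embeddings has run):

docs_out = document_store.get_all_documents(return_embedding=True)
print(document_store.get_embedding_count())  # should match the number of docs
print(docs_out[0].embedding[:5])             # first few dimensions of the stored vector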
I'm trying to use EmbeddingRetriever to compute embeddings for my documents.
I supply docs as a list of dicts with the following fields: dict_keys(['repo', 'tasks', 'Unnamed: 0', 'repo_name', 'path', 'function_name', 'text', 'content'])
When I run the code, it outputs (here I used fewer documents because it takes some time):
And then, when I check the embeddings, it looks like nothing was added:
document_store.get_embedding_count()
Why is this so? Is this a wrong way of using EmbeddingRetriever? If so, how should I use it?
Also, why does it seem like update_embeddings computes the embeddings? The progress bar clearly depends on the number of supplied documents.
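For what it's worth, as far as I can tell from the current master, update_embeddings is exactly where the model forward pass happens: the store fetches its documents and asks the retriever to embed them, which is why the progress bar scales with the number of documents. A simplified sketch of the flow, not the actual haystack source:

# Rough view of what document_store.update_embeddings(retriever) does:
documents = document_store.get_all_documents()
embeddings = retriever.embed_documents(documents)  # model inference runs here
for doc, emb in zip(documents, embeddings):
    doc.embedding = emb                            # vectors written back to the store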