deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Dense Passage Retrieval (ConnectionTimeoutError) #644

Closed Freemanlabs closed 3 years ago

Freemanlabs commented 3 years ago

With the same Elasticsearch settings, I can successfully retrieve prediction answers with BM25 and TFIDF. However, when I try DPR, I get a ConnectionTimeoutError.

How do I resolve this?

Freemanlabs commented 3 years ago

First I get `RequestError: 400...`; if I try again, I then see a ConnectionTimeoutError.

tholor commented 3 years ago

Please try to increase the timeout for elasticsearch via:

`es_store = ElasticsearchDocumentStore(..., timeout=3000)`

Are you running this on Colab? Elasticsearch might be very slow there as Colab only provides a single CPU core...
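
For reference, a minimal sketch of what that looks like (host, index, and similarity values here are placeholders, not taken from this issue; adjust them to your setup):

```python
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

# Placeholder values for illustration; adjust host/index to your Elasticsearch setup.
es_store = ElasticsearchDocumentStore(
    host="localhost",
    port=9200,
    index="document",
    similarity="dot_product",  # DPR embeddings are trained for dot-product similarity
    timeout=3000,              # raise the client-side timeout well above the default of 30
)
```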

Freemanlabs commented 3 years ago

I see, I will try that. Although, my runtime type is GPU, meaning I am using a GPU on Colab...

Also, do you have any idea why I get this `RequestError: 400...` initially?

tholor commented 3 years ago

Although my runtime type is GPU

Elasticsearch cannot benefit from a GPU and will always run on the CPU only.

do you have any idea why I get this RequestError: 400...

Can you please provide more context / a script to reproduce this error? My first intuition would be that Elasticsearch is probably still starting up, or still busy indexing the added documents / embeddings.
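
As a quick sanity check before querying, one option is to confirm that Elasticsearch is up and has finished indexing, for example (a rough sketch; "document" is Haystack's default index name and may differ in your setup):

```python
import requests

# Cluster health: a connection error or a "red" status means Elasticsearch
# is not ready to serve queries yet.
health = requests.get("http://localhost:9200/_cluster/health").json()
print(health["status"])

# Number of documents in the index ("document" is Haystack's default index name).
count = requests.get("http://localhost:9200/document/_count").json()
print(count["count"])
```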

Freemanlabs commented 3 years ago

This is the line causing issues: `prediction = finder.get_answers(question=que["question"], top_k_retriever=10, top_k_reader=5)`

This is the error I get when I run the above line for the first time after starting and connecting to my server: `RequestError: RequestError(400, 'search_phase_execution_exception', 'runtime error')`

After I get this error, I simply run the cell again, and then I get this: `ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=300))`

I did as you directed and set `timeout=3000`. Still no luck.

lalitpagaria commented 3 years ago

It is possible that any one of the issues discussed in https://github.com/elastic/elasticsearch/issues/8084 may be causing it.

Could you please share end-to-end logs so we can debug further?

Freemanlabs commented 3 years ago

Logs: this is all I see (from Colab):

```
timeout                                   Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    420                 # Otherwise it looks like a bug in the code.
--> 421                 six.raise_from(e, None)
    422         except (SocketTimeout, BaseSSLError, SocketError) as e:

21 frames

timeout: timed out

During handling of the above exception, another exception occurred:

ReadTimeoutError                          Traceback (most recent call last)
ReadTimeoutError: HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=300)

During handling of the above exception, another exception occurred:

ConnectionTimeout                         Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/elasticsearch/connection/http_urllib3.py in perform_request(self, method, url, params, body, timeout, ignore, headers)
    255             raise SSLError("N/A", str(e), e)
    256         if isinstance(e, ReadTimeoutError):
--> 257             raise ConnectionTimeout("TIMEOUT", str(e), e)
    258         raise ConnectionError("N/A", str(e), e)
    259

ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=300))
```

lalitpagaria commented 3 years ago

@Freemanlabs Thank you for the update. Could you please expand these 21 frames and share the full stack trace?

> `ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=300))`

Also, in some code path your timeout is still 300, not 3000.

lalitpagaria commented 3 years ago

Also be aware that Google Colab has a disk limit of 108GB, of which only 75GB is available to the user: https://neptune.ai/blog/google-colab-dealing-with-files#:~:text=Also%2C%20Colab%20has%20a%20disk,like%20image%20or%20video%20data.

Freemanlabs commented 3 years ago

Could you please expand these 21 frames and share the full stack trace?

Expanded frames:

```
/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in raise_from(value, from_value)

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    415             try:
--> 416                 httplib_response = conn.getresponse()
    417             except BaseException as e:

/usr/lib/python3.6/http/client.py in getresponse(self)
   1372             try:
-> 1373                 response.begin()
   1374             except ConnectionError:

/usr/lib/python3.6/http/client.py in begin(self)
    310         while True:
--> 311             version, status, reason = self._read_status()
    312             if status != CONTINUE:

/usr/lib/python3.6/http/client.py in _read_status(self)
    271     def _read_status(self):
--> 272         line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
    273         if len(line) > _MAXLINE:

/usr/lib/python3.6/socket.py in readinto(self, b)
    585             try:
--> 586                 return self._sock.recv_into(b)
    587             except timeout:

timeout: timed out

During handling of the above exception, another exception occurred:

ReadTimeoutError                          Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/elasticsearch/connection/http_urllib3.py in perform_request(self, method, url, params, body, timeout, ignore, headers)
    245             response = self.pool.urlopen(
--> 246                 method, url, body, retries=Retry(False), headers=request_headers, **kw
    247             )

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    719             retries = retries.increment(
--> 720                 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    721             )

/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    375             # Disabled, indicate to re-raise the error.
--> 376             raise six.reraise(type(error), error, _stacktrace)
    377

/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in reraise(tp, value, tb)
    734             raise value.with_traceback(tb)
--> 735             raise value
    736         finally:

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    671                 headers=headers,
--> 672                 chunked=chunked,
    673             )

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    422         except (SocketTimeout, BaseSSLError, SocketError) as e:
--> 423             self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
    424             raise

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _raise_timeout(self, err, url, timeout_value)
    330             raise ReadTimeoutError(
--> 331                 self, url, "Read timed out. (read timeout=%s)" % timeout_value
    332             )

ReadTimeoutError: HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=300)

During handling of the above exception, another exception occurred:

ConnectionTimeout                         Traceback (most recent call last)
 in ()
----> 1 prediction = dpr_finder.get_answers(question="who is who?", top_k_retriever=10, top_k_reader=5)

/usr/local/lib/python3.6/dist-packages/haystack/finder.py in get_answers(self, question, top_k_reader, top_k_retriever, filters, index)
     67
     68         # 1) Apply retriever(with optional filters) to get fast candidate documents
---> 69         documents = self.retriever.retrieve(question, filters=filters, top_k=top_k_retriever, index=index)
     70         logger.info(f"Got {len(documents)} candidates from retriever")
     71         logger.debug(f"Retrieved document IDs: {[doc.id for doc in documents]}")

/usr/local/lib/python3.6/dist-packages/haystack/retriever/dense.py in retrieve(self, query, filters, top_k, index)
    140             index = self.document_store.index
    141         query_emb = self.embed_queries(texts=[query])
--> 142         documents = self.document_store.query_by_embedding(query_emb=query_emb[0], top_k=top_k, filters=filters, index=index)
    143         return documents
    144

/usr/local/lib/python3.6/dist-packages/haystack/document_store/elasticsearch.py in query_by_embedding(self, query_emb, filters, top_k, index, return_embedding)
    572
    573         logger.debug(f"Retriever query: {body}")
--> 574         result = self.client.search(index=index, body=body, request_timeout=300)["hits"]["hits"]
    575
    576         documents = [self._convert_es_hit_to_document(hit, adapt_score_for_embedding=True) for hit in result]

/usr/local/lib/python3.6/dist-packages/elasticsearch/client/utils.py in _wrapped(*args, **kwargs)
    150             if p in kwargs:
    151                 params[p] = kwargs.pop(p)
--> 152             return func(*args, params=params, headers=headers, **kwargs)
    153
    154         return _wrapped

/usr/local/lib/python3.6/dist-packages/elasticsearch/client/__init__.py in search(self, body, index, doc_type, params, headers)
   1661             params=params,
   1662             headers=headers,
-> 1663             body=body,
   1664         )
   1665

/usr/local/lib/python3.6/dist-packages/elasticsearch/transport.py in perform_request(self, method, url, headers, params, body)
    390                     raise e
    391                 else:
--> 392                     raise e
    393
    394             else:

/usr/local/lib/python3.6/dist-packages/elasticsearch/transport.py in perform_request(self, method, url, headers, params, body)
    363                     headers=headers,
    364                     ignore=ignore,
--> 365                     timeout=timeout,
    366                 )
    367

/usr/local/lib/python3.6/dist-packages/elasticsearch/connection/http_urllib3.py in perform_request(self, method, url, params, body, timeout, ignore, headers)
    255             raise SSLError("N/A", str(e), e)
    256         if isinstance(e, ReadTimeoutError):
--> 257             raise ConnectionTimeout("TIMEOUT", str(e), e)
    258         raise ConnectionError("N/A", str(e), e)
    259

ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=300))
```

> Also, in some code path your timeout is still 300, not 3000

I don't understand why either. From the documentation I see that `timeout=30`, so I do not know where this 300 comes from:

```
__init__(host: str = "localhost", port: int = 9200, username: str = "", password: str = "",
         index: str = "document", label_index: str = "label", search_fields: Union[str, list] = "text",
         text_field: str = "text", name_field: str = "name", embedding_field: str = "embedding",
         embedding_dim: int = 768, custom_mapping: Optional[dict] = None,
         excluded_meta_data: Optional[list] = None, faq_question_field: Optional[str] = None,
         analyzer: str = "standard", scheme: str = "http", ca_certs: bool = False,
         verify_certs: bool = True, create_index: bool = True, update_existing_documents: bool = False,
         refresh_type: str = "wait_for", similarity="dot_product", timeout=30,
         return_embedding: Optional[bool] = True)
```

Freemanlabs commented 3 years ago

Another question I have: why is DPR giving issues when BM25 and TFIDF work okay?

lalitpagaria commented 3 years ago

Well, BM25 and TFIDF do not use a script_score query, but DPR needs query_by_embedding, which uses the similarity function to compare the query embedding against the document vectors, and that is considerably heavier for Elasticsearch to execute.
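
For illustration, the query body that query_by_embedding sends to Elasticsearch looks roughly like the sketch below (a simplified assumption based on dot-product similarity and the default "embedding" field; the exact Painless script depends on your Haystack version):

```python
import numpy as np

# Placeholder for the DPR query embedding (normally produced by retriever.embed_queries).
query_embedding = np.random.rand(768).astype("float32")

# Simplified sketch of a script_score query as used for embedding retrieval.
body = {
    "size": 10,
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "dotProduct(params.query_vector, 'embedding') + 1000",
                "params": {"query_vector": query_embedding.tolist()},
            },
        }
    },
}
# Elasticsearch has to evaluate this script for every candidate document,
# which is much heavier than a plain BM25 text query and can hit the timeout.
```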

Can you manually increase the timeout at this place https://github.com/deepset-ai/haystack/blob/master/haystack/document_store/elasticsearch.py#L574 and test with that code?
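
One rough way to do that on Colab (a sketch, not part of Haystack's API; paths depend on where the package is installed) is to patch the installed file in place:

```python
import os
import haystack

# Locate the installed copy of elasticsearch.py inside the haystack package.
es_path = os.path.join(os.path.dirname(haystack.__file__), "document_store", "elasticsearch.py")
print(es_path)

# Bump the hardcoded request_timeout from 300 to 3000, then restart the
# runtime (or re-import haystack) so the change takes effect.
with open(es_path) as f:
    src = f.read()
with open(es_path, "w") as f:
    f.write(src.replace("request_timeout=300", "request_timeout=3000"))
```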

If increasing the timeout does not solve your issue either, then you can try DPR with the FAISS document store. Otherwise try to use a custom plugin, but in order to use it you would need to make a few changes to elasticsearch.py in Haystack.

@tholor I think we need to add timeout customisation at the following place as well, instead of hardcoding it: https://github.com/deepset-ai/haystack/blob/master/haystack/document_store/elasticsearch.py#L574

Freemanlabs commented 3 years ago

Thank you for your response...

Can you manually increase the timeout at this place https://github.com/deepset-ai/haystack/blob/master/haystack/document_store/elasticsearch.py#L574 and test with that code?

I do not know how to locate this file to change it manually. Please assist.

Freemanlabs commented 3 years ago

If increasing the timeout does not solve your issue either, then you can try DPR with the FAISS document store.

I am trying the FAISSDocumentStore. If I do `faiss_document_store.get_document_count()` I see 874 (which is the total number of documents I have). But when I pass it to the retriever like so, `dpr_retriever = DensePassageRetriever(document_store=faiss_document_store)`, and try to get answers from the finder, I get this INFO:

```
12/03/2020 08:26:05 - INFO - haystack.finder - Got 0 candidates from retriever
12/03/2020 08:26:05 - INFO - haystack.finder - Retriever did not return any documents. Skipping reader ...
```

When I inspect the retriever like so: `dpr_retriever.retrieve(query="I would expect the remaining TFC protection to remain protected in both")`

All I see is an empty array

```
Creating Embeddings: 100%|██████████| 1/1 [00:00<00:00, 4.84 Batches/s]
[]
```

Too much frustration as a first-time user of Haystack, I must say. My supervisor feels I am not doing anything, because what he expects to see is results, not issues.

tholor commented 3 years ago

Sorry to hear that you are having trouble. From what I see, the main issues were around your usage on Colab (mounting problems, connection timeouts...). We will do our best to simplify the experience there, but as mentioned above, Colab has some severe resource limitations when running heavy external services like Elasticsearch or FAISS.

All I see is an empty array

Did you call `document_store.update_embeddings(retriever)` as described in this tutorial?
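
For reference, a minimal sketch of the intended FAISS + DPR flow (import paths follow the directory layout seen in the traceback above and may differ in newer releases; `docs` is a placeholder for your already converted documents):

```python
from haystack.document_store.faiss import FAISSDocumentStore
from haystack.retriever.dense import DensePassageRetriever

# Placeholder documents; normally this is the output of your file converters.
docs = [{"text": "Example passage about TFC protection.", "meta": {"name": "example.txt"}}]

document_store = FAISSDocumentStore()
document_store.write_documents(docs)

dpr_retriever = DensePassageRetriever(document_store=document_store)

# Without this step the store holds no DPR embeddings, so retrieve() returns
# an empty list, exactly as described above.
document_store.update_embeddings(dpr_retriever)

print(dpr_retriever.retrieve(query="who is who?", top_k=10))
```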

You can also jump on a quick call with one of our engineers if you need further help or share your colab notebook here.

Freemanlabs commented 3 years ago

Thanks for pointing out that resource. I was basically following the Haystack documentation. All seems well now.

tholor commented 3 years ago

Great! Closing this now. Feel free to re-open if the problem comes up again...