ORNL / flowcept

Runtime data integration system that empowers any data processing system to capture and query workflow provenance using data observability.
MIT License

Losing redis connection #127

Open renan-souza opened 3 months ago

renan-souza commented 3 months ago

Occasionally (fortunately only rarely) during the LLM experiment, we get the error below. We need to debug it to decide how to handle it. One possibility is simply to retry the connection and the failed request until it succeeds. Today, if this error happens, we are likely losing data.

[flowcept][ERROR][frontier06306.frontier.olcf.ornl.gov][pid=61095][thread=140733193385728][function=_start][Connection closed by server.]
Traceback (most recent call last):
  File "/lustre/orion/stf219/scratch/souzar/flowcept/flowcept/flowceptor/consumers/document_inserter.py", line 199, in _start
    for message in pubsub.listen():
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1653, in listen
    response = self.handle_message(self.parse_response(block=True))
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1531, in parse_response
    response = self._execute(conn, try_read)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1507, in _execute
    return conn.retry.call_with_retry(
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/retry.py", line 49, in call_with_retry
    fail(error)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1509, in <lambda>
    lambda error: self._disconnect_raise_connect(conn, error),
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1496, in _disconnect_raise_connect
    raise error
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/retry.py", line 46, in call_with_retry
    return do()
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1508, in <lambda>
    lambda: command(*args, **kwargs),
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1529, in try_read
    return conn.read_response()
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 848, in read_response
    response = self._parser.read_response(disable_decoding=disable_decoding)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 335, in read_response
    result = self._read_response(disable_decoding=disable_decoding)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 383, in _read_response
    response = [
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 384, in <listcomp>
    self._read_response(disable_decoding=disable_decoding)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 377, in _read_response
    response = self._buffer.read(length)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 230, in read
    self._read_from_socket(length - self.length)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 195, in _read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.

renan-souza commented 3 months ago

I found that this is an intermittent error that happens on Frontier, likely due to network issues. Even so, we should consider handling this failure more gracefully instead of just losing the data.
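
As a starting point for discussion, the "retry the connection" idea could look roughly like the sketch below. It is not what `document_inserter.py` does today. The function name `listen_with_reconnect` and its parameters (`make_listener`, `handle`, `retry_on`, `max_retries`, `base_delay`, `sleep`) are hypothetical, and the code uses only the standard library so it can be tried without a Redis server. It wraps the listen loop, re-creates the listener when a connection error is raised, and backs off exponentially between attempts.

```python
import time


def listen_with_reconnect(make_listener, handle, retry_on=(ConnectionError,),
                          max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Consume messages, re-creating the listener when the connection drops.

    make_listener: hypothetical factory returning a fresh iterable of messages
                   (for example, a function that subscribes again and returns
                   the iterator to read from).
    handle:        callback invoked once per message.
    retry_on:      exception types treated as transient connection failures.
    """
    failures = 0
    while True:
        try:
            for message in make_listener():
                handle(message)
                failures = 0  # a successful read resets the backoff
            return  # the listener ended normally
        except retry_on:
            failures += 1
            if failures > max_retries:
                raise  # give up and surface the error to the caller
            sleep(base_delay * 2 ** (failures - 1))  # exponential backoff
```

With redis-py, `make_listener` would presumably create a new pubsub object with `client.pubsub()`, call `subscribe(...)` on it, and return `pubsub.listen()`, with `retry_on=(redis.exceptions.ConnectionError,)`, since that exception is not a subclass of the built-in `ConnectionError`. One caveat: Redis Pub/Sub delivery is at-most-once, so reconnecting only resumes listening. Messages published while the subscriber was disconnected are not replayed, which means a retry alone may not fully stop the data loss.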