Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

azure-ai-generative[synthetic] Long-running simulator job fails if AOAI connection reset by peer #34961

Open dratcliffe-microsoft opened 6 months ago

dratcliffe-microsoft commented 6 months ago

Describe the bug
A long-running instance of azure.ai.generative.synthetic.simulator.Simulator.simulate_asyc(...) with thousands of chat templates (mock user profiles) against AOAI GPT-4 dies if the underlying aiohttp-based connection is reset by the peer.

To Reproduce
Difficult to reproduce outside the customer's environment, and it is unclear which peer caused the reset (the AML CI running this package or the AOAI GPT-4 endpoint).

Expected behavior
I expected this exception to be caught and logged, and the connection to be retried. Instead, the Simulator.simulate_asyc(...) call terminates early and fails to complete all conversations.
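To make the expected catch, log, and retry behavior concrete, here is a minimal standard-library sketch. It is not the package's code: `call_with_retry`, `make_request`, and every parameter are hypothetical names chosen for illustration. It assumes the failing request can be wrapped in a zero-argument coroutine function, and it relies on aiohttp's `ClientOSError` deriving from `OSError`, so catching `OSError` should cover a connection reset.

```python
import asyncio
import logging
import random

logger = logging.getLogger(__name__)


async def call_with_retry(
    make_request,            # hypothetical: zero-argument coroutine function
    *,
    attempts=5,
    base_delay=1.0,
    max_delay=30.0,
    retry_on=(OSError,),     # ConnectionResetError is an OSError subclass
    sleep=asyncio.sleep,     # injectable so tests can skip real waiting
):
    """Await make_request(), retrying transient connection errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await make_request()
        except retry_on as exc:
            if attempt == attempts:
                logger.error("Giving up after %d attempts: %r", attempts, exc)
                raise
            # Exponential backoff, capped, with a little jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay += random.uniform(0, delay * 0.1)
            logger.warning(
                "Attempt %d/%d failed (%r); retrying in %.1fs",
                attempt, attempts, exc, delay,
            )
            await sleep(delay)
```

A caller would pass something like `lambda: bot.generate_response(...)` as `make_request`. Whether a retry belongs at this level or inside the existing retry client is a design decision for the maintainers.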

Screenshots

Stack trace of the exception that caused Simulator.simulate_asyc(...) to terminate early:

...
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/azure/ai/generative/synthetic/simulator/_conversation/conversation.py", line 61, in simulate_conversation
    (first_response, request, _, full_response) = await bots[0].generate_response(
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/azure/ai/generative/synthetic/simulator/_conversation/conversation_bot.py", line 142, in generate_response
    response = await self.model.get_conversation_completion(
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/azure/ai/generative/synthetic/simulator/_model_tools/models.py", line 586, in get_conversation_completion
    return await self.request_api(
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/azure/ai/generative/synthetic/simulator/_model_tools/models.py", line 495, in request_api
    async with session.post(
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp_retry/client.py", line 149, in __aenter__
    return await self._do_request()
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp_retry/client.py", line 138, in _do_request
    raise e
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp_retry/client.py", line 100, in _do_request
    response: ClientResponse = await self._request_func(
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp/client.py", line 605, in _request
    await resp.start(conn)
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 966, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/anaconda/envs/cba2/lib/python3.10/site-packages/aiohttp/streams.py", line 622, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer

Additional context
We (ISE) are currently working with a customer to build an evaluation framework that uses this package to stress-test their new AOAI-based Copilot chatbot. Early termination of the simulator breaks the pipeline that generates their synthetic evaluation data.
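Until the simulator handles resets itself, one way a pipeline like this could keep a single failed conversation from ending the whole batch is to collect exceptions instead of propagating them. The sketch below uses only `asyncio`. `run_all` and `conversation_factories` are hypothetical names, and it assumes each conversation can be started from a zero-argument coroutine function, which may not match how the simulator schedules its work.

```python
import asyncio


async def run_all(conversation_factories, *, max_concurrency=8):
    """Run every conversation; return (completed, failed) instead of raising.

    conversation_factories: hypothetical list of zero-argument coroutine
    functions, one per chat template.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(factory):
        # Limit how many conversations hit the endpoint at once.
        async with semaphore:
            return await factory()

    # return_exceptions=True stores each exception in the result list
    # rather than aborting the remaining conversations.
    results = await asyncio.gather(
        *(run_one(f) for f in conversation_factories),
        return_exceptions=True,
    )
    completed = [r for r in results if not isinstance(r, BaseException)]
    failed = [(i, r) for i, r in enumerate(results) if isinstance(r, BaseException)]
    return completed, failed
```

The indices in `failed` identify which templates to resubmit, so a reset costs one conversation rather than the run.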

@xiaolul

xiangyan99 commented 6 months ago

Thanks for the feedback. We’ll investigate ASAP.