bespokelabsai / curator

Apache License 2.0

Write iteratively to batch_objects.jsonl for recovery and catch openai.APITimeoutError #120

Open RyanMarten opened 2 hours ago

RyanMarten commented 2 hours ago

I'm seeing a bunch of the timeouts ^ as we try to upload all the batches:

```
(_Completions pid=125780, ip=10.120.0.7) INFO:openai._base_client:Retrying request to /files in 0.839617 seconds
(_Completions pid=125780, ip=10.120.0.7) INFO:openai._base_client:Retrying request to /files in 0.884488 seconds
(_Completions pid=125780, ip=10.120.0.7) INFO:openai._base_client:Retrying request to /files in 0.821620 seconds
```

This is what killed the job:

```
openai.APITimeoutError: Request timed out.
    raise APITimeoutError(request=request) from err
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/8f9a6c08a6f7b36cef5b248cf848c00d3b8e4aef/virtualenv/lib/python3.10/site-packages/openai/_base_client.py", line 1591, in _request
    return await self._retry_request(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/8f9a6c08a6f7b36cef5b248cf848c00d3b8e4aef/virtualenv/lib/python3.10/site-packages/openai/_base_client.py", line 1581, in _request
    return await self._retry_request(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/8f9a6c08a6f7b36cef5b248cf848c00d3b8e4aef/virtualenv/lib/python3.10/site-packages/openai/_base_client.py", line 1581, in _request
    return await self._request(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/8f9a6c08a6f7b36cef5b248cf848c00d3b8e4aef/virtualenv/lib/python3.10/site-packages/openai/_base_client.py", line 1533, in request
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/8f9a6c08a6f7b36cef5b248cf848c00d3b8e4aef/virtualenv/lib/python3.10/site-packages/openai/_base_client.py", line 1839, in post
    return await self._post(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/8f9a6c08a6f7b36cef5b248cf848c00d3b8e4aef/virtualenv/lib/python3.10/site-packages/openai/resources/files.py", line 443, in create
    batch_file_upload = await async_client.files.create(
```

There are 100+ batches successfully submitted and visible in the dashboard… I wonder if we can recover and just use those.

When this happens, there is no batch_objects.jsonl in the cache. It should be written iteratively during batch creation, so that if some of the batch submissions fail we still have a partial record and can use it.
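A minimal sketch of the idea, appending one JSON line per batch immediately after each successful submission so a mid-run `openai.APITimeoutError` still leaves the earlier batch objects on disk. `submit_batches`, `submit_fn`, and the record shape are hypothetical names for illustration, not curator's actual API:

```python
import json
from pathlib import Path


def submit_batches(request_files, submit_fn, batch_objects_path="batch_objects.jsonl"):
    """Submit each batch and append its batch object to batch_objects.jsonl
    right away, instead of writing the whole file once at the end.

    submit_fn(request_file) stands in for the real upload + batch creation
    (e.g. files.create followed by batches.create) and returns a dict.
    """
    path = Path(batch_objects_path)
    submitted, failed = [], []
    for request_file in request_files:
        try:
            batch_object = submit_fn(request_file)
        except Exception as err:  # in practice: openai.APITimeoutError
            failed.append((request_file, err))
            continue
        # One JSON line per successful submission; an append keeps prior lines
        # intact even if a later submission kills the job.
        with path.open("a") as f:
            f.write(json.dumps({"request_file": request_file, **batch_object}) + "\n")
        submitted.append(request_file)
    return submitted, failed
```

With this, a crash after 100 successful submissions leaves 100 recoverable lines in the cache rather than nothing.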

RyanMarten commented 2 hours ago

Also check the logic here....

If we are iteratively writing batch_objects.jsonl, then just checking that the file exists doesn't mean we can skip batch submission. We instead need to resume by looking at the recorded metadata to determine which batches have been submitted and which haven't.
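A sketch of that resume check, assuming each line of batch_objects.jsonl records the request file it came from (the `request_file` key and the helper name are illustrative assumptions):

```python
import json
from pathlib import Path


def pending_request_files(request_files, batch_objects_path="batch_objects.jsonl"):
    """Return the request files that still need submission.

    The mere existence of batch_objects.jsonl is not enough to skip
    submission: the file may be partial, so compare its recorded entries
    against the full set of request files and resubmit only the gap.
    """
    path = Path(batch_objects_path)
    already_submitted = set()
    if path.exists():
        with path.open() as f:
            for line in f:
                if line.strip():
                    already_submitted.add(json.loads(line)["request_file"])
    return [rf for rf in request_files if rf not in already_submitted]
```

On restart, the submitter would call this first and only submit the returned files, rather than skipping submission entirely whenever the file exists.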