aryn-ai / sycamore

🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data.
https://sycamore.readthedocs.io
Apache License 2.0
315 stars 33 forks source link

Non-deterministic failure for test_pdf_to_opensearch #20

Open bsowell opened 1 year ago

bsowell commented 1 year ago

Describe the bug I have seen some non-deterministic failures in the test_pdf_to_opensearch test. The issue appears to be related to how guidance is returning results. In some cases, the returned map does not contain an answer key:

E                         File "/home/runner/work/sycamore/sycamore/sycamore/execution/transforms/entity/entity_extractor.py", line 41, in extract_entity
E                           document.properties.update({f"{self._entity_to_extract}": entities["answer"]})
E                                                                                     ~~~~~~~~^^^^^^^^^^
E                         File "/home/runner/.cache/pypoetry/virtualenvs/sycamore-QdKr-uPZ-py3.11/lib/python3.11/site-packages/guidance/_program.py", line 470, in __getitem__
E                           return self._variables[key]
E                                  ~~~~~~~~~~~~~~~^^^^^
E                       KeyError: 'answer'

From https://github.com/aryn-ai/sycamore/actions/runs/6176295926/job/16765129778#step:10:312

This looks to be non-deterministic, because the very next run of the exact same workflow succeeded:

The code is failing from here: https://github.com/aryn-ai/sycamore/blob/a076d86644375a60f0c527b83e96d5d3e12345a5/sycamore/execution/transforms/entity/entity_extractor.py#L41 We will likely need to add some additional logging to get to the bottom of this.

bsowell commented 12 months ago

Still happening: https://github.com/aryn-ai/sycamore/actions/runs/6293759018/job/17084907000. It is infrequent enough, I suspect it may be dependent on the result result returned by OpenAI.

bsowell commented 12 months ago

We traced this to throttling from OpenAI. Looks like we are already doing basic retries, but we should see if there is anything better we can do and at the very least surface the error more clearly

ChillOrb commented 11 months ago

Hello @bsowell,

I've been investigating the non-deterministic failures you've mentioned. I noticed that test_pdf_to_opensearch does not pass llm_kwargs here, and OpenAI is ultimately set here. I believe you're referring to the retry mechanism here?

So, my queries are:

  1. Would it be beneficial to add a similar mechanism to _generate_using_guidance?

To enhance our retry mechanism, I'm considering using This guide

I would appreciate your thoughts on this. Please let me know if I am heading in the right direction.

ChillOrb commented 11 months ago

Also, i think this is related to https://github.com/aryn-ai/sycamore/issues/36

bsowell commented 11 months ago

Hey @ChillOrb! Thanks so much for taking a look.

So guidance, which is another library for interacting with LLMs, does have some retries enabled by default (e.g. see the default parameter here). Even with that, we still sometimes see timeouts. I do think eventually we want to move off of guidance and just use openai directly -- we aren't getting too much value, and I think we will want to customize the behavior, for example by using some of the techniques you linked. One place to start might be to see if we are doing the right thing in terms of retries for the non-guidance based access.

I do think this is slightly different than #36. That issue is about OpenSearch. When writing to OpenSearch we can sometimes overwhelm the cluster and need to add mechanisms to backoff. Here we are hitting issues access OpenAI. Kind of unfortunate in this project that everything starts with Open :).

ChillOrb commented 11 months ago

Hey @bsowell ,

Thanks for clarifying,

Code here looks fine to me. Still a few things i think could be addressed and would love to get your opinion,

  1. Deciding on number of max retries (currently 3)
  2. Using exponential backoff which I think can be useful since I predict this error can be more frequent as one scales up.
  3. Also will it be a good idea to check for content length here
  4. What are your views on creating a custom retry backoff mechanism as mentioned here and calculate delay and retries based on headers here ?
  5. Custom retry can also help in surfacing the error more clearly.
  6. Is there a need for batching request parallelly? Ref: here .

Let me know if this is something that needs to be worked on right now.