aorwall / moatless-tools


Index building exceeds the context limit #5

Closed: zkx06111 closed this issue 2 months ago

zkx06111 commented 2 months ago

I was trying to reproduce the results on SWE-Bench and got an invalid request error for exceeding the maximum context length after ~30 instances.

Have you run into this issue while evaluating on SWE-Bench? Do you think we can just truncate the code to go around it?

  File "/export/home/envs/mt/lib/python3.11/site-packages/llama_index/embeddings/openai/base.py", line 180, in get_embeddings
    data = client.embeddings.create(input=list_of_text, model=engine, **kwargs).data
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/envs/mt/lib/python3.11/site-packages/openai/resources/embeddings.py", line 114, in create
    return self._post(
           ^^^^^^^^^^^
  File "/export/home/envs/mt/lib/python3.11/site-packages/openai/_base_client.py", line 1240, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/envs/mt/lib/python3.11/site-packages/openai/_base_client.py", line 921, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/export/home/envs/mt/lib/python3.11/site-packages/openai/_base_client.py", line 1020, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 11043 tokens (11043 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
Generating embeddings:  39%|████████████████████████████████████████████████████                                                                                   | 5999/15560 [02:03<03:16, 48.56it/s]
aorwall commented 2 months ago

I used voyage-code-2 to embed the repos before running the evaluations. I encountered this issue and added truncation=True to the Voyage config. It's probably needed for text-embedding-3-small as well.
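(For reference, a minimal sketch of what that truncation flag does when calling Voyage directly via the voyageai Python client; the exact config in moatless-tools may differ, and an API key in the environment is assumed:)

    import os
    import voyageai

    # Assumes VOYAGE_API_KEY is set in the environment.
    client = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

    # truncation=True asks the API to cut inputs that exceed the model's
    # context window instead of returning a 400 error.
    result = client.embed(
        ["def long_function(): ..."],
        model="voyage-code-2",
        truncation=True,
    )
    embeddings = result.embeddings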

If you want to reproduce the exact flow I've used, you can use voyage-code-2 instead, as it gave somewhat better results when I evaluated only the vector retrieval solution. I'm not sure how big the difference is for the whole flow though, as I never ran it with text-embedding-3-small.

As it's kind of time consuming to create these indexes, I plan to upload them somewhere as well so it's easier to reproduce the results.

zkx06111 commented 2 months ago

Thanks for the quick response! I'll use Voyage then. Also, I think the OpenAIEmbedding class in LlamaIndex does not have truncation as a parameter.
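(One possible workaround if you stay on text-embedding-3-small is to truncate each chunk to the model's context window before it reaches the embedding call. A sketch, not part of moatless-tools; the 8191-token cap and cl100k_base encoding are based on OpenAI's published limits for this model:)

    import tiktoken

    # text-embedding-3-small uses the cl100k_base tokenizer and an
    # 8192-token context window; leave one token of headroom.
    MAX_TOKENS = 8191
    enc = tiktoken.get_encoding("cl100k_base")

    def truncate_for_embedding(text: str) -> str:
        # Encode, cut to the limit, and decode back to a string that
        # is guaranteed to fit the embedding model's context.
        tokens = enc.encode(text)
        if len(tokens) <= MAX_TOKENS:
            return text
        return enc.decode(tokens[:MAX_TOKENS])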

aorwall commented 2 months ago

I pushed the notebook I used for ingestion: https://github.com/aorwall/moatless-tools/blob/main/notebooks/ingest.ipynb. Not sure it works properly though, as I've done some refactoring lately.

One thing that will decrease indexing time is to sort the instances by date to get as few changes as possible between each commit: instances = sorted(instances, key=lambda x: x["created_at"]).
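(In context, that sort might look like the following; a sketch assuming the SWE-bench Lite dataset from Hugging Face, where each instance carries a created_at timestamp. The dataset name and split are assumptions, not taken from the notebook:)

    from datasets import load_dataset

    # Load the SWE-bench Lite test split (assumed dataset/split names).
    instances = list(load_dataset("princeton-nlp/SWE-bench_Lite", split="test"))

    # Sorting by creation date keeps consecutive instances close together in
    # each repo's history, so fewer files change between checked-out commits
    # and more of the existing index can be reused.
    instances = sorted(instances, key=lambda x: x["created_at"])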

zkx06111 commented 2 months ago

By the way, do you have any idea why the epicsplitter is not preventing this from happening?