ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

Context length exceeded #543

Closed Cdingram closed 6 days ago

Cdingram commented 1 month ago

Describe the bug

When doing some crawls, I get the following error:

```
Error code: 400 - {'error': {'message': "This model's maximum context length is 128000 tokens. However, your messages resulted in 129936 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
```

To Reproduce

```python
from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {
        "api_key": <openai_key>,
        "model": "gpt-4o-mini"
    },
    "max_results": 6
}

search_graph = SearchGraph(prompt="Get me urls to pepe wearing a tux memes", config=graph_config)
result = search_graph.run()
```

Don't ask, it was a customer query lol.

f-aguzzi commented 3 weeks ago

We made a temporary fix that should solve your issue. It's been released on version 1.14.

Let us know if it works now. Thanks for taking interest in our library and for reporting this bug.

tm-robinson commented 2 weeks ago

I have this issue when using the ScriptCreatorGraph when the page being accessed and passed to the LLM is very long, e.g.:

```
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 128000 tokens. However, your messages resulted in 228910 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
```

Is the same fix needed for ScriptCreatorGraph as well?

f-aguzzi commented 2 weeks ago

ScriptCreatorGraph is the only graph that does not support chunking at the moment. If the request is too long, it just won't work. I don't know what design principle was behind this limitation, but unfortunately it's there.
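For context, the chunking that the other graphs do amounts to splitting the page text into pieces that each fit in the model's context window. Below is a rough sketch of that kind of pre-flight check and naive split; the names, the 4-chars-per-token estimate, and the half-window chunk size are all my own illustrative assumptions, not ScrapeGraphAI's actual internals (a real implementation would count tokens with something like tiktoken and split on semantic boundaries):

```python
# Illustrative pre-flight context check and naive chunking.
# All names and constants here are assumptions for the sketch,
# not ScrapeGraphAI internals.

MAX_CONTEXT_TOKENS = 128_000
CHARS_PER_TOKEN = 4  # crude average for English text


def estimate_tokens(text: str) -> int:
    """Cheap character-based token estimate (a real implementation
    would use an actual tokenizer such as tiktoken)."""
    return len(text) // CHARS_PER_TOKEN


def split_into_chunks(text: str, max_tokens: int = MAX_CONTEXT_TOKENS // 2) -> list[str]:
    """Split text into pieces that each fit comfortably in the window."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


# A page far over the limit: ~1.5M characters.
page = "word " * 300_000

if estimate_tokens(page) > MAX_CONTEXT_TOKENS:
    chunks = split_into_chunks(page)
else:
    chunks = [page]

print(len(chunks))
```

Each chunk can then be sent to the model in its own request, which is exactly the step GenerateScraperNode skips.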

tm-robinson commented 2 weeks ago

Thanks very much. I had a look at the GenerateScraperNode module and compared it with the GenerateAnswerNode module. I can see that, as you say, GenerateScraperNode simply doesn't support chunks (and currently the ScriptCreatorGraph won't attempt to provide it with chunks anyway).

I suspect the reason is that chunking works for data extraction but not for script generation. If you chunk up the content of a page and ask the LLM to convert each chunk to structured data, you can simply combine the per-chunk results into a single result and return it to the user. The same approach doesn't work when generating a script: you would be left with one script per chunk, and each script could be different if the structure of the page differs between chunks.
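To make that contrast concrete, here is a toy sketch of the chunk-and-merge pattern that works for answer generation. The `extract_records` stub stands in for an LLM call (all names here are hypothetical, not library code); the point is that per-chunk outputs are homogeneous records that concatenate cleanly, whereas per-chunk *scripts* would not merge this way:

```python
# Toy chunk-and-merge extraction. extract_records is a stub standing in
# for a per-chunk LLM call; the names are illustrative only.

def extract_records(chunk: str) -> list[dict]:
    """Stub 'LLM': pull URL-like tokens out of a chunk of text."""
    return [{"url": tok} for tok in chunk.split() if tok.startswith("http")]


def chunked_extract(text: str, lines_per_chunk: int = 2) -> list[dict]:
    """Split the text into chunks, extract records from each chunk,
    and merge the per-chunk results into one list. This merge step is
    why chunking works for structured-data extraction but not for
    generating a single coherent script."""
    lines = text.splitlines()
    chunks = [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]
    results: list[dict] = []
    for chunk in chunks:
        results.extend(extract_records(chunk))
    return results
```

With script generation there is no analogous merge: two chunk-specific scripts that parse different page sections cannot just be concatenated into one working scraper.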

However, I think it may be worth trying a few different approaches to solve this:

Do those sound like they could work? I am happy to try them out when I have some free time.

VinciGit00 commented 2 weeks ago

Hi @tm-robinson, if you want, you can update the GenerateScraperNode.

tm-robinson commented 1 week ago

@VinciGit00 I've added a PR for the simpler fix to GenerateScraperNode. I'll work on the more complex solution at some point soon, hopefully.

f-aguzzi commented 6 days ago

Closing this, as fixes (both temporary and permanent) have been published on both the beta and stable releases.