ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

Context length exceeded #543

Closed Cdingram closed 6 days ago

Cdingram commented 1 month ago

Describe the bug

When doing some crawls, I get the following error:

```
Error code: 400 - {'error': {'message': "This model's maximum context length is 128000 tokens. However, your messages resulted in 129936 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
```

To Reproduce

```python
from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {
        "api_key": <openai_key>,
        "model": "gpt-4o-mini"
    },
    "max_results": 6
}

search_graph = SearchGraph(prompt="Get me urls to pepe wearing a tux memes", config=graph_config)
result = search_graph.run()
```

Don't ask, it was a customer query lol.

f-aguzzi commented 3 weeks ago

We made a temporary fix that should solve your issue. It's been released on version 1.14.

Let us know if it works now. Thanks for taking interest in our library and for reporting this bug.

tm-robinson commented 2 weeks ago

I have this issue when using the ScriptCreatorGraph when the page being accessed and passed to the LLM is very long, e.g.:

```
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 128000 tokens. However, your messages resulted in 228910 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
```

Is the same fix needed for ScriptCreatorGraph as well?

f-aguzzi commented 2 weeks ago

ScriptCreatorGraph is the only graph that does not support chunking at the moment. If the request is too long, it just won't work. I don't know what design principle was behind this limitation, but unfortunately it's there.
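For context, the chunking that the other graphs do amounts to splitting the page text into pieces that each fit in the model's context window. Below is a rough sketch of that kind of pre-flight check and naive split; the names, the 4-chars-per-token estimate, and the half-window chunk size are all my own illustrative assumptions, not ScrapeGraphAI's actual internals (a real implementation would count tokens with something like tiktoken and split on semantic boundaries):

```python
# Illustrative pre-flight context check and naive chunking.
# All names and constants here are assumptions for the sketch,
# not ScrapeGraphAI internals.

MAX_CONTEXT_TOKENS = 128_000
CHARS_PER_TOKEN = 4  # crude average for English text


def estimate_tokens(text: str) -> int:
    """Cheap character-based token estimate (a real implementation
    would use an actual tokenizer such as tiktoken)."""
    return len(text) // CHARS_PER_TOKEN


def split_into_chunks(text: str, max_tokens: int = MAX_CONTEXT_TOKENS // 2) -> list[str]:
    """Split text into pieces that each fit comfortably in the window."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


# A page far over the limit: ~1.5M characters.
page = "word " * 300_000

if estimate_tokens(page) > MAX_CONTEXT_TOKENS:
    chunks = split_into_chunks(page)
else:
    chunks = [page]

print(len(chunks))
```

Each chunk can then be sent to the model in its own request, which is exactly the step GenerateScraperNode skips.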

tm-robinson commented 2 weeks ago

Thanks very much. I had a look at the GenerateScraperNode module and compared it with the GenerateAnswerNode module. I can see that, as you say, GenerateScraperNode simply doesn't support chunks (and currently the ScriptCreatorGraph won't attempt to provide it with chunks anyway).

I suspect the reason is that chunking works for data extraction but not for script generation. If you chunk up the content of a page and ask the LLM to convert each chunk to structured data, you can simply combine the per-chunk results into a single result and return it to the user. The same approach doesn't work when generating a script: you would be left with one script per chunk, and each script could be different if the structure of the page differs between chunks.
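To make that contrast concrete, here is a toy sketch of the chunk-and-merge pattern that works for answer generation. The `extract_records` stub stands in for an LLM call (all names here are hypothetical, not library code); the point is that per-chunk outputs are homogeneous records that concatenate cleanly, whereas per-chunk *scripts* would not merge this way:

```python
# Toy chunk-and-merge extraction. extract_records is a stub standing in
# for a per-chunk LLM call; the names are illustrative only.

def extract_records(chunk: str) -> list[dict]:
    """Stub 'LLM': pull URL-like tokens out of a chunk of text."""
    return [{"url": tok} for tok in chunk.split() if tok.startswith("http")]


def chunked_extract(text: str, lines_per_chunk: int = 2) -> list[dict]:
    """Split the text into chunks, extract records from each chunk,
    and merge the per-chunk results into one list. This merge step is
    why chunking works for structured-data extraction but not for
    generating a single coherent script."""
    lines = text.splitlines()
    chunks = [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]
    results: list[dict] = []
    for chunk in chunks:
        results.extend(extract_records(chunk))
    return results
```

With script generation there is no analogous merge: two chunk-specific scripts that parse different page sections cannot just be concatenated into one working scraper.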

However, I think it may be worth trying a few different approaches to solve this:

Do those sound like they could work? I am happy to try them out when I have some free time.

VinciGit00 commented 2 weeks ago

Hi @tm-robinson, if you want, you can update the GenerateScraperNode.

tm-robinson commented 1 week ago

@VinciGit00 I've added a PR for the simpler fix to GenerateScraperNode. I'll work on the more complex solution at some point soon, hopefully.

f-aguzzi commented 6 days ago

Closing this, as fixes (both temporary and permanent) have been published on both the beta and stable releases.