ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

Exec Info Misses nested Graph Executions #576

Closed portoaj closed 1 month ago

portoaj commented 2 months ago

Describe the bug

If you run a graph such as SearchGraph, the only entries in graph_exec_info come from the SearchGraph itself; they do not include the child SmartScraperGraph instances run by the GraphIteratorNode. Since the GraphIteratorNode likely accounts for most of the tokens the run actually consumes, this could lead people to massively underestimate how much they're spending on queries/tokens.

To Reproduce

Here's code to reproduce the issue:

import json
from scrapegraphai.utils import prettify_exec_info
from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {

        "model": "gpt-4o-mini",
    },
    "verbose": True,
    "headless": True,
    "max_results": 5
}

product_name_1 = 'Sony WH1000XM4 Wireless Noise Canceling Over-Ear Headphones - Black'
product_name_2 = 'Sony WH1000XM5 Wireless Noise Canceling Over-Ear Headphones - Black'

search_graph = SearchGraph(
    prompt=f"Are these 2 products the same product? Here are the two products:\nProduct 1: {product_name_1}\nProduct 2: {product_name_2}\nYour output should be exactly 'yes' or 'no'.",
    config=graph_config
)

result = search_graph.run()
print(json.dumps(result, indent=4))

graph_exec_info = search_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

Exec Info output:

        node_name  total_tokens  prompt_tokens  completion_tokens  successful_requests  total_cost_USD  exec_time
0  SearchInternet           231            213                 18                    1        0.000043   3.495771
1   GraphIterator             0              0                  0                    0        0.000000   4.696635
2    MergeAnswers           245            236                  9                    1        0.000041   0.364202
3    TOTAL RESULT           476            449                 27                    2        0.000084   8.556608

Expected behavior

I'd expect the GraphIterator row to show the tokens that it used instead of 0. Alternatively, it should find all of the subgraphs used while running this graph and print those within this graph_exec_info, i.e. 0 SearchInternet 1 GraphIterator 2 SmartScraperGraph 3...

VinciGit00 commented 2 months ago

Oh, you're right, we will add it.

LorenzoPaleari commented 2 months ago

Hi, I took the liberty of working on this error.

Context on the Error

As the issue title says, the error occurs when graph executions are nested inside one another. For example, in SearchGraph the IteratorNode calls multiple instances of SmartScraperGraph, creating a nested graph structure.

What happens at the code level is the following. During execution, every graph tries to capture the token information from OpenAI calls by using an OpenAI callback handler: with get_openai_callback() as cb:. We end up with the following structure.

# SearchGraph is executed
with get_openai_callback() as cb1:
    # SearchInternetNode executed
    # Token information gathered by cb1

    # IteratorNode executed
    #     SmartScraperGraph executed
    with get_openai_callback() as cb2:
        # The OpenAI handler is passed to cb2; cb1 loses the handler (~paused).

        # Whole nested graph executed
        # Token information gathered by cb2

        # OpenAI handler released
        pass

    # cb1 resumes and re-obtains the handler.
    # No token information for the nested calls is available to cb1;
    # it has been "consumed" by cb2.

Proposed Fix

To fix the error we can create a CustomContextManager that manages exclusive access to the OpenAi handler.

custom_openai_callback.py

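import threading
from contextlib import contextmanager
from langchain_community.callbacks import get_openai_callback

class CustomOpenAiCallbackManager:
    # Class-level lock shared by every instance, so only one graph at a time
    # can hold the OpenAI callback.
    _lock = threading.Lock()

    @contextmanager
    def exclusive_get_openai_callback(self):
        # Non-blocking acquire: the outermost caller gets the lock,
        # nested callers do not wait for it.
        if CustomOpenAiCallbackManager._lock.acquire(blocking=False):
            try:
                with get_openai_callback() as cb:
                    yield cb
            finally:
                CustomOpenAiCallbackManager._lock.release()
        else:
            # The callback is already held by an enclosing graph: yield None
            # so the caller skips its own exec_info update.
            yield None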

base_graph.py

self.callback_manager = CustomOpenAiCallbackManager()
[...]

with self.callback_manager.exclusive_get_openai_callback() as cb:
[...]

if cb is not None:
   # update exec_info

Result

        node_name  total_tokens  prompt_tokens  completion_tokens  successful_requests  total_cost_USD  exec_time
0  SearchInternet           170            161                  9                    1        0.000030   3.868281
1   GraphIterator         46841          46456                385                    5        0.007199  10.237838
2    MergeAnswers          1152            825                327                    1        0.000320   3.569431
3    TOTAL RESULT         48163          47442                721                    7        0.007549  17.675550

Solution Downsides

We lose the detailed cost information for nested graphs: we do not know how the cost inside GraphIterator is divided (5 calls to SmartScraperGraph, which is composed of FetchNode, ParseNode...).

Obtaining this kind of detailed information would require more engineering on the CustomOpenAiCallback. I can work on this in the next few days.

I think that for now it is already good to have at least the complete cost of an execution, so I opened a PR for this issue; let me know if you like the proposed solution.

#670

VinciGit00 commented 1 month ago

hi, please update to the new version