ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

Error on Windows 10: UnboundLocalError: local variable 'browser' referenced before assignment #777

Open 1272870698 opened 3 weeks ago

1272870698 commented 3 weeks ago

playwright install:

[screenshot: output of `playwright install`]

Code used:

[screenshot: the script that was run]
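
Since the screenshots did not survive, here is the script reconstructed from the traceback below; the contents of graph_config are not visible in the trace, so the config shown is an assumed OpenAI-style example:

from scrapegraphai.graphs import SmartScraperGraph

# Assumed config: the original poster's graph_config is not shown in the traceback
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-4o-mini",
    },
    "headless": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Find some information about what does the company do, the name and a contact email.",
    source="https://scrapegraphai.com/",
    config=graph_config,
)
result = smart_scraper_graph.run()  # raises UnboundLocalError on this setup
print(result)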

Error:

UnboundLocalError: local variable 'browser' referenced before assignment

Full traceback:

Cell In[4], line 27
     18 # ************************************************
     19 # Create the SmartScraperGraph instance and run it
     20 # ************************************************
     21 smart_scraper_graph = SmartScraperGraph(
     22     prompt="Find some information about what does the company do, the name and a contact email.",
     23     source="https://scrapegraphai.com/",
     24     config=graph_config
     25 )
---> 27 result = smart_scraper_graph.run()
     28 print(result)

File D:\softs\anaconda3\envs\flux\lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py:212, in SmartScraperGraph.run(self)
    204 """
    205 Executes the scraping process and returns the answer to the prompt.
    206 
    207 Returns:
    208     str: The answer to the prompt.
    209 """
    211 inputs = {"user_prompt": self.prompt, self.input_key: self.source}
--> 212 self.final_state, self.execution_info = self.graph.execute(inputs)
    214 return self.final_state.get("answer", "No answer found.")

File D:\softs\anaconda3\envs\flux\lib\site-packages\scrapegraphai\graphs\base_graph.py:284, in BaseGraph.execute(self, initial_state)
    282     return (result["_state"], [])
    283 else:
--> 284     return self._execute_standard(initial_state)

File D:\softs\anaconda3\envs\flux\lib\site-packages\scrapegraphai\graphs\base_graph.py:198, in BaseGraph._execute_standard(self, initial_state)
    185     graph_execution_time = time.time() - start_time
    186     log_graph_execution(
    187         graph_name=self.graph_name,
    188         source=source,
   (...)
    196         exception=str(e)
    197     )
--> 198     raise e
    199 node_exec_time = time.time() - curr_time
    200 total_exec_time += node_exec_time

File D:\softs\anaconda3\envs\flux\lib\site-packages\scrapegraphai\graphs\base_graph.py:182, in BaseGraph._execute_standard(self, initial_state)
    180 with self.callback_manager.exclusive_get_callback(llm_model, llm_model_name) as cb:
    181     try:
--> 182         result = current_node.execute(state)
    183     except Exception as e:
    184         error_node = current_node.node_name

File D:\softs\anaconda3\envs\flux\lib\site-packages\scrapegraphai\nodes\fetch_node.py:130, in FetchNode.execute(self, state)
    128     return self.handle_local_source(state, source)
    129 else:
--> 130     return self.handle_web_source(state, source)

File D:\softs\anaconda3\envs\flux\lib\site-packages\scrapegraphai\nodes\fetch_node.py:305, in FetchNode.handle_web_source(self, state, source)
    303 else:
    304     loader = ChromiumLoader([source], headless=self.headless, **loader_kwargs)
--> 305     document = loader.load()
    307 if not document or not document[0].page_content.strip():
    308     raise ValueError("""No HTML body content found in
    309                      the document fetched by ChromiumLoader.""")

File D:\softs\anaconda3\envs\flux\lib\site-packages\langchain_core\document_loaders\base.py:31, in BaseLoader.load(self)
     29 def load(self) -> list[Document]:
     30     """Load data into Document objects."""
---> 31     return list(self.lazy_load())

File D:\softs\anaconda3\envs\flux\lib\site-packages\scrapegraphai\docloaders\chromium.py:192, in ChromiumLoader.lazy_load(self)
    189 scraping_fn = getattr(self, f"ascrape_{self.backend}")
    191 for url in self.urls:
--> 192     html_content = asyncio.run(scraping_fn(url))
    193     metadata = {"source": url}
    194     yield Document(page_content=html_content, metadata=metadata)

File D:\softs\anaconda3\envs\flux\lib\site-packages\nest_asyncio.py:30, in _patch_asyncio.<locals>.run(main, debug)
     28 task = asyncio.ensure_future(main)
     29 try:
---> 30     return loop.run_until_complete(task)
     31 finally:
     32     if not task.done():

File D:\softs\anaconda3\envs\flux\lib\site-packages\nest_asyncio.py:98, in _patch_loop.<locals>.run_until_complete(self, future)
     95 if not f.done():
     96     raise RuntimeError(
     97         'Event loop stopped before Future completed.')
---> 98 return f.result()

File D:\softs\anaconda3\envs\flux\lib\asyncio\futures.py:201, in Future.result(self)
    199 self.__log_traceback = False
    200 if self._exception is not None:
--> 201     raise self._exception.with_traceback(self._exception_tb)
    202 return self._result

File D:\softs\anaconda3\envs\flux\lib\asyncio\tasks.py:232, in Task.__step(***failed resolving arguments***)
    228 try:
    229     if exc is None:
    230         # We use the `send` method directly, because coroutines
    231         # don't have `__iter__` and `__next__` methods.
--> 232         result = coro.send(None)
    233     else:
    234         result = coro.throw(exc)

File D:\softs\anaconda3\envs\flux\lib\site-packages\scrapegraphai\docloaders\chromium.py:136, in ChromiumLoader.ascrape_playwright(self, url)
    134             results = f"Error: Network error after {self.RETRY_LIMIT} attempts - {e}"
    135     finally:
--> 136         await browser.close()
    138 return results
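
The tail of the traceback is the telling part: in chromium.py, browser is only assigned inside the try block, so if the browser launch itself fails (commonly missing Playwright browsers or system dependencies), the finally block references a name that was never bound, and the real error gets masked. A minimal standalone illustration of the pattern (not ScrapeGraphAI code):

def launch():
    # Stand-in for Playwright's browser launch; assume it fails in this environment
    raise RuntimeError("chromium failed to launch")

def fetch():
    try:
        browser = launch()  # raises before 'browser' is ever bound
        return browser
    finally:
        browser.close()  # UnboundLocalError: 'browser' was never assigned

fetch()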

Please help me.

VinciGit00 commented 2 weeks ago

Which site is this?

VinciGit00 commented 2 weeks ago

Hi, can I have a reply?

aleenprd commented 2 weeks ago

I am having the same issue when running the code in a container. The problem is not present when running locally.

aleenprd commented 2 weeks ago

"Which site is this?"

What do you mean by "site"?

VinciGit00 commented 1 week ago

Website

aleenprd commented 1 week ago

I don't think it's relevant. Also, see the issue I just posted.

calvincolton commented 5 days ago

I too am experiencing the same issue in a container. I am currently trying with Playwright's official Python Docker image:

# Use Playwright's official Docker image
FROM mcr.microsoft.com/playwright/python:v1.48.0-noble

ARG OPENAI_API_KEY
ENV OPENAI_API_KEY=$OPENAI_API_KEY

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app /app

ENV PYTHONPATH=/app

CMD ["sh", "-c", "export PYTHONPATH=/app && python main.py"]
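
(Note that the ENV PYTHONPATH=/app line already sets the variable for the container, so the export in CMD is redundant; CMD ["python", "main.py"] would behave the same.)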

I have followed the quick install instructions and have tried both the headless and non-headless options, e.g.:

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": True,  # Headless mode for Docker compatibility
}

The suggestion below might not fix the underlying launch failure, but there does appear to be a bug in scrapegraphai/docloaders/chromium.py: the browser variable referenced in the finally block is not guaranteed to have been assigned. It should be initialized to a null value such as None before the try/except, and the finally block should only close it after an if check.
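
A sketch of that guard, with the method body inferred from the traceback above rather than copied from the upstream source:

from playwright.async_api import async_playwright

async def ascrape_playwright(self, url: str) -> str:
    # Sketch of the suggested fix, not the exact upstream code
    browser = None  # bind the name up front so 'finally' can always reference it
    async with async_playwright() as p:
        try:
            browser = await p.chromium.launch(headless=self.headless)
            page = await browser.new_page()
            await page.goto(url)
            results = await page.content()
        except Exception as e:
            results = f"Error: Network error after {self.RETRY_LIMIT} attempts - {e}"
        finally:
            if browser is not None:
                await browser.close()  # only close a browser that actually launched
    return results

With the guard in place, a failed chromium launch would surface its own exception (for example, missing browser binaries in the container) instead of being masked by the UnboundLocalError raised in the finally block.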