ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

Scrapegraph returns relative path URLs instead of absolute paths **Possible Bug?** #544

Closed. sandeepchittilla closed this issue 5 hours ago.

sandeepchittilla commented 1 month ago

Describe the bug When using gpt-4o as the LLM and scraping a webpage to return a list of links, sometimes the paths returned are relative, or use a placeholder domain, instead of the actual absolute URLs (see the examples under "Expected behavior" below).

The behaviour was consistent until 3 days ago, i.e. it always returned full paths, even on a large dataset. Since then, I had to uninstall and reinstall the Scrapegraph library, and that's when this issue started popping up.

Expected behavior For example: asking to scrape a website www.some-actual-website.com and return a list of webpages that contain the company's contact details used to consistently return a JSON like:

{"list_of_urls": "['www.some-actual-website.com/about','www.some-actual-website.com/contact-us']"}

However, now I get either :

{"list_of_urls": "['https://example.com/about', 'https://example.com/contact-us']"}

OR

{"list_of_urls": "['/about','/contact-us']"}

I'm curious: shouldn't the list of URLs being parsed/scraped be a straightforward output? Is the final output always produced by the LLM?
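
For context, relative paths are mechanically resolvable once the page URL is known, so as a workaround the output can be post-processed with Python's standard library. A minimal sketch (the URLs are placeholders, and this obviously can't repair hallucinated domains like example.com):

from urllib.parse import urljoin

base = "https://www.some-actual-website.com/"
raw_urls = ["/about", "https://www.some-actual-website.com/contact-us"]

# urljoin resolves relative paths against the base and leaves
# already-absolute URLs untouched
absolute = [urljoin(base, u) for u in raw_urls]
# ['https://www.some-actual-website.com/about',
#  'https://www.some-actual-website.com/contact-us']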


ekinsenler commented 1 month ago

I think in your case the HTML content returned from the fetch node doesn't contain any links, so the LLM is making up random links similar to the ones in the prompt. Did you try disabling headless mode in the graph config?

graph_config = {
    ...
    "verbose": True,
    "headless": False,
}
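
For completeness, this is roughly how such a config is passed into a run. A sketch only: the prompt, source, and API key are placeholders, and the exact llm config keys may vary between scrapegraphai versions.

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_KEY",  # placeholder
        "model": "gpt-4o",
    },
    "verbose": True,    # log what each node does
    "headless": False,  # launch the browser with a visible window
}

graph = SmartScraperGraph(
    prompt="Return a list of pages with the company's contact details",
    source="https://www.some-actual-website.com",
    config=graph_config,
)
result = graph.run()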

sandeepchittilla commented 1 month ago

@ekinsenler Tried this with "headless": False and Xvfb, and it still yields the same results.

Does this have anything to do with SmartScraperGraph scraping only the source URL? Should I instead use DeepScraperGraph, which can go further? Or am I completely off? 🤔

ekinsenler commented 1 month ago

SmartScraperGraph doesn't search for URLs inside the HTML content, as far as I know. For that purpose I am using DeepScraperGraph, but you need to implement a filter at some level inside the search_link_node to prevent unrelated links from getting fetched (see the sketch below). I am also working on a similar project. DeepScraperGraph doesn't have a parameter to control max_depth, so scraping doesn't terminate even with filters such as dropping URLs outside the domain or filtering out different language versions of the website.
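
The kind of filter I mean is simple in isolation. A minimal sketch (the function is mine, not part of the library; wiring it into search_link_node is the part that requires touching scrapegraphai internals):

from urllib.parse import urlparse

def keep_link(url: str, allowed_domain: str) -> bool:
    """Keep links on the target domain, drop obvious other-language versions."""
    parsed = urlparse(url)
    # relative links stay within the current domain, so keep them
    if not parsed.netloc:
        return True
    if not parsed.netloc.endswith(allowed_domain):
        return False
    # crude language filter: drop paths like /de/..., /fr/...
    first_segment = parsed.path.strip("/").split("/")[0]
    return first_segment not in {"de", "fr", "es", "it"}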

sandeepchittilla commented 1 month ago

@ekinsenler Hmm... I'm curious: isn't this a functional example yet? There seems to be a max_depth parameter. However, I am getting errors running this scraper: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/2333b513aafae3c358225a8f82f6c01964c0514e/examples/openai/deep_scraper_openai.py

ekinsenler commented 1 month ago

I don't think max_depth is functional yet. DeepScraperGraph is also buggy: it requires an embedder_model that doesn't seem to function inside the code. I fixed the error locally and ran the graph, but it gets lost inside the URL tree, potentially looping between the same URLs.
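
The looping itself is preventable in principle with a visited set and a depth bound. A minimal sketch of what a working max_depth could look like (fetch_links is a hypothetical stand-in for whatever extracts links from a page, not a library function):

from collections import deque
from urllib.parse import urljoin

def crawl(start_url, fetch_links, max_depth=2):
    """Breadth-first crawl that never revisits a URL and stops at max_depth."""
    visited = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        yield url
        if depth >= max_depth:
            continue  # don't expand links beyond the depth bound
        for link in fetch_links(url):
            absolute = urljoin(url, link)
            if absolute not in visited:
                visited.add(absolute)
                queue.append((absolute, depth + 1))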

sandeepchittilla commented 1 month ago

You're right, I had the same issue. I suppose there's not much I can do now other than wait for the developers to push DeepScraperGraph to a release? I would be really interested if you get this working in your local branch :)

ekinsenler commented 1 month ago

I'm waiting on confirmation whether the devs are already working on a fix for this issue; otherwise I'm going to create a pull request.

f-aguzzi commented 4 weeks ago

@sandeepchittilla @ekinsenler I can confirm that the DeepScraper is broken and that your issues are caused by a problem on our side. I'll give you a better explanation in a few days, and then we'll see what can be done to fix the problem.

f-aguzzi commented 3 weeks ago

Let's get back to this.

Basically, a "deep scraper" with crawling capabilities has been on the roadmap for a while. A contributor built a system to implement it, but it was very heavy and slow: it didn't check for loops, it used a SmartScraperGraph instance on every page, and, more importantly, it introduced a signal-based approach that was hard to parallelize and required significant modifications to the existing graph engine. That design was therefore rejected.

Around that time, part of the team started working on a proper deep scraper with a more modular design that fit better within the existing framework. See #260 for more information. The work was left unfinished due to shifting priorities. The SearchLinkNode, for example, was intended to be a piece of the DeepScraperGraph pipeline.

The examples for the DeepScraperGraph are still based on deprecated code that relies on a RAG-based approach, which was removed around a month ago. We're currently focusing on cleaning up and refactoring our codebase, so most dead code will soon be removed, along with the broken examples. Hopefully we'll then get back to developing the deep scraper, as it would be a killer feature for the library. It's hard to implement, though, and there's already a lot of work on our plates.

Thanks for taking interest in this library, and in this topic in particular. We'll leave this issue open for now.

datashaman commented 3 weeks ago

In my Pydantic model, I add a description that tells it to use absolute URLs, and it works.

from pydantic import BaseModel, Field

class Content(BaseModel):
    # the description nudges the LLM toward absolute URLs
    url: str = Field(description="The absolute URL of the content")
    title: str
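
For what it's worth, the model can be passed straight into the graph so the field descriptions reach the LLM. A sketch, assuming your scrapegraphai version supports the schema parameter (the prompt, source, and API key are placeholders):

from scrapegraphai.graphs import SmartScraperGraph

graph = SmartScraperGraph(
    prompt="Extract the content of the page",
    source="https://www.some-actual-website.com",
    config={"llm": {"api_key": "YOUR_OPENAI_KEY", "model": "gpt-4o-mini"}},
    schema=Content,  # field descriptions become hints to the LLM
)
result = graph.run()
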
sandeepchittilla commented 3 weeks ago

@f-aguzzi thank you for taking the time to respond and explain the context! Understood!

@datashaman thanks for the response. Just to be clear: you request a response in your Pydantic class format (presumably as JSON) and send this as part of the prompt, where you specify the description for the field, yes?

edit: @datashaman additionally, may I ask which model you are calling?

datashaman commented 3 weeks ago

@sandeepchittilla correct, the field description hints to the LLM that it should use absolute URLs in the response. This was using the smart graph (not the deep one) with gpt-4o-mini as the LLM.