ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

OmniScraperGraph Invalid IPv6 URL on certain links #822

Open Graphiee opened 2 days ago

Graphiee commented 2 days ago

Hi guys, first of all, great library (but I guess you're already aware of that). A small issue I've encountered in OmniScraperGraph:

Describe the bug

When using certain links with OmniScraperGraph, the program raises an "Invalid IPv6 URL" error. The error is raised at parse.py, line 497.

Python 3.10.15
Scrapegraphai 1.31.1
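For context, the message matches the ValueError that Python's standard urllib.parse.urlsplit raises when the host part of a URL contains an unmatched square bracket. The issue does not say which parse.py is involved, so treating it as urllib's is an assumption. The sketch below only shows how that error can arise. The helper name try_split and both sample URLs are made up for illustration and are not taken from the scraped page.

```python
from urllib.parse import urlsplit


def try_split(url):
    """Return the parsed host, or the error text if urllib rejects the URL."""
    try:
        return urlsplit(url).netloc
    except ValueError as exc:
        return f"ValueError: {exc}"


# A well-formed URL parses normally.
print(try_split("https://justjoin.it/job-offer/example"))

# A hypothetical malformed link: "[" in the host without a closing "]".
# urllib treats brackets as IPv6 literal delimiters and rejects it.
print(try_split("https://[broken-link/path"))
```

If this is the cause, the failure would depend on the links found in the page content rather than on the source URL itself, which could explain why only some pages trigger it.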

Code to reproduce

from scrapegraphai.graphs import OmniScraperGraph
import os

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": True,
}

url = "https://justjoin.it/job-offer/panowie-programisci-timetable-optimization-specialist-warszawa-other"
prompt = "Get information about the job offer."
smart_scraper_graph = OmniScraperGraph(
    prompt=prompt,
    source=url,
    config=graph_config
)

result = smart_scraper_graph.run()

On the other hand, running it on https://nofluffjobs.com/pl/job/experienced-linux-engineer-comscore-via-cc-remote-wffuvhi5 works flawlessly. I assume this might be a domain-specific issue, but

Levyathanus commented 1 day ago

Hello, I've opened a PR for this issue that fixes it and tries to improve link extraction in the parse_node.
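As a rough illustration of what more defensive link extraction could look like (this is not the code from the PR, and safe_absolute_links is a hypothetical helper, not part of Scrapegraph-ai), one option is to resolve each extracted href against the page URL and skip any that urllib cannot parse:

```python
from urllib.parse import urljoin, urlsplit


def safe_absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, skipping any that urllib cannot parse."""
    links = []
    for href in hrefs:
        try:
            absolute = urljoin(base_url, href)
            urlsplit(absolute)  # validate the resolved URL as well
        except ValueError:
            # Malformed link (e.g. unmatched "[" in the host): skip it
            # instead of letting one bad href abort the whole run.
            continue
        links.append(absolute)
    return links


# Hypothetical hrefs, one of them malformed.
print(safe_absolute_links(
    "https://example.com/jobs/",
    ["/about", "http://[oops/x", "offer-1"],
))
```

Skipping bad links silently is a design choice made here for brevity. Logging them would make such cases easier to diagnose.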