Unable to properly scrape certain web pages (i.e. large number or clients / products / office locations).

georgihacker commented 1 month ago

We would like to extract all the clients / products / offices of certain companies. ScrapeGraph-AI gives only partial results unfortunately when the elements are more than 25-30. Is it possible to fix this and get more reliable results?

For example, we get only 47 office locations for BCG, while there are many more in reality.

prompt = """
Extract all addresses from the specified URL in a CSV table format – only consider data from the URL provided, 
not any other sources or URLs: office name / legal entity (use what is shown on website before street address and combine in one cell), 
street, zip code, city, country, source page link. Do not include P.O. box or post office box addresses. 
"""

url = "https://www.bcg.com/offices/default"

graph_config = {
    "llm": {
        "api_key": XXX,
        "model": "gpt-4o",
        "temperature": 0,
    },
    "verbose": True,
    "headless": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt=prompt_,
    source=url_,
    config=graph_config
)
result = smart_scraper_graph.run()
if result is not None and 'addresses' in result:
    result = result['addresses']
    result = pd.DataFrame(result)

print(result)
print(result.shape)

--- Executing Fetch Node --- --- (Fetching HTML from: https://www.bcg.com/offices/default) --- --- Executing Parse Node --- --- Executing RAG Node --- --- (updated chunks metadata) --- --- (tokens compressed and vector stored) --- --- Executing GenerateAnswer Node --- Processing chunks: 100%|██████████| 1/1 [01:00<00:00, 60.46s/it] office_name_legal_entity ... source_page_link 0 BCG Cairo ... https://www.bcg.com/offices/cairo 1 BCG Casablanca ... https://www.bcg.com/offices/casablanca/default 2 Platinion – Casablanca ... https://www.bcg.com/offices/platinion-casablan... 3 BCG Johannesburg ... https://www.bcg.com/offices/johannesburg/default 4 Platinion – Johannesburg ... https://www.bcg.com/offices/platinion-johannes... 5 BCG Lagos ... https://www.bcg.com/offices/lagos 6 BCG Nairobi ... https://www.bcg.com/offices/nairobi 7 BCG Auckland ... https://www.bcg.com/offices/auckland/default 8 BCG Bangkok ... https://www.bcg.com/offices/bangkok/default 9 Platinion - Bangkok ... https://www.bcg.com/offices/platinion-bangkok/... 10 BCG Beijing ... https://www.bcg.com/offices/beijing/default 11 ACC – Bengaluru ... https://www.bcg.com/offices/acc-bengaluru 12 BCG Bengaluru ... https://www.bcg.com/offices/bengaluru 13 Platinion – Bengaluru ... https://www.bcg.com/offices/platinion-bengalur... 14 BCG Canberra ... https://www.bcg.com/offices/canberra/default 15 Platinion – Canberra ... https://www.bcg.com/offices/platinion-canberra... 16 BCG Chennai ... https://www.bcg.com/offices/chennai/default 17 BCG Fukuoka ... https://www.bcg.com/offices/fukuoka/default 18 BCG Ho Chi Minh City ... https://www.bcg.com/offices/ho-chi-minh-city/d... 19 Platinion - Ho Chi Minh City ... https://www.bcg.com/offices/platinion-ho-chi-m... 20 BCG Hong Kong ... https://www.bcg.com/offices/hong-kong/default 21 Platinion – Hong Kong ... https://www.bcg.com/offices/platinion-hong-kon... 22 BCG Jakarta ... https://www.bcg.com/offices/jakarta/default 23 BCG Kuala Lumpur ... https://www.bcg.com/offices/kuala-lumpur/default 24 BCG Kyoto ... https://www.bcg.com/offices/kyoto/default 25 BCG Manila ... https://www.bcg.com/offices/manila 26 BCG Melbourne ... https://www.bcg.com/offices/melbourne/default 27 Platinion – Melbourne ... https://www.bcg.com/offices/platinion-melbourn... 28 BCG Mumbai - Bandra Kurla Complex ... https://www.bcg.com/offices/mumbai-bkc 29 BCG Mumbai - Nariman Point ... https://www.bcg.com/offices/mumbai/default 30 Platinion – Mumbai ... https://www.bcg.com/offices/platinion-mumbai/d... 31 BCG Nagoya ... https://www.bcg.com/offices/nagoya/default 32 ACC – Gurugram ... https://www.bcg.com/offices/acc-gurugram 33 BCG New Delhi ... https://www.bcg.com/offices/new-delhi/default 34 Platinion – New Delhi ... https://www.bcg.com/offices/platinion-new-delh... 35 BCG Osaka ... https://www.bcg.com/offices/osaka/default 36 BCG Perth ... https://www.bcg.com/offices/perth/default 37 BCG Seoul ... https://www.bcg.com/offices/seoul/default 38 BCG Shanghai ... https://www.bcg.com/offices/shanghai/default 39 BCG Shenzhen ... https://www.bcg.com/offices/shenzhen/default 40 BCG Singapore ... https://www.bcg.com/offices/singapore/default 41 Expand Research – Singapore ... https://www.bcg.com/offices/expand-research-si... 42 ValueScience Center – Singapore ... https://www.bcg.com/offices/value-science-cent... 43 Platinion - Singapore ... https://www.bcg.com/offices/platinion-singapor... 44 BCG Sydney ... https://www.bcg.com/offices/sydney/default 45 Platinion – Sydney ... https://www.bcg.com/offices/platinion-sydney/d... 46 BCG Taipei ... NaN

[47 rows x 6 columns] (47, 6)

beiyanpiki commented 1 month ago

Same issue here, It looks like the model hit its 4096 output token limit, which sometimes causes fail to dump JSON during output.

Is there any way such as using continuous dialogue to allow the model to continue its output?

VinciGit00 commented 1 month ago

Why you don't split the call in 2 parts?

georgihacker commented 1 month ago

Why you don't split the call in 2 parts?

How to do the split @VinciGit00?

VinciGit00 commented 1 month ago

hi, please try with the new beta, we should have solved the problem

ScrapeGraphAI / Scrapegraph-ai

Unable to properly scrape certain web pages (i.e. large number or clients / products / office locations). #441