ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

result is empty for any url from domain http://www.mckinsey.com #501

Closed regismvargas closed 1 month ago

regismvargas commented 1 month ago

Describe the bug All URLs from this domain return an empty result

To Reproduce Domain: http://www.mckinsey.com

URLs tested and not working: https://www.mckinsey.com/features/mckinsey-center-for-future-mobility/our-insights/autonomous-vehicles-moving-forward-perspectives-from-industry-leaders https://www.mckinsey.com/industries/automotive-and-assembly/our-insights/autonomous-drivings-future-convenient-and-connected

Prompt: Summarize and find the main topics

My code:

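# Imports needed by this snippet (GEMINI_API_KEY must be defined beforehand)
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# Configure the graph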

graph_config = {
    "llm": {
        "api_key": GEMINI_API_KEY,
        "model": "gemini-pro",
    },
    "verbose":True,
    "headless":True,
    "max_results": True
}

# Run SmartScraperGraph instance
my_prompt = "Summarize and find the main topics"

smart_scraper_graph = SmartScraperGraph(
    prompt=my_prompt,
    # also accepts a string with the already downloaded HTML code
    source="https://www.mckinsey.com/features/mckinsey-center-for-future-mobility/our-insights/autonomous-vehicles-moving-forward-perspectives-from-industry-leaders",
    config=graph_config
)

# Run the graph
result = smart_scraper_graph.run()
print(result)

# Get graph execution info
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

Steps to reproduce the behavior:

This is the output I get for McKinsey URLs:

--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.mckinsey.com/features/mckinsey-center-for-future-mobility/our-insights/autonomous-vehicles-moving-forward-perspectives-from-industry-leaders) ---
--- Executing Parse Node ---
--- Executing GenerateAnswer Node ---
Processing chunks:   0%|          | 0/1 [00:02<?, ?it/s]
{'answer': 'I apologize, but I am unable to summarize and find the main topics of the provided content as it is empty.'}
        node_name  total_tokens  prompt_tokens  completion_tokens  \
0           Fetch             0              0                  0   
1           Parse             0              0                  0   
2  GenerateAnswer           269            238                 31   
3    TOTAL RESULT           269            238                 31   

   successful_requests  total_cost_USD  exec_time  
0                    0             0.0   1.640176  
1                    0             0.0   0.002383  
2                    1             0.0   2.054185  
3                    1             0.0   3.696744 

Expected behavior

From the URL: "https://www.precedenceresearch.com/autonomous-vehicle-market"
{'Autonomous Vehicle Market Size, Share, and Trends 2024 to 2034': {'Main Topics': ['Autonomous Vehicle Market Size and Growth 2024 to 2033', 'Autonomous Vehicle Market Key Takeaways', 'Autonomous Vehicle Market Growth Factors', 'Report Scope of the Autonomous Vehicle Market', 'Autonomous Vehicle Market Drivers', 'Autonomous Vehicle Market Opportunities', 'Autonomous Vehicle Market Restraint', 'Autonomous Vehicle Market Challenge', 'Regional Insights', 'Application Insights', 'Vehicle Type Insights', 'Level of Autonomy Insights', 'Application Insights', 'Autonomous Vehicle Market Companies', 'Segments Covered in the Report', 'Frequently Asked Questions'], 'Summary': 'The global autonomous vehicle market size was estimated USD 158.31 billion in 2023 and is projected to hit around USD 2,752.80 billion by 2033, poised to grow at a compound annual growth rate (CAGR) of 33% from 2024 to 2033.\n\nU.S. autonomous vehicle market was valued at USD 59.92 billion in 2023.\n\nThe Asia-Pacific region is expected to hit at a CAGR of 35% from 2024 to 2033.\n\nBy application, the transportation segment accounted largest revenue share of 93.57% in 2023.\n\nBy vehicle type, the passenger segment accounted for 74.29% of revenue share in 2023.\n\nBy propulsion type, the semi-autonomous vehicle segment accounted for 95.13% of revenue share in 2023.\n\nBy transportation, the commercial transportation segment has accounted revenue share of 84.98% in 2023.\n\nBy Level of Automation, the Level 2 segment has accounted revenue share of 40.29% in 2023.'}}

f-aguzzi commented 1 month ago

This website might have anti-scraping protection, which is usually triggered by headless browsers. Try setting "headless":False in graph_config and let us know if you get a different result.
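
For reference, the suggested change can be sketched as below. This is a minimal illustration that reuses the keys from the config posted above; the helper name `with_headless` and the placeholder API key are assumptions made for the example and are not part of the library.

```python
import copy


def with_headless(config, headless):
    """Return a copy of a graph config with the "headless" flag set.

    Hypothetical helper for illustration only; the original config is
    left unchanged.
    """
    updated = copy.deepcopy(config)
    updated["headless"] = headless
    return updated


# Same shape as the config in the report; the key value is a placeholder.
base_config = {
    "llm": {"api_key": "YOUR_GEMINI_API_KEY", "model": "gemini-pro"},
    "verbose": True,
    "headless": True,
}

# Config to pass to SmartScraperGraph(..., config=headed_config)
headed_config = with_headless(base_config, False)
print(headed_config["headless"])
```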

regismvargas commented 1 month ago

Thanks for your answer. It worked, but not as expected. See details below.

First: the original problem

"headless":False leads to another error:

╔════════════════════════════════════════════════════════════════════════════════════════════════╗
║ Looks like you launched a headed browser without having a XServer running.                     ║
║ Set either 'headless: true' or use 'xvfb-run <your-playwright-app>' before running Playwright. ║
║                                                                                                ║
║ <3 Playwright Team                                                                             ║
╚════════════════════════════════════════════════════════════════════════════════════════════════╝

I worked around it this way:

!apt install xvfb 
!pip install pyvirtualdisplay 
import pyvirtualdisplay 
display = pyvirtualdisplay.Display().start()

source and credit: https://colab.research.google.com/drive/1or8DtXZP8ZxJYK52me0dA6O9A1dXKKOE?usp=sharing

Worked.

Second: the 'result' contains many "\n" sequences and other markup

I got the following:

{'title': 'Autonomous vehicles moving forward: Perspectives from industry leaders', 'description': 'McKinsey’s 2023 global executive survey on autonomous driving reveals that despite recent uncertainties, the autonomous-vehicle industry is beginning to take shape.', 'content': '**2023 was a tipping point** for the autonomous-vehicle industry. Although leading players were able to successfully run and scale first commercial operations and increase their funding, others saw significant setbacks, stopped or reduced their operations, or exited the market entirely. This in mind, there is still much to be done before the autonomous-vehicle industry is fully mature—but how much?\n\n## About the authors\nThis article is a collaborative effort by Derek Chiao, [Johannes Deichmann](http://www.mckinsey.com/our-people/johannes-deichmann), [Kersten Heineke](http://www.mckinsey.com/our-people/kersten-heineke), Ani Kelkar, Martin Kellner, Elizabeth Scarinci, and Dmitry Tolstinev, representing views from McKinsey’s Automotive and Assembly Practice and the McKinsey Center for Future Mobility.\n\nThis past summer, the McKinsey Center for Future Mobility conducted a follow-up to its 2021 survey of industry decision makers (see sidebar “Survey methodology”).[1Kersten Heineke, Ruth Heuss, Ani Kelkar, and Martin Kellner, “[What’s next for autonomous vehicles?](http://www.mckinsey.com/features/mckinsey-center-for-future-mobility/our-insights/whats-next-for-autonomous-vehicles),” McKinsey, December 22, 2021.](javascript:void\\(0\\);)\n\nOur 2023 survey revealed that much has changed in this dynamic sector in the past two years: regional expectations are shifting, timelines for autonomous-vehicle development are extending, and needed investments are increasing. 
Other results reveal new opportunities for autonomous-vehicle manufacturers, such as more diversified markets and technologies with margins of 17 percent or more.\n\n## Survey methodology\nThe McKinsey Center for Future Mobility, in partnership with The Autonomous, conducts a biannual survey of leaders in the autonomous-driving industry, which took place from June to August 2023. The 2023 survey included 86 decision makers from around the globe (40 from North America, 37 from the European Union, three from China, and six from other regions). They represented some of the world’s largest software and automotive corporations, as well as prominent start-ups and supporting institutions such as universities and mapping and navigation companies. These decision makers ranged from chief experience officers and heads of strategy to systems architects and vice presidents of engineering, together presenting a holistic view on the state of the industry. In some instances, results have been combined with the 2021 baseline to provide data that is analytically rich. In this article, we offer updated insights from industry leaders in key categories: regional and market diversification, predicted timelines, expected bottlenecks, the size of needed investments, profitability of autonomous-vehicle components, and monetization models. These results shine a light on how the autonomous-vehicle industry could take shape in the years and decades to come.\n\n## Players expect regional and market diversification\n\n## About the McKinsey Center for Future Mobility\n**These insights were developed** by the McKinsey Center for Future Mobility (MCFM). Since 2011, MCFM has worked with stakeholders across the mobility ecosystem by providing independent and integrated evidence about possible future-mobility scenarios. 
With our unique, bottom-up modeling approach, our insights enable an end-to-end analytics journey through the future of mobility—from consumer needs to a modal mix across urban and rural areas, sales, value pools, and life cycle sustainability. [Contact us](mmip@mckinsey.com) if you are interested in getting full access to our market insights via the McKinsey Mobility Insights Portal.\n\nMost survey respondents predict that three or less companies will capture a dominant share of the market. The North American market is expected to be the most fragmented, with only 15 percent of respondents expecting that the market will be dominated by one or two players. By contrast, 38 percent of respondents predict that the European market will be dominated by two or fewer players. Predictions for the race to full autonomy are also shifting: while 58 percent of 2021 survey participants believed that North America would be the first to deploy Level 4 (L4) highway pilots, 2023 respondents were evenly split between believing China or North America would be first. This is evidence of China’s progress in the autonomous-vehicle race, driven by factors such as robust government backing; heightened investments in research and data availability; and a receptive consumer attitude toward adopting new technology.\n\nExhibit 1 ![Most survey respondents expect the autonomous-vehicle market to be dominated by more than two players.](http://www.mckinsey.com/~/media/mckinsey/features/mckinsey%20center%20for%20future%20mobility/our%20insights/autonomous%20vehicles%20moving%20forward%20perspectives%20from%20industry%20leaders/svgz-autonomousvehiclessurvey-ex1.svgz?cq=50&cpy;=Center)\n\nWe strive to provide individuals with disabilities equal access to our website. If you would like information about this content we will be happy to work with you. 
Please email us at: [McKinsey_Website_Accessibility@mckinsey.com](McKinsey_Website_Accessibility@mckinsey.com)\n\n## The timeline for autonomous-vehicle development is extending\n\n## Stages of autonomous-vehicle development\nSAE International, a global professional association that develops engineering standards, splits autonomous-vehicle development into five levels, referred to as Level 0 (L0) through Level 5 (L5).[1 _SAE Blog_ , “SAE Levels of Driving Automation™ refined for clarity and international audience,” SAE International, May 3, 2021.](javascript:void\\(0\\);)\n\nL0 through Level 2 require humans to drive and constantly monitor automated support systems, which include warning systems, braking and acceleration, and steering. Level 3 (L3) vehicles are the highest level of automation widely available to consumers today. At this level, a car can operate independently, but systems can request that a driver take over at any time. These systems can operate only in certain conditions, such as during traffic jams. Level 4 (L4) vehicles, which include driverless taxis, are currently being tested, developed, and deployed. Unlike L3 vehicles, L4 vehicles function without a driver who is ready to take over. L5 vehicles are fully autonomous in any environment and under all conditions. These vehicles are the final frontier for autonomous-vehicle development.\n\nThe adoption timeline for autonomous vehicles has slipped by two to three years on average across all autonomy levels relative to the 2021 survey (see sidebar “Stages of autonomous-vehicle development”). According to this year’s survey, L4 robo-taxis are now expected to become commercially available at a large scale by 2030, and fully autonomous trucking is expected to reach viability between 2028 and 2031. This may be due to ongoing technical obstacles and challenges with capital availability. 
In addition, regulatory challenges persist as autonomous-vehicle regulations are still being developed and enacted.\n\nDespite these projections, well-funded pioneers are pushing ahead and moving to expand deployment across geographies.\n\nExhibit 2 ![Timelines for Level 4 and Level 5 autonomous-vehicle use cases have extended by two to three years on average.](http://www.mckinsey.com/~/media/mckinsey/features/mckinsey%20center%20for%20future%20mobility/our%20insights/autonomous%20vehicles%20moving%20forward%20perspectives%20from%20industry%20leaders/svgz_autonomousvehiclessurvey-exs_ex2-v6.svgz?cq=50&cpy;=Center)\n\nWe strive to provide individuals with disabilities equal access to our website. If you would like information about this content we will be happy to work with you. Please email us at: [McKinsey_Website_Accessibility@mckinsey.com](McKinsey_Website_Accessibility@mckinsey.com)\n\n## Regulation, technology, and consumer safety are key bottlenecks and considerations for development\n\nAbout 60 percent of respondents still believe regulation is the biggest bottleneck to autonomous-vehicle adoption, the same relative importance as in the 2021 survey. However, respondents this year reported an increased focus on technology, rising from an average of 26 percent in 2021 to an average of 32 percent in 2023. Though experts do not believe consumer demand will be the main impediment to adoption, autonomous-vehicle players still have important considerations to take into account to ensure consumer uptake. Two-thirds of respondents see improved safety as a key consideration for consumers. Productivity (the ability to multitask while driving) and comfort are anticipated to be secondary considerations in customers’ willingness to pay.\n\nExhibit 3 ![For leaders in the autonomous-'}}

As I saw in the documentation, the return should be clean, plain text. What am I doing wrong or missing here?

f-aguzzi commented 1 month ago

I'm glad the first solution worked, at least partially.

"headless":False implies that Playwright will open a graphical instance of Chromium, which only works in a graphical user environment. That's why you had to set up a virtual display of some sort (I guess you're using Colab?). Maybe we'll make a new Colab example about this. Thanks for the input.
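
A rough way to decide up front whether a virtual display is needed is sketched below. It assumes that a headed browser on Linux needs the `DISPLAY` environment variable to point at an X server, which matches the Playwright message above; the function names are hypothetical, and `start_virtual_display` relies on the third-party `pyvirtualdisplay` package plus an installed Xvfb, as in the workaround posted earlier.

```python
import os
import sys


def needs_virtual_display(headless, environ=None, platform=None):
    """Guess whether a headed browser would fail for lack of an X server.

    Assumption: only Linux sessions without DISPLAY set (Colab, CI
    runners, SSH sessions) need a virtual display.
    """
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform
    if headless:
        return False  # headless browsers do not need a display
    if not platform.startswith("linux"):
        return False  # desktop macOS/Windows sessions have a display
    return not environ.get("DISPLAY")


def start_virtual_display():
    """Start an Xvfb-backed display (requires pyvirtualdisplay and Xvfb)."""
    from pyvirtualdisplay import Display  # imported lazily: optional dependency
    return Display().start()


if __name__ == "__main__":
    if needs_virtual_display(headless=False):
        print("No X server detected: call start_virtual_display() first")
    else:
        print("A display is available (or not required)")
```

Keeping the check separate from the call that starts Xvfb means the same script can run unchanged on a desktop machine and in a notebook.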

The second part of the problem is harder to tackle. I confirm that the output should be plain text. However, the answer is generated by an LLM, with all the problems that may stem from that, and our library uses the same system prompt for all LLMs. It took weeks of tinkering with the prompts just to reduce the amount of invalid JSON responses, and still, sometimes the output looks weird. Unless we come up with separate prompts for each model, or with a custom LLM fine-tuned for scraping, this kind of unexpected behavior will keep on showing up from time to time.
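
Until the prompts improve, one possible workaround is to clean the returned strings on the caller's side. The sketch below is not part of the library: `to_plain_text` and `clean_result` are hypothetical helpers that strip the most common Markdown constructs seen in the output above (headings, bold markers, links, images) with regular expressions and collapse newlines. It does not handle nested brackets or escaped parentheses, such as the footnote links in the sample. Note also that the "\n" sequences appear literally only because printing a dict shows the escaped form of each string; the values themselves contain real newline characters.

```python
import re


def to_plain_text(text):
    """Strip simple Markdown markup and collapse whitespace (best effort)."""
    # Images: ![alt](url) -> alt
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Links: [label](url) -> label
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Headings: leading "#", "##", ... at the start of a line
    text = re.sub(r"^#{1,6}[ \t]*", "", text, flags=re.MULTILINE)
    # Bold: **text** -> text
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    # Newlines and repeated spaces -> single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def clean_result(result):
    """Apply to_plain_text to every string value of a result dict."""
    return {
        key: to_plain_text(value) if isinstance(value, str) else value
        for key, value in result.items()
    }


sample = {
    "title": "Autonomous vehicles moving forward",
    "content": (
        "**2023 was a tipping point** for the industry.\n\n"
        "## About the authors\n"
        "By [Johannes Deichmann](http://www.mckinsey.com/our-people/johannes-deichmann).\n\n"
        "Exhibit 1 ![Chart title.](http://example.com/a.svgz)"
    ),
}
cleaned = clean_result(sample)
print(cleaned["content"])
```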