How to scrape the dynamic pages like single page applications

cubiclesoft / ultimate-web-scraper

A PHP library/toolkit designed to handle all of your web scraping needs under an MIT or LGPL license. Also has web server and WebSocket server classes for building custom servers.


How to scrape the dynamic pages like single page applications #11

Closed ardnor closed 6 years ago

ardnor commented 6 years ago

Hello, how can I scrape a dynamic page that uses Angular or other related JavaScript frameworks?

Thanks,

Ronard

cubiclesoft commented 6 years ago

Yeah, those frameworks and JavaScript-only websites are basically broken by design. Heavy JavaScript sites make scraping tasks difficult but not necessarily impossible. When I'm scraping such websites, I first watch the network requests in a regular web browser, hoping the site serves up the content I want via either JSON or JSONP, and then attempt to replicate those requests using this toolkit. That bypasses the need to process HTML or JavaScript in the first place.

If you've got an example page, I can take a look. I probably need to update the documentation to better address JS-only sites, especially since Angular and React sites are frequently showing up on my radar.
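As a rough illustration of the "look for JSON or JSONP in the network requests" approach described above, here is a minimal Python sketch (standard library only, not this toolkit's API) that decodes a response body whether it arrives as plain JSON or wrapped in a JSONP callback. The function name `parse_json_or_jsonp` and the callback name `handleData` are hypothetical, and the sample bodies are made up for the example.

```python
import json
import re

# Matches a JSONP wrapper such as: callbackName({...});
# The callback name may contain letters, digits, "_", "$" and ".".
_JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_json_or_jsonp(text):
    """Decode a response body that is either plain JSON or JSONP.

    Hypothetical helper: if the body looks like callback(...), only the
    part inside the parentheses is passed to the JSON decoder.
    """
    match = _JSONP_RE.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)


# Made-up sample bodies, standing in for what the browser's Network tab shows.
plain_body = '{"status": "In transit"}'
jsonp_body = 'handleData({"status": "In transit"});'

plain_result = parse_json_or_jsonp(plain_body)
jsonp_result = parse_json_or_jsonp(jsonp_body)
```

Both calls yield the same dictionary, so the rest of a scraper would not need to care which of the two formats the site happens to serve.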

ardnor commented 6 years ago

Hello @cubiclesoft, thanks for the reply. Yes, it is really difficult when the web page is not static HTML. Here is an example page: https://www.checkmeout.ph/track/1284-1726-LHEB (the tracking page for my store). I would like to scrape the status because the platform doesn't provide an API for the information I need. I hope you can find a workaround for this.

Thank you.

cubiclesoft commented 6 years ago

I've been meaning to make a video for a while to demonstrate the various web scraping techniques I use and add it to the main documentation in this repository. This should answer your question:

https://www.youtube.com/watch?v=54tB8t1r0og

Mitch415 commented 4 months ago

This is the exact problem I'm having. Your linked YouTube video got me closer to my goal, but I'm still not quite there. Here is the full URL I'm trying to scrape: https://listentotaxman.com/?year=2024&taxregion=uk&age=0&time=1&ingr=55000 When you visit that link, the table on the right is populated with numbers after the page finishes loading; however, when viewing the source they are all still 0. Looking at the Network tab in Developer Tools, I was able to find my request and its response in a file called index.js.php, and the data I want is in it. It looks like JSON to me, and the data item I need is called net_pay, but so far I haven't been able to extract it. Please can you help?

cubiclesoft commented 4 months ago

@Mitch415 The request to the server probably needs to be a POST request with 'Content-Type: application/json'. The body of the content needs to be a JSON object.

The documentation already covers this: https://github.com/cubiclesoft/ultimate-web-scraper#sending-non-standard-requests
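For readers who want to see the shape of such a request outside the toolkit, the following is a minimal Python sketch using only the standard library; it is not this library's API. The endpoint URL, the payload field names (copied from the query string in the question), and the assumption that `net_pay` sits at the top level of the response are all guesses for illustration, not details confirmed in this thread.

```python
import json
import urllib.request


def build_json_post(url, payload):
    """Build a POST request whose body is a JSON object."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_field(response_text, field):
    """Decode a JSON response body and return one top-level field."""
    return json.loads(response_text)[field]


def fetch_field(url, payload, field):
    """Send the request and pull one field out of the reply (needs network)."""
    with urllib.request.urlopen(build_json_post(url, payload)) as response:
        return extract_field(response.read().decode("utf-8"), field)


# Hypothetical endpoint and payload; the real ones must be copied from the
# request shown in the browser's Network tab.
request = build_json_post(
    "https://example.com/index.js.php",
    {"year": 2024, "taxregion": "uk", "age": 0, "time": 1, "ingr": 55000},
)

# Made-up response body, used only to show the extraction step offline.
sample_net_pay = extract_field('{"net_pay": 1234.5}', "net_pay")
```

Calling `fetch_field(...)` with the real URL and payload would perform the actual request; it is left uncalled here so the sketch runs without network access. If the server rejects the request, comparing the headers and body against the browser's own request is usually the next step.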

Mitch415 commented 4 months ago

Thank you, I appreciate this. What I did in the meantime was look at the JS on that site, and I found I could post directly to it and read the response with a standard Ajax XMLHttpRequest.
