alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
MIT License
6.16k stars 648 forks

How to scrape a dynamic website? #71

Open vChavezB opened 2 years ago

vChavezB commented 2 years ago

I am trying to export a localhost website generated by this project:

https://github.com/HBehrens/puncover

The project serves a website on localhost; each time the user clicks a link, the server receives a GET request and generates the HTML for that page on the fly. In other words, the HTML only exists when a page is accessed through the browser, and at the moment the project has no way to export the site to HTML or PDF. For this reason I want to know how I could recursively follow all the hyperlinks and save a static HTML version. Would this be possible with autoscraper?
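Since the post describes a site whose pages are rendered server-side on each GET request, a plain HTTP crawler may already be enough to snapshot it. Below is a minimal, standard-library-only sketch; the function names and the fetch callback are mine, not part of autoscraper or puncover:

```python
# Hypothetical sketch: collect same-site links and crawl a server-rendered
# localhost site breadth-first, returning each page's HTML.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def same_site_links(html, base_url):
    """Return absolute URLs found in html that stay on base_url's host."""
    parser = LinkParser()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    urls = set()
    for href in parser.links:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == base_host:
            urls.add(absolute.split("#")[0])  # drop fragments
    return urls


def crawl(start_url, fetch):
    """Breadth-first crawl; fetch(url) -> html string. Returns {url: html}."""
    pages, queue = {}, [start_url]
    while queue:
        url = queue.pop(0)
        if url in pages:
            continue
        pages[url] = fetch(url)
        queue.extend(same_site_links(pages[url], url) - pages.keys())
    return pages
```

In real use the fetch callback could be as simple as `lambda u: urllib.request.urlopen(u).read().decode()`; the saved pages can then be written to files to produce the static HTML export.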

yafethtb commented 1 year ago

It seems no one has answered this yet, and I don't know whether the developers have seen it, but let me try to help. Looking at the scraper module, autoscraper uses static-scraping libraries, namely requests and BeautifulSoup. A dynamic website needs a browser engine to execute the JavaScript parts of the page. Python has libraries such as Selenium and Playwright that drive a browser engine to render dynamic pages and then extract the resulting HTML, but autoscraper doesn't use them. Maybe it will someday, maybe not; as of November 23rd, 2022, I don't see any dynamic-scraping library used in the core file of this program.
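To illustrate the render-then-extract approach described above, here is a hedged sketch using Playwright (one of the two libraries mentioned); the function name is mine, and it requires `pip install playwright` plus `playwright install chromium`:

```python
def fetch_rendered_html(url: str) -> str:
    """Open url in headless Chromium, let JavaScript run, and return the
    final HTML. Playwright is imported lazily so merely defining this
    function doesn't require the package to be installed."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)  # waits for the page load by default
        html = page.content()  # serialized DOM after JS execution
        browser.close()
        return html
```

The returned string can then be fed to any static parser (BeautifulSoup, or autoscraper itself) exactly as if it had come from requests.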

P.S: Correct me if I'm wrong.

lrq3000 commented 1 year ago

You can supply an html argument to scraper.build() to use the output of your preferred HTML fetcher, so with a bit of manual programming it should be compatible with Selenium.
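Putting that tip together with Selenium might look like the sketch below. The helper names are mine; the only part taken from the comment above is passing pre-fetched HTML via build()'s html argument. Requires `pip install selenium autoscraper` and a compatible browser driver:

```python
def fetch_rendered_html(url):
    """Fetch a page's HTML after JavaScript execution using headless Chrome.
    Selenium is imported lazily so defining this function doesn't require
    the package to be installed."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source  # DOM as rendered by the browser
    finally:
        driver.quit()


def build_scraper(url, wanted_list):
    """Train an AutoScraper on browser-rendered HTML instead of letting
    autoscraper fetch the page itself."""
    from autoscraper import AutoScraper

    scraper = AutoScraper()
    # The html argument makes build() parse this string rather than
    # issuing its own request to url.
    scraper.build(url=url, wanted_list=wanted_list,
                  html=fetch_rendered_html(url))
    return scraper
```

Usage would be along the lines of `build_scraper("http://localhost:5000/", ["some text visible on the page"])`, after which `scraper.get_result_similar(url, html=...)` can be called with freshly rendered HTML as well.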