Open dhoule opened 5 years ago
The only thing working right now is the login. My logic is wrong and I'm not able to scrape any pages yet, though. Working on it...
The site serves a different page to spiders than it does to actual users. The bastards!
I'm an idiot! The website updates its quantities and prices multiple times a day. Of course they would use JavaScript to change the innerHTML. I need to find a way for the spider to wait until everything is done doing its job before scraping...
A spider will not trigger the JavaScript, so I need to figure out the requests the JS makes to the server side, so the spider can make them directly.
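One way to find those requests is the browser dev tools Network tab; once the endpoint is known, the spider can hit it directly and skip the rendered page. A minimal sketch, assuming a hypothetical JSON endpoint and parameter names (none of these are RSR's real URLs or fields):

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint discovered in the browser's Network tab --
# the real path and parameter names would come from dev tools.
API_BASE = "https://example.com/api/stock"

def build_stock_request(item_number):
    """Build the URL and headers the page's JS would use for a stock lookup."""
    query = urlencode({"item": item_number})
    headers = {
        # Many sites check this header to distinguish XHR from page loads.
        "X-Requested-With": "XMLHttpRequest",
        "Accept": "application/json",
    }
    return f"{API_BASE}?{query}", headers

def parse_stock_response(body):
    """Parse the JSON the endpoint returns instead of scraping innerHTML."""
    data = json.loads(body)
    return data["quantity"], data["price"]

url, headers = build_stock_request("ITEM-123")
sample = '{"quantity": 12, "price": 89.99}'
parsed = parse_stock_response(sample)
```

If the JS really does fetch JSON like this, parsing it is far more reliable than waiting on innerHTML updates.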
Need to look more into scrapy-splash
To use scrapy-splash, you have to be able to run Splash, which runs in Docker.
To start the Splash Docker container: `docker run -it -p 8050:8050 --rm scrapinghub/splash`. It's listening at http://0.0.0.0:8050.
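Splash also exposes a plain HTTP API, so the container can be sanity-checked before wiring up scrapy-splash: its `render.html` endpoint loads a URL, runs the page's JavaScript, and returns the rendered HTML. A sketch of building such a request against the container above (the target URL is a placeholder):

```python
from urllib.parse import urlencode

SPLASH = "http://0.0.0.0:8050"

def splash_render_url(target, wait=2.0):
    """Build a Splash render.html URL: Splash loads `target`, runs its
    JavaScript, waits `wait` seconds, then returns the rendered HTML."""
    query = urlencode({"url": target, "wait": wait})
    return f"{SPLASH}/render.html?{query}"

render = splash_render_url("https://example.com/catalog")
```

Fetching that `render` URL with any HTTP client should return HTML with the JS-updated quantities and prices already in place.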
installed scrapy-splash
RSR's login feature itself uses JavaScript. The scrapy-splash route seems to be the best one, but I need to look more into this Lua language, maybe.
The fact that the login depends on JavaScript would explain why I keep getting pages returned as if I'm not logged in.
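Since the login itself runs through JavaScript, the Lua side might look something like this: a script for Splash's `/execute` endpoint that loads the login page, fills the form, and returns the logged-in HTML. Everything here is a hedged sketch -- the selectors and field names are made up, and the real ones would come from inspecting RSR's login form.

```python
# The Lua below uses Splash's scripting API (splash:go, splash:wait,
# splash:select, splash:html). All selectors ("#username" and friends)
# are hypothetical placeholders, not RSR's actual markup.
LOGIN_LUA = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(1.0)
    -- fill in the login form (placeholder selectors)
    splash:select('#username'):send_text(args.user)
    splash:select('#password'):send_text(args.password)
    splash:select('#login-button'):mouse_click()
    splash:wait(2.0)
    -- return the page as seen by a logged-in user
    return splash:html()
end
"""

def execute_payload(url, user, password):
    """Build the JSON body for a POST to Splash's /execute endpoint."""
    return {
        "lua_source": LOGIN_LUA,
        "url": url,
        "user": user,
        "password": password,
    }

payload = execute_payload("https://example.com/login", "me", "hunter2")
```

From the spider side, scrapy-splash can send the same script with `SplashRequest(url, endpoint='execute', args={'lua_source': LOGIN_LUA, ...})`, which would keep the logged-in session for later requests.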
Instead of following the links of a page, this time there needs to be a list of things to "search" for. This will be done by looping over an array and doing some string manipulation. Each resulting page will then be scraped.
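That loop can be sketched as plain string manipulation: iterate over an array of search terms, build a search URL for each, and hand each resulting page to the scraper. The URL pattern and the terms here are guesses, not RSR's actual search path:

```python
from urllib.parse import quote_plus

# Hypothetical search URL pattern -- the real one would come from
# watching what the site's search box actually requests.
SEARCH_URL = "https://example.com/search?term={}"

# Placeholder search terms; the real list is whatever needs scraping.
search_terms = ["bolt carrier group", "lower receiver", "magazine"]

def search_urls(terms):
    """Build one search URL per term; each resulting page gets scraped."""
    return [SEARCH_URL.format(quote_plus(term)) for term in terms]

urls = search_urls(search_terms)
```

In a Scrapy spider these would be yielded from `start_requests()`, one `Request` (or `SplashRequest`) per generated URL.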
A couple things that will be scraped:
Will be updated accordingly...