Need to start pointing the spider towards the actual website

dhoule / fra-spider

My first, self created spider. Much testing is needed.


Need to start pointing the spider towards the actual website #6

Open dhoule opened 5 years ago

dhoule commented 5 years ago

Instead of following the links on a page, this time there needs to be a list of things to "search" for. This will be done by looping over an array and using string manipulation. Each resulting page will then be scraped.
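A minimal sketch of that loop, assuming the search is a plain GET URL with the term in the query string. The URL template and the search terms below are placeholders, since the real ones are not given in this thread.

```python
from urllib.parse import quote_plus

# Hypothetical values: the real site URL and search terms are not in this issue.
SEARCH_URL_TEMPLATE = "https://example.com/search?q={term}"
SEARCH_TERMS = ["first item", "second item"]


def build_search_urls(terms, template=SEARCH_URL_TEMPLATE):
    """Return one search URL per term, with each term URL-encoded."""
    return [template.format(term=quote_plus(term.strip())) for term in terms]


if __name__ == "__main__":
    for url in build_search_urls(SEARCH_TERMS):
        print(url)
```

In the spider, each of these URLs would presumably become one request in start_requests, in place of the link-following logic.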

A couple things that will be scraped:

# This needs to be found before proceeding any further
h1 id="page-title"
  span [value being looked for]

li class="list-group-item small"
  span [value being looked for]
  span [value being looked for]

Will be updated accordingly...
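A rough sketch of pulling those two pieces out of a page, using only the standard library so it runs without Scrapy. Inside a Scrapy callback the equivalent selectors would be response.css('h1#page-title span::text') and response.css('li.list-group-item.small span::text'). The class and function names here are made up for illustration.

```python
from html.parser import HTMLParser


class ItemPageParser(HTMLParser):
    """Collect <span> text under h1#page-title and li.list-group-item.small."""

    def __init__(self):
        super().__init__()
        self.title = []       # span text found inside h1#page-title
        self.list_items = []  # one list of span texts per matching <li>
        self._context = None  # "title", "item", or None
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if tag == "h1" and attrs.get("id") == "page-title":
            self._context = "title"
        elif tag == "li" and {"list-group-item", "small"} <= classes:
            self._context = "item"
            self.list_items.append([])
        elif tag == "span" and self._context:
            self._in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag in ("h1", "li"):
            self._context = None

    def handle_data(self, data):
        text = data.strip()
        if not (self._in_span and text):
            return
        if self._context == "title":
            self.title.append(text)
        elif self._context == "item":
            self.list_items[-1].append(text)


def scrape(html):
    """Return (title_spans, list_item_spans), or None if the title is missing."""
    parser = ItemPageParser()
    parser.feed(html)
    parser.close()
    if not parser.title:
        return None  # h1#page-title has to be found before going further
    return parser.title, parser.list_items
```

This does not handle nested lists or spans inside spans; it is only meant to show the shape of the data being looked for.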

dhoule commented 5 years ago

The only thing working right now is the login. My logic is wrong, though, and I'm not able to scrape any pages. Working on it...

dhoule commented 5 years ago

The website serves a different page to spiders than it does to actual users. The bastards!

dhoule commented 5 years ago

I'm an idiot! The website updates its quantities and prices multiple times a day. Of course they would use JavaScript to change the innerHTML. I need to find a way for the spider to wait until the JavaScript has finished before scraping...

dhoule commented 5 years ago

A spider will not trigger the JavaScript to run. So I need to figure out which requests the JS makes to the server, so the spider can make them itself.
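A sketch of what replaying one of those requests could look like, assuming the page's JS fetches JSON from some endpoint. The endpoint URL below is purely a placeholder; the real one has to be read from the browser's network tab. In the spider, the same URL and headers would go into a scrapy.Request.

```python
import json
from urllib.request import Request

# Hypothetical endpoint: not taken from the actual site.
DATA_ENDPOINT = "https://example.com/api/item?id={item_id}"


def build_data_request(item_id):
    """Build the request the page's JavaScript would send for one item."""
    return Request(
        DATA_ENDPOINT.format(item_id=item_id),
        headers={
            "Accept": "application/json",
            # Some sites check this header to recognise XHR calls.
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def parse_data_response(body):
    """Decode the JSON body returned by the endpoint (bytes or str)."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return json.loads(body)
```

Nothing is sent here; the request object is only built, so the headers and URL can be compared against what the browser sends.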

dhoule commented 5 years ago

Need to look more into scrapy-splash

dhoule commented 5 years ago

To use scrapy-splash, I have to be able to run Splash, which in turn runs in Docker. Yay! Creating an account with Docker now.

dhoule commented 5 years ago

To start the Docker Splash container, run: docker run -it -p 8050:8050 --rm scrapinghub/splash. It's listening at http://0.0.0.0:8050.

dhoule commented 5 years ago

Installed scrapy-splash.
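For reference, the settings.py additions that the scrapy-splash README asks for look roughly like this. The SPLASH_URL value assumes the container from the previous comment is running on the same machine with port 8050 published.

```python
# settings.py additions for scrapy-splash (per its README).

# Assumes the Splash container is reachable locally on the published port.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```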

dhoule commented 5 years ago

RSR's login feature itself uses JavaScript. The scrapy-splash route seems to be the best one, but I may need to look more into this Lua language.
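A minimal sketch of what a Lua script for Splash looks like, kept as a Python string so it can be handed to scrapy-splash. It only loads the page, waits, and returns the rendered HTML and cookies; it does not attempt the login, and the wait time is a guess. The helper name build_splash_args is made up for this example.

```python
# Lua script run by Splash's "execute" endpoint.
LUA_RENDER_SCRIPT = """
function main(splash, args)
  -- Load the page, then give its JavaScript time to update the DOM.
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return {
    html = splash:html(),
    cookies = splash:get_cookies(),
  }
end
"""


def build_splash_args(wait_seconds=2.0):
    """Return the args dict for a SplashRequest using the script above."""
    return {"lua_source": LUA_RENDER_SCRIPT, "wait": wait_seconds}


# Intended use inside the spider (requires scrapy-splash to be installed):
#   from scrapy_splash import SplashRequest
#   yield SplashRequest(url, self.parse, endpoint="execute",
#                       args=build_splash_args())
```

The wait is a fixed delay, so it may need tuning depending on how long the page takes to fill in its quantities and prices.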

dhoule commented 5 years ago

The fact that the login feature of the webpage uses JavaScript would explain why I keep getting pages returned as if I'm not logged in.