alltheplaces / alltheplaces

A set of spiders and scrapers to extract location information from places that post their location on the internet.
https://www.alltheplaces.xyz

Sunglass Hut (Latin America locations) #6005

Open davidhicks opened 1 year ago

davidhicks commented 1 year ago

Brand name

Sunglass Hut

Wikidata ID

Q136311

Store finder url(s)

Chile: https://latam.sunglasshut.com/cl/tienda.php
Colombia: https://latam.sunglasshut.com/co/tienda.html
Peru: https://latam.sunglasshut.com/pe/tienda.php

srujan-landeri commented 1 year ago

Hi @davidhicks, the data behind the links you gave is not in JSON format. How do I extract it? Thanks.

davidhicks commented 1 year ago

This spider is of medium difficulty. A rough outline of the spider needed is:

start_urls = [
  "https://latam.sunglasshut.com/cl/tienda.php",
  "https://latam.sunglasshut.com/co/tienda.html",
  "https://latam.sunglasshut.com/pe/tienda.php",
]

Then for each start_url page, the HTML needs to be parsed to get state/province IDs:

def parse(self, response):
  state_ids = response.xpath('//select[@id="state"]/option/@value').getall()
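To see what that XPath selects without running Scrapy, here is a minimal standard-library sketch that collects the same option values. The sample markup is hypothetical (the real page's option values and labels may differ), and the empty placeholder option is an assumption worth checking, since the XPath would return its empty value too.

```python
from html.parser import HTMLParser


class StateOptionParser(HTMLParser):
    """Collect the value attribute of each <option> inside <select id="state">."""

    def __init__(self):
        super().__init__()
        self.in_state_select = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("id") == "state":
            self.in_state_select = True
        elif tag == "option" and self.in_state_select and "value" in attrs:
            self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_state_select = False


# Hypothetical markup, for illustration only.
SAMPLE_HTML = """
<select id="city"><option value="x">Other</option></select>
<select id="state">
  <option value="">Select</option>
  <option value="9">Region A</option>
  <option value="12">Region B</option>
</select>
"""

parser = StateOptionParser()
parser.feed(SAMPLE_HTML)
# Drop the empty placeholder value before building requests.
state_ids = [value for value in parser.values if value]
print(state_ids)
```

In the spider itself, the equivalent filter would be a list comprehension over the result of `getall()`.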

Then you can yield a FormRequest to extract location information for that state/province, using the following example API call:

curl 'https://latam.sunglasshut.com/cl/js_cargarSelect2.php' -X POST -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' --data-raw 'edo_tienda=9&id_prov=1&query=D'

This becomes something like:

from scrapy.http import FormRequest

def parse(self, response):
  state_ids = ...
  for state_id in state_ids:
    url = response.url.split("/tienda", 1)[0] + "/js_cargarSelect2.php"
    # formdata values must be strings, hence "1" rather than 1.
    formdata = {"edo_tienda": state_id, "id_prov": "1", "query": "D"}
    yield FormRequest(url=url, method="POST", formdata=formdata, callback=self.parse_state_locations)
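The URL derivation and form encoding can be checked in isolation with the standard library. This is only a sketch: `endpoint_for` and `form_body` are hypothetical helper names, and only the Chile endpoint appears in the curl example, so the Colombia and Peru endpoints are assumed to follow the same pattern.

```python
from urllib.parse import urlencode

START_URLS = [
    "https://latam.sunglasshut.com/cl/tienda.php",
    "https://latam.sunglasshut.com/co/tienda.html",
    "https://latam.sunglasshut.com/pe/tienda.php",
]


def endpoint_for(start_url: str) -> str:
    """Replace the trailing /tienda.* page with the API script name."""
    return start_url.split("/tienda", 1)[0] + "/js_cargarSelect2.php"


def form_body(state_id: str) -> str:
    """Encode the POST fields as application/x-www-form-urlencoded."""
    return urlencode({"edo_tienda": state_id, "id_prov": "1", "query": "D"})


for start_url in START_URLS:
    print(endpoint_for(start_url))
print(form_body("9"))
```

For state ID 9 the encoded body matches the `--data-raw` string in the curl call, which is a quick way to confirm the request is built as intended before running the spider.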

Then you can parse the locations for each state roughly as follows (I can never remember the syntax for multiple assignments, but you'll get the idea):

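from locations.google_url import url_to_coords
from locations.items import Feature

def parse_state_locations(self, response):
  for location in response.json()["contenido"]:
    # Per the outline, each entry is a pipe-delimited string: name, address, map URL, ...
    fields = location.split("|", 3)
    name, address = fields[0:2]
    lat, lon = url_to_coords(fields[2])
    # Feature's field names are addr_full and lon rather than address and lng.
    properties = {
      "name": name,
      "addr_full": address,
      "lat": lat,
      "lon": lon,
    }
    yield Feature(**properties)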
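On the multiple-assignment syntax: Python unpacks a sequence into several names in one statement, and a slice's end index is exclusive, so `[0:1]` returns a single element and `[0:2]` is needed for two. A minimal sketch, using a made-up record (real "contenido" entries may have different fields or ordering):

```python
# Hypothetical pipe-delimited record, for illustration only.
record = "Sample Store|123 Example Street|https://maps.example/placeholder|extra|data"

# maxsplit=3 performs at most three splits, so any remaining pipes stay in the last part.
fields = record.split("|", 3)

only_name = fields[0:1]      # slice end is exclusive: one element
name, address = fields[0:2]  # two elements unpacked into two names
map_url = fields[2]

print(only_name)
print(name, address, map_url, sep="\n")
```

Unpacking raises a `ValueError` when the number of names and elements differ, which is a useful early signal if a record does not have the expected shape.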