davidhicks opened 1 year ago
Hi @davidhicks The data you've provided through the links is not in JSON
format. How do I extract the data?
Thanks.
This spider is of medium difficulty. A rough outline of the spider needed is:

```python
start_urls = [
    "https://latam.sunglasshut.com/cl/tienda.php",
    "https://latam.sunglasshut.com/co/tienda.html",
    "https://latam.sunglasshut.com/pe/tienda.php",
]
```
Then for each start_url page, the HTML needs to be parsed to get state/province IDs:
```python
def parse(self, response):
    state_ids = response.xpath('//select[@id="state"]/option/@value').getall()
```
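To make the selection concrete, here is a minimal offline sketch of the same extraction using only the standard library. The sample HTML and state IDs are invented; everything about the real page's markup beyond the `select[@id="state"]` element is an assumption:

```python
import xml.etree.ElementTree as ET

# Invented sample of the store finder's state dropdown (structure assumed).
SAMPLE_HTML = """<html><body>
<select id="state">
  <option value="9">Region Metropolitana</option>
  <option value="5">Valparaiso</option>
</select>
</body></html>"""

root = ET.fromstring(SAMPLE_HTML)
# Same idea as the XPath above: collect the value attribute of each option.
state_ids = [opt.get("value") for opt in root.findall(".//select[@id='state']/option")]
print(state_ids)  # ['9', '5']
```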
Then you can yield a FormRequest to extract location information for that state/province, using the following example API call:
```shell
curl 'https://latam.sunglasshut.com/cl/js_cargarSelect2.php' -X POST -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' --data-raw 'edo_tienda=9&id_prov=1&query=D'
```
This becomes something like:
```python
def parse(self, response):
    state_ids = ...
    for state_id in state_ids:
        url = response.url.split("/tienda", 1)[0] + "/js_cargarSelect2.php"
        # Note: formdata values must be strings, so "1" rather than 1.
        yield FormRequest(
            url=url,
            method="POST",
            formdata={"edo_tienda": state_id, "id_prov": "1", "query": "D"},
            callback=self.parse_state_locations,
        )
```
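FormRequest URL-encodes the formdata dict into the request body. A quick standard-library check that this encoding reproduces the `--data-raw` payload from the curl example:

```python
from urllib.parse import urlencode

# The same form fields as the curl example; all values as strings.
payload = {"edo_tienda": "9", "id_prov": "1", "query": "D"}
body = urlencode(payload)
print(body)  # edo_tienda=9&id_prov=1&query=D
```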
Then you can parse the locations for each state along these lines:

```python
from locations.google_url import url_to_coords
from locations.items import Feature


def parse_state_locations(self, response):
    for location in response.json()["contenido"]:
        # Each entry is pipe-delimited: name, address, then a Google Maps URL.
        name, address, maps_url = location.split("|", 3)[0:3]
        lat, lon = url_to_coords(maps_url)
        properties = {
            "name": name,
            "addr_full": address,
            "lat": lat,
            "lon": lon,
        }
        yield Feature(**properties)
```
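For clarity, the pipe-splitting step can be factored into a small helper. The field order (name, address, maps URL, remainder) is an assumption inferred from the split indices above, and the sample entry is invented:

```python
def parse_location_entry(entry: str):
    """Split a pipe-delimited location entry into (name, address, maps_url).

    Assumes the field order name|address|maps_url|..., per the
    split("|", 3) indices used in parse_state_locations.
    """
    name, address, maps_url = entry.split("|", 3)[0:3]
    return name, address, maps_url


# Invented example entry, for illustration only.
print(parse_location_entry("Store Name|Av. Example 123|https://maps.google.com/?q=-33.42,-70.61|rest"))
```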
**Brand name**
Sunglass Hut

**Wikidata ID**
Q136311

**Store finder url(s)**
- Chile: https://latam.sunglasshut.com/cl/tienda.php
- Colombia: https://latam.sunglasshut.com/co/tienda.html
- Peru: https://latam.sunglasshut.com/pe/tienda.php