alltheplaces / alltheplaces

A set of spiders and scrapers to extract location information from places that post their location on the internet.
https://www.alltheplaces.xyz

Ace & Tate (Sitemap, Structured Data, Bot Protection) #9500

Open CloCkWeRX opened 2 months ago

CloCkWeRX commented 2 months ago

Brand name

Ace & Tate

Wikidata ID

Q110516413 https://www.wikidata.org/wiki/Q110516413 https://www.wikidata.org/wiki/Special:EntityData/Q110516413.json

Store finder url(s)

https://www.aceandtate.com/nl-en/stores

Official Url(s)

https://www.aceandtate.com/

Sample store page url

https://www.aceandtate.com/nl-en/stores/netherlands/amsterdam/van-woustraat-67-h

Countries?

Multiple

Difficulty?

None

Number of POI?

70?

Behaviours

CloCkWeRX commented 2 months ago

In theory, something like this should work:

from scrapy.spiders import SitemapSpider

from locations.settings import DEFAULT_PLAYWRIGHT_SETTINGS
from locations.structured_data_spider import StructuredDataSpider
from locations.user_agents import BROWSER_DEFAULT


class AceAndTateSpider(SitemapSpider, StructuredDataSpider):
    name = "ace_and_tate"
    item_attributes = {"brand": "Ace & Tate", "brand_wikidata": "Q110516413"}
    # Discover store pages via the sitemap referenced from robots.txt.
    sitemap_urls = ["https://www.aceandtate.com/robots.txt"]
    # Example: https://www.aceandtate.com/nl-en/stores/netherlands/amsterdam/van-woustraat-67-h
    sitemap_rules = [(r"/stores/[\w-]+/[\w-]+/[\w-]+$", "parse")]
    # Only keep the "Optician" structured data entries found on each store page.
    wanted_types = ["Optician"]
    # Render pages with Playwright and a browser user agent to look less like a bot.
    is_playwright_spider = True
    custom_settings = DEFAULT_PLAYWRIGHT_SETTINGS
    user_agent = BROWSER_DEFAULT

However, I get bot protection behaviour even when requesting robots.txt.
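
As a quick sanity check outside Scrapy, fetching robots.txt with a plain HTTP client shows whether the block is unconditional. This is just an illustrative snippet; the exact response (403/429, or a JavaScript challenge page) is an assumption about whatever protection sits in front of the site.

# Illustrative check only: see what a non-browser client gets back from robots.txt.
# The specific status code / challenge page is an assumption about the site's protection.
import requests

resp = requests.get(
    "https://www.aceandtate.com/robots.txt",
    headers={"User-Agent": "Mozilla/5.0"},  # browser-like UA; adjust as needed
    timeout=30,
)
print(resp.status_code)  # e.g. 403/429 would point at bot protection
print(resp.headers.get("server"), resp.headers.get("cf-ray"))  # CDN hints, if present
print(resp.text[:500])  # an interstitial/challenge page body is another sign

If the sitemap itself turns out to be reachable, pointing sitemap_urls straight at it rather than at robots.txt would avoid one blocked request; failing that, the spider probably needs to be flagged the way other bot-protected spiders in the repo are (requires_proxy = True, if I recall the convention correctly), though I haven't confirmed that is enough here.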