CloCkWeRX opened this issue 3 months ago
In theory, this should work:

```python
from scrapy.spiders import SitemapSpider

from locations.settings import DEFAULT_PLAYWRIGHT_SETTINGS
from locations.structured_data_spider import StructuredDataSpider
from locations.user_agents import BROWSER_DEFAULT


class AceAndTateSpider(SitemapSpider, StructuredDataSpider):
    name = "ace_and_tate"
    sitemap_urls = ["https://www.aceandtate.com/robots.txt"]
    # Example: https://www.aceandtate.com/nl-en/stores/netherlands/amsterdam/van-woustraat-67-h
    sitemap_rules = [(r"/stores/[\w-]+/[\w-]+/[\w-]+$", "parse")]
    item_attributes = {"brand": "Ace & Tate", "brand_wikidata": "Q110516413"}
    wanted_types = ["Optician"]
    user_agent = BROWSER_DEFAULT
    is_playwright_spider = True
    custom_settings = DEFAULT_PLAYWRIGHT_SETTINGS  # actually apply the imported Playwright settings
```
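As a quick standalone sanity check that the sitemap rule matches the store-page pattern (using only the sample URL quoted in this issue):

```python
import re

# Same pattern as in sitemap_rules above
SITEMAP_RULE = re.compile(r"/stores/[\w-]+/[\w-]+/[\w-]+$")

# Sample store page URL from this issue
url = "https://www.aceandtate.com/nl-en/stores/netherlands/amsterdam/van-woustraat-67-h"
assert SITEMAP_RULE.search(url) is not None
print("sitemap rule matches:", url)
```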
However, I get bot-protection behaviour even when requesting the robots.txt.
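One minimal way to confirm this outside Scrapy is to fetch robots.txt directly with a browser-like User-Agent. This is just a sketch using the `requests` library (not part of the alltheplaces stack), and the header value is an arbitrary example:

```python
import requests

# Probe robots.txt with a browser-like User-Agent to see whether the
# bot protection blocks plain HTTP clients as well.
resp = requests.get(
    "https://www.aceandtate.com/robots.txt",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"},
    timeout=10,
)
print(resp.status_code)   # 403/503 here usually indicates a bot-protection challenge
print(resp.text[:200])    # an HTML challenge page instead of plain robots.txt confirms it
```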
Brand name
Ace & Tate
Wikidata ID
Q110516413 (https://www.wikidata.org/wiki/Q110516413, JSON: https://www.wikidata.org/wiki/Special:EntityData/Q110516413.json)

Store finder URL(s)
https://www.aceandtate.com/nl-en/stores

Official URL(s)
https://www.aceandtate.com/

`pipenv run scrapy sf --brand-wikidata=Q110516413 https://www.aceandtate.com/`
Sample store page URL
https://www.aceandtate.com/nl-en/stores/netherlands/amsterdam/van-woustraat-67-h
Countries?
Multiple
Difficulty?
None
Number of POI?
70?
Behaviours
- `pipenv run scrapy sd (specific page url)` or validator has content
- `pipenv run scrapy sitemap (url)`