Closed: CloCkWeRX closed this issue 2 months ago
2024-08-04 22:34:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'locations.middlewares.cdnstats.CDNStatsMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-08-04 22:34:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'locations.middlewares.track_sources.TrackSourcesMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-08-04 22:34:24 [scrapy.middleware] INFO: Enabled item pipelines:
['locations.pipelines.duplicates.DuplicatesPipeline',
'locations.pipelines.drop_attributes.DropAttributesPipeline',
'locations.pipelines.apply_spider_level_attributes.ApplySpiderLevelAttributesPipeline',
'locations.pipelines.apply_spider_name.ApplySpiderNamePipeline',
'locations.pipelines.country_code_clean_up.CountryCodeCleanUpPipeline',
'locations.pipelines.state_clean_up.StateCodeCleanUpPipeline',
'locations.pipelines.address_clean_up.AddressCleanUpPipeline',
'locations.pipelines.phone_clean_up.PhoneCleanUpPipeline',
'locations.pipelines.email_clean_up.EmailCleanUpPipeline',
'locations.pipelines.extract_gb_postcode.ExtractGBPostcodePipeline',
'locations.pipelines.assert_url_scheme.AssertURLSchemePipeline',
'locations.pipelines.drop_logo.DropLogoPipeline',
'locations.pipelines.closed.ClosePipeline',
'locations.pipelines.apply_nsi_categories.ApplyNSICategoriesPipeline',
'locations.pipelines.check_item_properties.CheckItemPropertiesPipeline',
'locations.pipelines.count_categories.CountCategoriesPipeline',
'locations.pipelines.count_brands.CountBrandsPipeline',
'locations.pipelines.count_operators.CountOperatorsPipeline']
2024-08-04 22:34:24 [scrapy.core.engine] INFO: Spider opened
2024-08-04 22:34:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-08-04 22:34:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maps.locations.trulynolen.com/robots.txt> (referer: None)
2024-08-04 22:34:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maps.locations.trulynolen.com/api/getAsyncLocations?template=search&level=search&lat=0&lng=0> (referer: None)
2024-08-04 22:34:25 [scrapy.core.scraper] ERROR: Spider error processing <GET https://maps.locations.trulynolen.com/api/getAsyncLocations?template=search&level=search&lat=0&lng=0> (referer: None)
Traceback (most recent call last):
File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/utils/defer.py", line 279, in iter_errback
yield next(it)
^^^^^^^^
File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
return next(self.data)
^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
return next(self.data)
^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
for r in iterable:
File "/workspaces/alltheplaces/locations/middlewares/track_sources.py", line 29, in process_spider_output
for item in result or []:
File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
for r in iterable:
File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/spidermiddlewares/referer.py", line 352, in <genexpr>
return (self._set_referer(r, response) for r in result or ())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
for r in iterable:
File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/spidermiddlewares/urllength.py", line 27, in <genexpr>
return (r for r in result or () if self._filter(r, spider))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
for r in iterable:
File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/spidermiddlewares/depth.py", line 31, in <genexpr>
return (r for r in result or () if self._filter(r, response, spider))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
for r in iterable:
File "/workspaces/alltheplaces/locations/storefinders/rio_seo.py", line 41, in parse
data = json.loads("[{}]".format(Selector(text=map_list).xpath("//div/text()").get()[:-1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.pyenv/versions/3.11.9/lib/python3.11/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.pyenv/versions/3.11.9/lib/python3.11/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.pyenv/versions/3.11.9/lib/python3.11/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 6 column 21 (char 91)
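The failing line in `rio_seo.py` wraps the text extracted from a `<div>` in `"[{}]"` and strips a trailing comma with `[:-1]`, then hands the result to `json.loads`. A minimal standalone reproduction of that failure mode (the sample response strings here are assumptions, not the actual API output):

```python
import json

# Shape the parser assumes: a comma-terminated JSON object inside the div.
expected = '{"name": "Truly Nolen"},'
# Strip the trailing comma and wrap in a list, as rio_seo.py does.
assert json.loads("[{}]".format(expected[:-1])) == [{"name": "Truly Nolen"}]

# If the div holds something else (e.g. an HTML fragment rather than JSON),
# json.loads raises the "Expecting value" error seen in the log above.
unexpected = "<span>No locations found</span>,"
try:
    json.loads("[{}]".format(unexpected[:-1]))
except json.JSONDecodeError as e:
    print("JSONDecodeError:", e.msg)
```

So the traceback indicates the endpoint returned markup that did not contain the JSON fragment the storefinder expects, rather than a bug in `json.loads` itself.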
I think the autogenerator is a bit off for this storefinder. I ran into the same JSON error using unchanged autogenerator output for #9314. It worked when I removed `start_urls` and replaced it with `end_point`.
Yeah, that's possible - there are some minor tweaks to store finders on that branch, and some of the detection rules are a tiny bit wonky around the maps marker ones. I'll update these issues in a bit to reflect what we know.
Fetched 2 brands/shop/pest_control from NSI. Missing by wikidata: 1

Brand name: Truly Nolen (pest control, termite control and exterminator)
Wikidata ID: Q7847671 (https://www.wikidata.org/wiki/Q7847671, https://www.wikidata.org/wiki/Special:EntityData/Q7847671.json)
Store finder URL(s)
Official URL(s): http://www.trulynolen.com/

pipenv run scrapy sf --brand-wikidata=Q7847671 http://www.trulynolen.com/
https://locations.trulynolen.com/

```python
class TrulyNolenUSSpider(RioSeoSpider):
    name = "truly_nolen_us"
    item_attributes = {
        "brand_wikidata": "Q7847671",
        "brand": "Truly Nolen",
    }
    allowed_domains = [
        "maps.locations.trulynolen.com",
    ]
    start_urls = [
        "https://maps.locations.trulynolen.com/api/getAsyncLocations?template=search&level=search&lat=0&lng=0",
    ]
```
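As noted in the discussion, swapping `start_urls` for `end_point` made the autogenerated output work for #9314. A minimal sketch of that shape; `RioSeoSpider` is stubbed here so the example runs standalone (the real class lives in `locations/storefinders/rio_seo.py`), and the exact `end_point` value it expects is an assumption:

```python
# Stand-in for locations.storefinders.rio_seo.RioSeoSpider so this sketch
# is self-contained; the real base class builds the getAsyncLocations
# request from end_point itself.
class RioSeoSpider:
    end_point = None

# Hedged sketch: the spider with start_urls removed and end_point set
# instead, per the workaround described above.
class TrulyNolenUSSpider(RioSeoSpider):
    name = "truly_nolen_us"
    item_attributes = {"brand_wikidata": "Q7847671", "brand": "Truly Nolen"}
    end_point = "https://maps.locations.trulynolen.com"
```

With this shape the storefinder base class, not the spider, is responsible for constructing the query URL, which avoids hard-coding the `getAsyncLocations` parameters that tripped up the autogenerator.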