alltheplaces / alltheplaces

A set of spiders and scrapers to extract location information from places that post their location on the internet.
https://www.alltheplaces.xyz

Truly Nolen (RioSeoSpider) #9286

Closed CloCkWeRX closed 2 months ago

CloCkWeRX commented 2 months ago

Fetched 2 brands/shop/pest_control from NSI
Missing by wikidata: 1

Brand name

Truly Nolen

pest control, termite control and exterminator

Wikidata ID

Q7847671 https://www.wikidata.org/wiki/Q7847671 https://www.wikidata.org/wiki/Special:EntityData/Q7847671.json

Store finder url(s)

Official Url(s): http://www.trulynolen.com/

class TrulyNolenUSSpider(RioSeoSpider):
    name = "truly_nolen_us"
    item_attributes = {
        "brand_wikidata": "Q7847671",
        "brand": "Truly Nolen",
    }
    allowed_domains = [
        "maps.locations.trulynolen.com",
    ]
    start_urls = [
        "https://maps.locations.trulynolen.com/api/getAsyncLocations?template=search&level=search&lat=0&lng=0",
    ]

CloCkWeRX commented 2 months ago

2024-08-04 22:34:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'locations.middlewares.cdnstats.CDNStatsMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-08-04 22:34:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'locations.middlewares.track_sources.TrackSourcesMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-08-04 22:34:24 [scrapy.middleware] INFO: Enabled item pipelines:
['locations.pipelines.duplicates.DuplicatesPipeline',
 'locations.pipelines.drop_attributes.DropAttributesPipeline',
 'locations.pipelines.apply_spider_level_attributes.ApplySpiderLevelAttributesPipeline',
 'locations.pipelines.apply_spider_name.ApplySpiderNamePipeline',
 'locations.pipelines.country_code_clean_up.CountryCodeCleanUpPipeline',
 'locations.pipelines.state_clean_up.StateCodeCleanUpPipeline',
 'locations.pipelines.address_clean_up.AddressCleanUpPipeline',
 'locations.pipelines.phone_clean_up.PhoneCleanUpPipeline',
 'locations.pipelines.email_clean_up.EmailCleanUpPipeline',
 'locations.pipelines.extract_gb_postcode.ExtractGBPostcodePipeline',
 'locations.pipelines.assert_url_scheme.AssertURLSchemePipeline',
 'locations.pipelines.drop_logo.DropLogoPipeline',
 'locations.pipelines.closed.ClosePipeline',
 'locations.pipelines.apply_nsi_categories.ApplyNSICategoriesPipeline',
 'locations.pipelines.check_item_properties.CheckItemPropertiesPipeline',
 'locations.pipelines.count_categories.CountCategoriesPipeline',
 'locations.pipelines.count_brands.CountBrandsPipeline',
 'locations.pipelines.count_operators.CountOperatorsPipeline']
2024-08-04 22:34:24 [scrapy.core.engine] INFO: Spider opened
2024-08-04 22:34:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-08-04 22:34:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maps.locations.trulynolen.com/robots.txt> (referer: None)
2024-08-04 22:34:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maps.locations.trulynolen.com/api/getAsyncLocations?template=search&level=search&lat=0&lng=0> (referer: None)
2024-08-04 22:34:25 [scrapy.core.scraper] ERROR: Spider error processing <GET https://maps.locations.trulynolen.com/api/getAsyncLocations?template=search&level=search&lat=0&lng=0> (referer: None)
Traceback (most recent call last):
  File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/utils/defer.py", line 279, in iter_errback
    yield next(it)
          ^^^^^^^^
  File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
    return next(self.data)
           ^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
    return next(self.data)
           ^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/workspaces/alltheplaces/locations/middlewares/track_sources.py", line 29, in process_spider_output
    for item in result or []:
  File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/spidermiddlewares/referer.py", line 352, in <genexpr>
    return (self._set_referer(r, response) for r in result or ())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/spidermiddlewares/urllength.py", line 27, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/spidermiddlewares/depth.py", line 31, in <genexpr>
    return (r for r in result or () if self._filter(r, response, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/share/virtualenvs/alltheplaces-7i_IfSds/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/workspaces/alltheplaces/locations/storefinders/rio_seo.py", line 41, in parse
    data = json.loads("[{}]".format(Selector(text=map_list).xpath("//div/text()").get()[:-1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.pyenv/versions/3.11.9/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.pyenv/versions/3.11.9/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.pyenv/versions/3.11.9/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 6 column 21 (char 91)
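
The failing frame is line 41 of `rio_seo.py`, which takes the text of a `div`, drops its last character, wraps the rest in `[...]` and hands it to `json.loads`. Below is a minimal sketch of that step using only the standard library. The helper name `parse_map_list` and both payloads are hypothetical. The real response body is not shown in this issue, so the broken payload only illustrates one way to get an "Expecting value" error and is not a diagnosis of this particular response.

```python
import json


def parse_map_list(div_text: str) -> list:
    """Mirror the expression from the traceback: drop the final character
    (assumed to be a trailing comma) and wrap the rest in brackets so the
    text parses as a JSON array."""
    return json.loads("[{}]".format(div_text[:-1]))


# Hypothetical payload in the shape the parser appears to expect:
# comma-terminated JSON objects.
good = '{"fid": "1", "lat": "25.7"},{"fid": "2", "lat": "26.1"},'
locations = parse_map_list(good)

# Hypothetical payload where a value is missing partway through.
bad = '{"fid": "1", "lat": },'
try:
    parse_map_list(bad)
    failed = False
    message = None
except json.JSONDecodeError as err:
    failed = True
    message = err.msg  # the short reason, without line/column details
```

If the endpoint returns text in any other shape than the one this expression assumes, the same `json.loads` call fails in the same way as in the log.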

impiaaa commented 2 months ago

I think the autogenerator is a bit off for this storefinder. I ran into the same JSON error using unchanged autogenerator output for #9314. It worked when I removed start_urls and replaced it with end_point.
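
For reference, the change described above might look like the sketch below. The `RioSeoSpider` class here is only a stand-in so the snippet runs outside the repository (the real one lives in `locations/storefinders/rio_seo.py`). The `end_point` value is an assumption: it is the scheme and host taken from the generated `start_urls` entry.

```python
class RioSeoSpider:
    """Stand-in for locations.storefinders.rio_seo.RioSeoSpider so this
    sketch is self-contained. How the real class turns end_point into a
    request URL is not shown in this thread."""

    end_point = None


class TrulyNolenUSSpider(RioSeoSpider):
    name = "truly_nolen_us"
    item_attributes = {
        "brand_wikidata": "Q7847671",
        "brand": "Truly Nolen",
    }
    # start_urls removed; end_point used instead.
    # Assumed value: scheme and host of the generated start URL.
    end_point = "https://maps.locations.trulynolen.com"
```

In the repository the class would import the real `RioSeoSpider` rather than define a stand-in.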

CloCkWeRX commented 2 months ago

Yeah, that's possible - there are some minor tweaks to store finders on that branch, or some detection rules which are a tiny bit wonky around the maps marker ones. I'll update these issues in a bit to reflect what we know.