algolia / docsearch-scraper

DocSearch - Scraper
https://docsearch.algolia.com/

Getting Unreachable hosts error when trying to scrape data #576

Open beeena opened 1 year ago

beeena commented 1 year ago

I'm trying to scrape data using the following command.

docker run -it --env-file=./config/development/dev.env -e "CONFIG=$(cat ./config/config.json | jq -r tostring)" algolia/docsearch-scraper

I have verified that the API key and App ID are correct, but I still get an "Unreachable hosts" error.
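One thing worth ruling out, though nothing in this thread confirms it as the cause: `docker run --env-file` passes values through literally, so quotes or stray spaces around a value in dev.env become part of the value itself. The sketch below flags such cases. It assumes the scraper reads `APPLICATION_ID` and `API_KEY` from that file, and the helper names `parse_env_file` and `find_problems` are hypothetical.

```python
# Assumed variable names; adjust if your env file uses different ones.
REQUIRED_KEYS = ("APPLICATION_ID", "API_KEY")


def parse_env_file(text):
    """Split KEY=VALUE lines, keeping each value verbatim (quotes and spaces included)."""
    values = {}
    for line in text.splitlines():
        # Skip blank lines and comments.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value
    return values


def find_problems(values):
    """Return human-readable warnings for values that look malformed."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in values:
            problems.append(f"{key} is missing")
            continue
        value = values[key]
        stripped = value.strip()
        if not stripped:
            problems.append(f"{key} is empty")
            continue
        if value != stripped:
            problems.append(f"{key} has leading or trailing whitespace")
        if stripped[0] in "\"'" or stripped[-1] in "\"'":
            problems.append(f"{key} is wrapped in quotes")
    return problems
```

To run it against the file from the command above, read `./config/development/dev.env` as text and pass the contents to `find_problems(parse_env_file(...))`. An empty list means none of these particular problems were found, not that the credentials are valid.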

Error

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/src/index.py", line 119, in <module>
    run_config(environ['CONFIG'])
  File "/root/src/index.py", line 45, in run_config
    config.query_rules
  File "/root/src/algolia_helper.py", line 21, in __init__
    self.index_name_tmp
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/algoliasearch/search_client.py", line 127, in copy_rules
    return self.copy_index(src_index_name, dst_index_name, request_options)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/algoliasearch/search_client.py", line 94, in copy_index
    request_options,
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/algoliasearch/http/transporter.py", line 35, in write
    return self.request(verb, hosts, path, data, request_options, timeout)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/algoliasearch/http/transporter.py", line 72, in request
    return self.retry(hosts, request, relative_url)
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/algoliasearch/http/transporter.py", line 94, in retry
    raise AlgoliaUnreachableHostException("Unreachable hosts")
algoliasearch.exceptions.AlgoliaUnreachableHostException: Unreachable hosts
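The exception is raised from `retry` in `transporter.py`, presumably after every host the client knows about has failed. A minimal sketch for checking whether those hosts resolve at all is given below. The hostname pattern is an assumption about how the Python client derives hosts from the App ID, and `candidate_hosts`, `resolves`, and `report` are hypothetical helper names.

```python
import socket


def candidate_hosts(app_id):
    """Hostnames the client is assumed to try for a given application ID."""
    return [
        f"{app_id}-dsn.algolia.net",
        f"{app_id}.algolia.net",
        f"{app_id}-1.algolianet.com",
        f"{app_id}-2.algolianet.com",
        f"{app_id}-3.algolianet.com",
    ]


def resolves(host):
    """Return True if the hostname resolves via DNS from this machine."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except (socket.gaierror, UnicodeError):
        # Resolution failed, or the name could not be encoded as a hostname.
        return False


def report(app_id):
    """Map each candidate host to whether it resolves."""
    return {host: resolves(host) for host in candidate_hosts(app_id)}
```

Calling `report("YOUR_APP_ID")` with the real App ID shows which hosts fail to resolve. If none resolve, the ID the client receives may be malformed, or DNS or outbound access may be blocked. Results on the host machine can differ from those inside the Docker container, so running the check in the same environment as the scraper is more informative.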

config.json


{
    "index_name": "dev_RESORTIFI_HELP",
    "start_urls": [
      "https://help.resortifi.com/"
    ],
    "sitemap_urls": [
      "https://help.resortifi.com/sitemap.xml"
    ],
    "sitemap_alternate_links": true,
    "stop_urls": [
      "/tests"
    ],
    "selectors": {
      "lvl0": {
        "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
        "type": "xpath",
        "global": true,
        "default_value": "Documentation"
      },
      "lvl1": "header h1",
      "lvl2": "article h2",
      "lvl3": "article h3",
      "lvl4": "article h4",
      "lvl5": "article h5, article td:first-child",
      "lvl6": "article h6",
      "text": "article p, article li, article td:last-child"
    },
    "strip_chars": " .,;:#",
    "custom_settings": {
      "separatorsToIndex": "_",
      "attributesForFaceting": [
        "language",
        "version",
        "type",
        "docusaurus_tag"
      ],
      "attributesToRetrieve": [
        "hierarchy",
        "content",
        "anchor",
        "url",
        "url_without_anchor",
        "type"
      ]
    },
    "conversation_id": [
      "833762294"
    ],
    "nb_hits": 46250
}