algolia / docsearch-scraper

DocSearch - Scraper
https://docsearch.algolia.com/

Getting only 1 Nb hit while running from Docker #571

Closed cybersaksham closed 1 year ago

cybersaksham commented 1 year ago

I am running this command:

docker run -it --env-file=.env -e "CONFIG=$(cat ./config.json | jq -r tostring)" algolia/docsearch-scraper

I get the following output:

> DocSearch: https://portfolio-generator.cybersaksham.co.in/ 1 records)

Nb hits: 1

It crawls only the home page, although the site also has documentation pages. Please tell me how to solve this. Thanks!

And my config.json contains:

{
  "index_name": "portfolio-generator",
  "start_urls": ["https://portfolio-generator.cybersaksham.co.in/"],
  "sitemap_urls": [
    "https://portfolio-generator.cybersaksham.co.in/sitemap.xml"
  ],
  "sitemap_alternate_links": true,
  "stop_urls": ["/tests"],
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": ["language", "version", "type", "docusaurus_tag"],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "conversation_id": ["833762294"],
  "nb_hits": 9000
}
jackm767 commented 1 year ago
  "js_render": true,
  "js_wait": true,

These options in your config may be what you are missing. I had the same issue.
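
For illustration, here is a minimal sketch of where the two suggested keys would sit in the config.json from the original post. They go at the top level, next to `index_name` and `start_urls`. The `selectors`, `custom_settings`, and other keys are omitted here for brevity and are assumed to stay unchanged. The values are copied as suggested in this comment. As far as I know, the scraper's configuration reference describes `js_wait` as a wait time in seconds, so a number may be the more conventional value. Treat that as an assumption to check against the docs.

```json
{
  "index_name": "portfolio-generator",
  "start_urls": ["https://portfolio-generator.cybersaksham.co.in/"],
  "sitemap_urls": [
    "https://portfolio-generator.cybersaksham.co.in/sitemap.xml"
  ],
  "js_render": true,
  "js_wait": true
}
```

After editing the file, the same `docker run` command from the original post can be rerun to check whether `Nb hits` increases.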

cybersaksham commented 1 year ago

I am speechless. I already tried this solution yesterday but didn't get the desired result. When I tried it again today, it worked. Maybe I messed up some other options last time. Thank you for the solution.

But I have one more problem now.

When I search for "gallery", I get 6 results that look identical. I found the cause: I have a sidebar category named Website Gallery with 6 items inside it, so the search matches all 6.

Can you please tell me how I can ignore sidebar headings in config.json? All attributes are the same as in my original config, except for the 2 that you suggested.

cybersaksham commented 1 year ago

I am closing this issue because the original problem is solved. The new problem (https://github.com/algolia/docsearch-scraper/issues/571#issuecomment-1368670652) is tracked in a new issue, #572.