algolia / docsearch-scraper

DocSearch - Scraper
https://docsearch.algolia.com/

Docsearch doesn't update index correctly #543

Closed ArthurFlag closed 3 years ago

ArthurFlag commented 3 years ago

When I run docsearch for indexation using the Docker image, using a clean index, I get roughly 8000 hits.

When I run it again, in the exact same way, I get more than 10k hits and I exceed my quota. How could 2 indexations lead to different hits?

This means I have to clean my index every time I want to index.

shortcuts commented 3 years ago

Hi,

This could be due to some client-side rendering on your website.

You could try with these options to see if it improves your results.

"js_render": true,
"js_wait": 1

Feel free to send me a gist with your config file so I can take a look at it!

ArthurFlag commented 3 years ago

Hi Shortcuts, thanks for the quick answer.

I just tried adding these options to my config, and it doesn't index anything now? 🤔

Neither start url nor regex: default, we scrap all
Getting http://developers.talon.one/sitemap.xml from selenium

Crawling issue: nbHits 0 for docs

I run docsearch with

docker run -it --env-file=.env -e \
    "CONFIG=$(cat scripts/docsearch-scraper/config.json | jq -r tostring)" \
    algolia/docsearch-scraper
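
For readers unfamiliar with the jq step in the command above: it passes the whole config to the container as a single-line JSON string in the CONFIG variable. A minimal Python sketch of roughly the same transformation is shown here; the function name compact_config and the sample text are made up for illustration, and the sketch is not part of the scraper.

```python
import json


def compact_config(text):
    """Parse JSON text and re-serialize it on one line without extra spaces.

    Raises json.JSONDecodeError if the text is not valid JSON, which also
    makes this a quick sanity check before starting a crawl.
    """
    return json.dumps(json.loads(text), separators=(",", ":"))


# Illustrative sample only, not the full config from this thread.
sample = '{\n  "index_name": "docs",\n  "use_anchors": true\n}'
print(compact_config(sample))  # {"index_name":"docs","use_anchors":true}
```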

My config looks like this:

{
  "index_name": "docs",
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "lvl3": "p"
  },
  "use_anchors": true,
  "sitemap_urls": [
    "http://mysite.com/sitemap.xml"
  ],
  "": [
    "/"
  ],
  "force_sitemap_urls_crawling": true
}

Any clue?

shortcuts commented 3 years ago

Hi @ArthurFlageul,

I just ran 10 crawl tasks with the following config and consistently got 13141 hits:

{
  "index_name": "docs",
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "lvl3": "p"
  },
  "js_render": true,
  "sitemap_urls": [
    "http://developers.talon.one/sitemap.xml"
  ],
  "start_urls": [
    "http://developers.talon.one/"
  ],
  "force_sitemap_urls_crawling": true,
  "nb_hits": 13141
}

Could you please try it on your side?

ArthurFlag commented 3 years ago

Running this config file leads to more than 10k hits, which is not what I get when I clean my index and run the first indexation. I should get 8018 hits every time.

From my test and your test:

Any other clue?

shortcuts commented 3 years ago

You'd get ~8018 hits if you remove the js_wait and js_render keys, but that could lead to inconsistencies in your search results, because you're not waiting for client-side rendered pages to load, as demonstrated here:

When I run it again, in the exact same way, I get more than 10k hits and I exceed my quota. How could 2 indexations lead to different hits?

You can decide to exclude all these pages, if you know which ones are client-side rendered (see stop_urls).
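
A minimal sketch of what adding stop_urls could look like, written in Python so the patterns can be checked against sample URLs before a crawl. The two patterns and the helper names are hypothetical, and the check assumes the scraper treats each stop_urls entry as a regular expression matched anywhere in the URL; treat it as a preview, not as the scraper's own logic.

```python
import json
import re

# Hypothetical patterns: replace them with the paths of pages you know are
# rendered client-side. They are not taken from this thread.
STOP_URLS = [r"/hypothetical-dashboard/", r"/hypothetical-api-explorer/"]


def is_excluded(url, patterns=STOP_URLS):
    """Return True if any pattern matches somewhere in the URL (assumed behavior)."""
    return any(re.search(pattern, url) for pattern in patterns)


def with_stop_urls(config, patterns=STOP_URLS):
    """Return a copy of a config dict with a stop_urls key added."""
    updated = dict(config)
    updated["stop_urls"] = list(patterns)
    return updated


base = {"index_name": "docs", "sitemap_urls": ["http://mysite.com/sitemap.xml"]}
print(json.dumps(with_stop_urls(base), indent=2))
print(is_excluded("http://mysite.com/hypothetical-dashboard/index.html"))  # True
print(is_excluded("http://mysite.com/docs/getting-started"))  # False
```

The printed JSON can be merged into the config file that is passed to the container.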

Also, more precise selectors would help reduce unwanted hits, e.g. "lvl3": "p" -> "lvl3": "section p" = 6 hits instead of 17 on the landing page.
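
To illustrate why the narrower selector yields fewer records, here is a small sketch that counts how many elements "p" and "section p" would match on a page. The sample HTML is invented and much smaller than the real landing page, so the counts differ from the 6 and 17 quoted above; the counter only approximates the two selectors with Python's standard HTML parser.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Counts <p> elements overall and those nested inside a <section>."""

    def __init__(self):
        super().__init__()
        self.section_depth = 0  # number of <section> elements currently open
        self.all_p = 0          # elements the selector "p" would match
        self.section_p = 0      # elements the selector "section p" would match

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self.section_depth += 1
        elif tag == "p":
            self.all_p += 1
            if self.section_depth > 0:
                self.section_p += 1

    def handle_endtag(self, tag):
        if tag == "section" and self.section_depth > 0:
            self.section_depth -= 1


def count_paragraphs(html):
    """Return (matches for "p", matches for "section p")."""
    parser = ParagraphCounter()
    parser.feed(html)
    parser.close()
    return parser.all_p, parser.section_p


# Hypothetical page: one paragraph in the header, two in the main section,
# one in the footer.
SAMPLE = """
<header><p>Sign in</p></header>
<section><h2>Guide</h2><p>First step.</p><p>Second step.</p></section>
<footer><p>Copyright notice</p></footer>
"""

print(count_paragraphs(SAMPLE))  # (4, 2)
```

Header and footer paragraphs are the kind of unwanted hits that a scoped selector leaves out.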

ArthurFlag commented 3 years ago

I see, thank you. I'll noodle around a bit more, and thanks for the selector hint!

shortcuts commented 3 years ago

No worries, feel free to let me know if you need more help!

Have a nice day