Closed ArthurFlag closed 3 years ago
Hi,
This could be due to some client-side rendering on your website.
You could try with these options to see if it improves your results.
"js_render": true,
"js_wait": 1
Feel free to send me a gist with your config file so I can take a look at it!
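For reference, here is a rough sketch of where those two keys would sit in a config file. The index name, sitemap URL and selectors are placeholders rather than your actual values, and js_wait is, as far as I know, a delay in seconds before the rendered page is scraped.

```json
{
  "index_name": "your_index_name",
  "sitemap_urls": [
    "https://example.com/sitemap.xml"
  ],
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "text": "p"
  },
  "js_render": true,
  "js_wait": 1
}
```

Both keys go at the top level of the config, next to index_name, not inside selectors.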
Hi Shortcuts, thanks for the quick answer.
I just tried adding these options to my config, and it doesn't index anything now? 🤔
Neither start url nor regex: default, we scrap all
Getting http://developers.talon.one/sitemap.xml from selenium
Crawling issue: nbHits 0 for docs
I run docsearch with
docker run -it --env-file=.env \
  -e "CONFIG=$(cat scripts/docsearch-scraper/config.json | jq -r tostring)" \
  algolia/docsearch-scraper
My config looks like this:
{
"index_name": "docs",
"selectors": {
"lvl0": "h1",
"lvl1": "h2",
"lvl2": "h3",
"lvl3": "p"
},
"use_anchors": true,
"sitemap_urls": [
"http://mysite.com/sitemap.xml"
],
"": [
"/"
],
"force_sitemap_urls_crawling": true
}
Any clue?
Hi @ArthurFlageul,
I just ran 10 crawl tasks with the following config and consistently got 13141 hits:
{
"index_name": "docs",
"selectors": {
"lvl0": "h1",
"lvl1": "h2",
"lvl2": "h3",
"lvl3": "p"
},
"js_render": true,
"sitemap_urls": [
"http://developers.talon.one/sitemap.xml"
],
"start_urls": [
"http://developers.talon.one/"
],
"force_sitemap_urls_crawling": true,
"nb_hits": 13141
}
Could you please try it on your side?
Running this config file leads to more than 10k hits, which is not what I get when I clean my index and run the first indexation. I should get 8018 hits every time.
From my test and your test:
js_wait seems to break the crawl somehow.
js_render doesn't seem to fix the indexing issue.
Any other clue?
You'd get ~8018 hits if you remove the js_wait and js_render keys, but it could lead to inconsistencies in your search/results as you're not waiting for client-side rendered pages to be loaded, as demonstrated here:
When I run it again, in the exact same way, I get more than 10k hits and I exceed my quota. How could 2 indexations lead to different hits?
You can decide to exclude all these pages with stop_urls, if you know which ones are client-side rendered.
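A minimal sketch of what that exclusion could look like. The two stop_urls entries are hypothetical paths, since I don't know which of your pages are client-side rendered; as far as I know, each entry is matched as a pattern against the page URL, and matching pages are skipped.

```json
{
  "index_name": "docs",
  "sitemap_urls": [
    "http://developers.talon.one/sitemap.xml"
  ],
  "start_urls": [
    "http://developers.talon.one/"
  ],
  "stop_urls": [
    "/hypothetical-client-rendered-section/",
    "/another-hypothetical-page"
  ],
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "lvl3": "p"
  },
  "force_sitemap_urls_crawling": true
}
```

With the client-rendered pages excluded, the crawl should no longer depend on js_render for those URLs, at the cost of not indexing them at all.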
Also, having more precise selectors would help reduce unwanted hits, e.g. "lvl3": "p" -> "lvl3": "section p" gives 6 hits instead of 17 on the landing page.
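A sketch of the selectors block with that change applied. Only lvl3 differs from the config earlier in the thread, and "section p" assumes the content you want indexed sits inside a section element, which seems to be the case on the landing page but may not hold for every page.

```json
{
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "lvl3": "section p"
  }
}
```

The narrower the selector, the fewer stray paragraphs (navigation, footers and so on) end up as records, which also helps with the quota.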
I see, thank you. I'll noodle around a bit more, and thanks for the selector hint!
No worries, feel free to let me know if you need more help!
Have a nice day
When I run docsearch for indexation using the Docker image, using a clean index, I get roughly 8000 hits.
When I run it again, in the exact same way, I get more than 10k hits and I exceed my quota. How could 2 indexations lead to different hits?
This means I have to clean my index every time I want to index.