Closed Mitcal closed 4 years ago
👋 @Mitcal
Thank you for raising the issue. You found both the issue and the workaround. Indeed, the cat command is not available on Windows.
I would still recommend inlining the configuration, as in the following example:
docker run -it --env-file=.env -e "CONFIG={\"index_name\":\"treaty\",\"start_urls\":[\"https://competent-lalande-599ab3.netlify.com/docs/homepage\"],\"selectors\":{\"lvl0\":\"#content header h1\",\"lvl1\":\"#content article h1\",\"lvl2\":\"#content section h3\",\"lvl3\":\"#content section h4\",\"lvl4\":\"#content section h5\",\"lvl5\":\"#content section h6\",\"text\":\"#content header p,#content section p,#content section ol\"}}" algolia/docsearch-scraper
You can also generate this command by using jq.
This is unfortunately something users have already encountered, and we are keeping this issue on our radar. However, it is a really low priority right now.
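For anyone who would rather avoid escaping the JSON by hand, here is a minimal Python sketch of the same idea. It assumes Python 3 and Docker are installed; the function names (`compact_config`, `build_docker_command`, `run_scraper`) are made up for illustration and are not part of the scraper. Because the arguments are passed as a list, no shell quoting is involved, so it should behave the same on Windows, Linux and macOS.

```python
import json
import subprocess
from pathlib import Path
from typing import List


def compact_config(config_text: str) -> str:
    """Return the JSON document on a single line, similar to `jq -r tostring`."""
    return json.dumps(json.loads(config_text), separators=(",", ":"))


def build_docker_command(config_text: str, env_file: str = ".env") -> List[str]:
    """Build the `docker run` argument list used in this thread."""
    return [
        "docker", "run", "-it",
        f"--env-file={env_file}",
        "-e", f"CONFIG={compact_config(config_text)}",
        "algolia/docsearch-scraper",
    ]


def run_scraper(config_path: str, env_file: str = ".env") -> int:
    """Read the config file and start the scraper; returns Docker's exit code."""
    config_text = Path(config_path).read_text(encoding="utf-8")
    # A list of arguments bypasses the shell, so the quotes need no escaping.
    return subprocess.run(build_docker_command(config_text, env_file)).returncode
```

Calling `run_scraper("/path/to/your/config.json")` from a terminal would then stand in for the `$(cat … | jq -r tostring)` one-liner.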
Oh my God, I spent more than two before finding your answer, and when I found it I was so happy. But here is the response I got when I executed your command, @s-pace:
`
C:\Users\computer\Documents\docusaurus-yt-example>docker run -it --env-file=.env -e "CONFIG={\"index_name\":\"treaty\",\"start_urls\":[\"https://competent-lalande-599ab3.netlify.com/docs/homepage\"],\"selectors\":{\"lvl0\":\"#content header h1\",\"lvl1\":\"#content article h1\",\"lvl2\":\"#content section h3\",\"lvl3\":\"#content section h4\",\"lvl4\":\"#content section h5\",\"lvl5\":\"#content section h6\",\"text\":\"#content header p,#content section p,#content section ol\"}}" algolia/docsearch-scraper
> Ignored: from start url https://competent-lalande-599ab3.netlify.app/docs/homepage
Crawling issue: nbHits 0 for treaty
C:\Users\computer\Documents\docusaurus-yt-example>
`
Glad to help. Are you sure the CSS selectors you have defined match any elements? Does your website require JS to be completely rendered? Let us know!
When you say CSS selectors, are you talking about this?
"selectors": {
  "lvl0": "#content header h1",
  "lvl1": "#content article h1",
  "lvl2": "#content section h3",
  "lvl3": "#content section h4",
  "lvl4": "#content section h5",
  "lvl5": "#content section h6",
  "text": "#content header p,#content section p,#content section ol"
}
I have been following this video on YouTube, and everything worked well up to this point.
Hey @sowousmane,
Here's a config I just made that seems to work for your website. However, we are not able to retrieve all the URLs; updating your sitemap.xml would solve this issue!
Does your website require JS to be completely rendered?
If your website requires some client-side rendering, make sure to also pass the js_render option
Yes, I was about to write that. By curling your website, you can see whether it requires JavaScript to be completely rendered. If it does, the elements don't exist in the raw HTML and the selectors match nothing.
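The curl check can be roughly approximated in Python. This is only a sketch under a few assumptions: it uses the standard library alone, so it counts plain tags (`h1`, `p`, ...) rather than evaluating real CSS selectors, and `count_tags` and `fetch_html` are hypothetical helper names. If the counts are all zero for a page that clearly shows headings in the browser, the content is probably rendered client-side.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TagCounter(HTMLParser):
    """Count how often each tag of interest appears in raw HTML."""

    def __init__(self, tags):
        super().__init__()
        self.counts = {tag: 0 for tag in tags}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in self.counts:
            self.counts[tag] += 1


def count_tags(html, tags=("h1", "h2", "h3", "p")):
    """Return a dict mapping each tag to its number of occurrences."""
    counter = TagCounter(tags)
    counter.feed(html)
    return counter.counts


def fetch_html(url):
    """Fetch the page without executing any JavaScript, comparable to curl."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `count_tags(fetch_html(start_url))` on one of your `start_urls` shows what the scraper can see without `js_render`.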
I faced the same issue and fixed it with the config.json file below, plus binding the port locally so that Docker can connect.
{
  "index_name": "test",
  "start_urls": [
    "http://host.docker.internal:3000"
  ],
  "selectors": {
    "lvl0": {
      "selector": "header h1, h1",
      "type": "css",
      "default_value": "Home"
    },
    "lvl1": {
      "selector": "article h1, h2",
      "type": "css"
    },
    "lvl2": {
      "selector": "section h3, h4",
      "type": "css"
    },
    "lvl3": {
      "selector": "section h4, h5",
      "type": "css"
    },
    "lvl4": {
      "selector": "section h5, h6",
      "type": "css"
    },
    "lvl5": {
      "selector": "section h6",
      "type": "css"
    },
    "text": {
      "selector": "header p, section p, section ol",
      "type": "css"
    }
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "type",
      "section",
      "level"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "nb_hits": 5
}
To bind the port, this command worked 🙂
python -m http.server --directory build --bind 0.0.0.0 3000
I don't know why, but it was hectic!
On the "Run Your Own" documentation page, there is a section called "Run the crawl from the Docker image".
I'm on a Windows machine (obviously) and new to Docker, and I could follow everything up to this point. But this line is for Linux / Mac users:
docker run -it --env-file=.env -e "CONFIG=$(cat /path/to/your/config.json | jq -r tostring)" algolia/docsearch-scraper
I tried various ways to be clever and tried something like this:
docker run -it --env-file=.env -e "CONFIG=$(jq -r tostring /path/to/your/config.json)" algolia/docsearch-scraper
Then I realised the problem is with the Windows Command Line, because it doesn't have an equivalent of the $() expression. There may be a way, as described on this page, but I couldn't figure it out.
In the end I managed to get around it by putting the JSON config straight into the .env file. First by running:
jq -r tostring /path/to/your/config.json
Then copy-paste the output from this into the .env file, like this:
CONFIG={"json":"config"}
I was hoping to find a tidy way to do this and propose an update to the documentation, but alas I am posting it to the issue forum instead. At least I found a workaround.
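The copy-paste step of this workaround could also be scripted. The following is a sketch only, assuming Python 3 is available; `config_env_line` and `update_env_file` are hypothetical names, and it relies on Docker's `--env-file` taking everything after `CONFIG=` literally, which matches the workaround described above.

```python
import json
from pathlib import Path


def config_env_line(config_text):
    """Turn a JSON document into a single `CONFIG=...` line for the .env file."""
    compact = json.dumps(json.loads(config_text), separators=(",", ":"))
    return f"CONFIG={compact}"


def update_env_file(env_path, config_path):
    """Replace (or add) the CONFIG entry in .env, keeping all other entries."""
    env_file = Path(env_path)
    config_text = Path(config_path).read_text(encoding="utf-8")
    lines = []
    if env_file.exists():
        lines = env_file.read_text(encoding="utf-8").splitlines()
    # Drop any previous CONFIG entry so the file never holds two of them.
    kept = [line for line in lines if not line.startswith("CONFIG=")]
    kept.append(config_env_line(config_text))
    env_file.write_text("\n".join(kept) + "\n", encoding="utf-8")
```

After `update_env_file(".env", "/path/to/your/config.json")`, the scraper can be started with just `docker run -it --env-file=.env algolia/docsearch-scraper`.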