algolia / docsearch

:blue_book: The easiest way to add search to your documentation.
https://docsearch.algolia.com
MIT License

Run the crawl from the Docker image Windows #922

Closed Mitcal closed 4 years ago

Mitcal commented 4 years ago

In the "Run Your Own" documentation page, there is a section called "Run the crawl from the Docker image".

I'm on a Windows machine (obviously) and new to Docker, and I could follow everything up to this point. But this line is for Linux / Mac users:

docker run -it --env-file=.env -e "CONFIG=$(cat /path/to/your/config.json | jq -r tostring)" algolia/docsearch-scraper

I tried various ways to be clever and tried something like this:

docker run -it --env-file=.env -e "CONFIG=$(jq -r tostring /path/to/your/config.json)" algolia/docsearch-scraper

Then I realised the problem is the Windows command line, which has no equivalent of the $() expression. There may be a way, as described on this page, but I couldn't figure it out.

In the end I managed to get around it by putting the JSON config straight into the .env file. First by running:

jq -r tostring /path/to/your/config.json

Then copy-paste the output of this command into the .env file, like so:

CONFIG={"json":"config"}

I was hoping to find a tidy way to do this and propose an update to the documentation, but alas I am posting it to the issue forum instead. At least I found a workaround.
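
For readers without jq, the same workaround can be scripted. What follows is a minimal Python sketch, assuming Python 3 is installed; the helper names `upsert_config_line` and `update_env_file` and the default file names are illustrative and not part of DocSearch. It compacts the JSON (similar to what `jq -r tostring` prints) and writes it as the single `CONFIG=` line of the .env file, replacing any earlier one.

```python
import json
from pathlib import Path


def upsert_config_line(env_text: str, config_text: str) -> str:
    """Return env_text with exactly one CONFIG=<compact JSON> line.

    Hypothetical helper for illustration; not part of docsearch-scraper.
    """
    # Parsing first means invalid JSON fails here, not inside the container.
    compact = json.dumps(json.loads(config_text), separators=(",", ":"))
    # Drop any previous CONFIG line and keep every other variable untouched.
    kept = [line for line in env_text.splitlines() if not line.startswith("CONFIG=")]
    kept.append(f"CONFIG={compact}")
    return "\n".join(kept) + "\n"


def update_env_file(env_path: str = ".env", config_path: str = "config.json") -> None:
    """Rewrite env_path in place using the JSON found at config_path."""
    env_file = Path(env_path)
    env_text = env_file.read_text(encoding="utf-8") if env_file.exists() else ""
    config_text = Path(config_path).read_text(encoding="utf-8")
    env_file.write_text(upsert_config_line(env_text, config_text), encoding="utf-8")
```

After calling `update_env_file()` from the project directory, running docker with only `--env-file=.env` should pick up the config, so no `$()` substitution is needed.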

s-pace commented 4 years ago

👋 @Mitcal

Thank you for raising the issue. You found both the issue and the workaround. Indeed, the command cat is not available on Windows ...

What I would still recommend is to inline the configuration, as in the following example:

docker run -it --env-file=.env -e "CONFIG={\"index_name\":\"treaty\",\"start_urls\":[\"https://competent-lalande-599ab3.netlify.com/docs/homepage\"],\"selectors\":{\"lvl0\":\"#content header h1\",\"lvl1\":\"#content article h1\",\"lvl2\":\"#content section h3\",\"lvl3\":\"#content section h4\",\"lvl4\":\"#content section h5\",\"lvl5\":\"#content section h6\",\"text\":\"#content header p,#content section p,#content section ol\"}}" algolia/docsearch-scraper

You can generate this command by using jq too.
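
Another way to sidestep the escaping is to let a script pass the config to docker as a single argument. This is a minimal Python sketch under the assumption that Python 3.9+ and docker are on the PATH; `build_docker_args` and `run_scraper` are hypothetical helper names. Because the arguments are handed to docker as a list without going through a shell, no `\"` escaping is needed.

```python
import json
import subprocess
from pathlib import Path


def build_docker_args(config_path: str, env_file: str = ".env") -> list[str]:
    """Build the docker argument list for a scraper run (illustrative helper)."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    # Compact form, so the whole config fits in one environment variable.
    compact = json.dumps(config, separators=(",", ":"))
    return [
        "docker", "run", "-it",
        f"--env-file={env_file}",
        "-e", f"CONFIG={compact}",
        "algolia/docsearch-scraper",
    ]


def run_scraper(config_path: str, env_file: str = ".env") -> int:
    """Run docker without a shell and return its exit code."""
    return subprocess.run(build_docker_args(config_path, env_file)).returncode
```

Calling `run_scraper("config.json")` from a terminal would then behave like the inlined command above, on Windows as well as Linux or Mac.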

This is unfortunately something users have already encountered, and we keep this issue on our radar. However, this is a really low priority right now.

sowousmane commented 3 years ago

Ohhh my God, I spent more than two before finding your answer, and when I found it I was so happy. But here is the response I got when I executed your command, @s-pace:

`


C:\Users\computer\Documents\docusaurus-yt-example>docker run -it --env-file=.env -e "CONFIG={\"index_name\":\"treaty\",\"start_urls\":[\"https://competent-lalande-599ab3.netlify.com/docs/homepage\"],\"selectors\":{\"lvl0\":\"#content header h1\",\"lvl1\":\"#content article h1\",\"lvl2\":\"#content section h3\",\"lvl3\":\"#content section h4\",\"lvl4\":\"#content section h5\",\"lvl5\":\"#content section h6\",\"text\":\"#content header p,#content section p,#content section ol\"}}" algolia/docsearch-scraper
> Ignored: from start url https://competent-lalande-599ab3.netlify.app/docs/homepage

Crawling issue: nbHits 0 for treaty

C:\Users\computer\Documents\docusaurus-yt-example>

`

s-pace commented 3 years ago

Glad to help. Are you sure the CSS selectors you have defined match elements on your pages? Does your website require JS to be completely rendered? Let us know!

sowousmane commented 3 years ago

When you say CSS selectors, are you talking about this?

"selectors": {
    "lvl0": "#content header h1",
    "lvl1": "#content article h1",
    "lvl2": "#content section h3",
    "lvl3": "#content section h4",
    "lvl4": "#content section h5",
    "lvl5": "#content section h6",
    "text": "#content header p,#content section p,#content section ol"
  }

sowousmane commented 3 years ago

I was following this video on YouTube, and everything worked well up to this point.

shortcuts commented 3 years ago

Hey @sowousmane,

Here's a config I just made that seems to work for your website. However, we are not able to retrieve all the URLs; updating your sitemap.xml would solve this issue!

Does your website require JS to be completely rendered?

If your website requires some client-side rendering, make sure to also pass the js_render option

s-pace commented 3 years ago

Yes, I was about to write that. By curling your website you can see whether it requires JavaScript to be completely rendered. If it does, the elements don't exist in the raw HTML and the selectors match nothing.
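
For anyone without curl at hand, the same check can be roughly sketched in Python using only the standard library. The names `TagCollector`, `summarize_static_html` and `fetch_raw_html` are hypothetical, and the check is deliberately crude: it only records which tag names and id attributes appear in the HTML as served, before any JavaScript runs, rather than evaluating full CSS selectors.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TagCollector(HTMLParser):
    """Record tag names and id attributes seen in static HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: set[str] = set()
        self.ids: set[str] = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def summarize_static_html(html: str) -> TagCollector:
    """Parse html and return the collector holding its tags and ids."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector


def fetch_raw_html(url: str) -> str:
    """Download the page as curl would, without executing any JavaScript."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

If `summarize_static_html(fetch_raw_html(url))` shows no `content` id and no `h1` tag for a selector such as `#content header h1`, the page is probably rendered client-side.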

Sahillather002 commented 1 month ago

I faced the same issue and fixed it with the config.json file below, plus binding the port locally so that Docker can connect to the site.

{
  "index_name": "test",
  "start_urls": [
    "http://host.docker.internal:3000"
  ],
  "selectors": {
    "lvl0": {
      "selector": "header h1, h1",
      "type": "css",
      "default_value": "Home"
    },
    "lvl1": {
      "selector": "article h1, h2",
      "type": "css"
    },
    "lvl2": {
      "selector": "section h3, h4",
      "type": "css"
    },
    "lvl3": {
      "selector": "section h4, h5",
      "type": "css"
    },
    "lvl4": {
      "selector": "section h5, h6",
      "type": "css"
    },
    "lvl5": {
      "selector": "section h6",
      "type": "css"
    },
    "text": {
      "selector": "header p, section p, section ol",
      "type": "css"
    }
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "type",
      "section",
      "level"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "nb_hits": 5
}

To bind the port, this one worked :slight_smile:

python -m http.server --directory build --bind 0.0.0.0 3000

I don't know why, but it was hectic!
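
Before starting the crawl, it can help to confirm that the local server is actually listening. This is a minimal Python sketch, assuming the site is served on port 3000 as above; `is_port_open` is a hypothetical helper name. Run on the host, it would be pointed at `localhost`, whereas `host.docker.internal` is the name the container uses to reach the host under Docker Desktop.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unresolvable host: treat all as "not open".
        return False
```

For example, `is_port_open("localhost", 3000)` returning False would suggest the http.server command is not running or is bound to a different port.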