dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Windows Docker compose #1545

Open Nurech opened 1 year ago

Nurech commented 1 year ago

Describe the bug

I am following the standard setup guide, but fail to get fscrawler started. I call docker-compose up -d from "masters". Elasticsearch and Kibana fire up. I have created a folder "masters" on my Desktop with the following layout:

.
├── config
│   └── job_name
│       └── _settings.yaml
├── data
│   └── <your files>
├── logs
│   └── <fscrawler logs>
├── docker-compose.yml
└── .env

.env

STACK_VERSION=8.5.1
KIBANA_PORT=5601
ES_PORT=9200
FSCRAWLER_PORT=8080
FSCRAWLER_VERSION=latest
ELASTIC_PASSWORD=changeme
KIBANA_PASSWORD=changeme
CLUSTER_NAME=elastic
LICENSE=trial
PWD=C:\\Users\\xxx\\Desktop\\masters

docker-compose.yml (everything else is standard)

 # FSCrawler
  fscrawler:
    image: dadoonet/fscrawler:$FSCRAWLER_VERSION
    container_name: fscrawler
    restart: always
    volumes:
      - ${PWD}/data
      - ${PWD}/config
      - ${PWD}/logs
    depends_on:
      elasticsearch:
        condition: service_healthy
    ports:
      - ${FSCRAWLER_PORT}:8080
    command: fscrawler idx --restart --rest

JAVA_HOME: C:\Program Files\Java\jdk-11.0.11

Job Settings

---
name: "idx"
fs:
  url: "C:\\Users\\xxx\\Desktop\\masters"
  indexed_chars: 100%
  lang_detect: true
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "https://localhost:9200"
  username: "elastic"
  password: "changeme"
  ssl_verification: false
rest :
  url: "http://localhost:8080"

Logs

Exception in thread "main" java.util.NoSuchElementException
    at java.base/java.util.Scanner.throwFor(Scanner.java:937)
    at java.base/java.util.Scanner.next(Scanner.java:1478)
    at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:254)
Exception in thread "main" java.util.NoSuchElementException
    at java.base/java.util.Scanner.throwFor(Scanner.java:937)
    at java.base/java.util.Scanner.next(Scanner.java:1478)
    at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:254)
Exception in thread "main" java.util.NoSuchElementException
    at java.base/java.util.Scanner.throwFor(Scanner.java:937)
    at java.base/java.util.Scanner.next(Scanner.java:1478)
...
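
For context, this is the exception java.util.Scanner raises when asked for the next token of a stream that is already at end-of-file, for example stdin in a container started without an attached terminal. A minimal illustration (plain Java, not FSCrawler code):

    import java.util.Scanner;

    public class ScannerEof {
        public static void main(String[] args) {
            // Reading from stdin; if the stream is already closed/empty
            // (e.g. `java ScannerEof < /dev/null`), next() finds no token
            // and throws java.util.NoSuchElementException.
            Scanner scanner = new Scanner(System.in);
            String choice = scanner.next();
            System.out.println("Read: " + choice);
        }
    }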

Expected behavior

Expecting FSCrawler to start and index the masters/data folder.

Versions:

Nurech commented 1 year ago

Realized the error was in ${PWD}, and job_name should have been idx.
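
The volumes in my first snippet also only listed host paths; each entry needs an explicit host:container mapping. A sketch of what that section should look like, mirroring the container paths in the full compose file posted below (/root/.fscrawler, /tmp/es, /usr/share/fscrawler/logs):

    volumes:
      - ${PWD}/config:/root/.fscrawler
      - ${PWD}/data:/tmp/es:ro
      - ${PWD}/logs:/usr/share/fscrawler/logs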

dadoonet commented 1 year ago

What kind of error was that for PWD?

Nurech commented 1 year ago

Maybe I was wrong. I assumed it was just a syntax issue, but I guess not.

fscrawler | Exception in thread "main" java.util.NoSuchElementException
fscrawler | at java.base/java.util.Scanner.throwFor(Scanner.java:937)
fscrawler | at java.base/java.util.Scanner.next(Scanner.java:1478)
fscrawler | at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:254)

.\config\idx\_settings.yaml

--- 
name: "idx" 
fs: 
  url: "/tmp/es" 
  indexed_chars: 100 
  lang_detect: true 
  continue_on_error: true 
  index_folders: true 
  update_rate: "5s" 
  ocr: 
    language: "eng" 
    enabled: true 
    pdf_strategy: "ocr_and_text" 
elasticsearch: 
  nodes: 
    - url: "https://elasticsearch:9200" 
  username: "elastic" 
  password: "changeme" 
  ssl_verification: true 
rest : 
  url: "http://fscrawler:8080")  

.\docker-compose.yml

---
version: "2.2"

services:
  setup:
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
    user: "0"
    command: >
      bash -c '
        if [ x${ELASTIC_PASSWORD} == x ]; then
          echo "Set the ELASTIC_PASSWORD environment variable in the .env file";
          exit 1;
        elif [ x${KIBANA_PASSWORD} == x ]; then
          echo "Set the KIBANA_PASSWORD environment variable in the .env file";
          exit 1;
        fi;
        if [ ! -f certs/ca.zip ]; then
          echo "Creating CA";
          bin/elasticsearch-certutil ca --silent --pem -out config/certs/ca.zip;
          unzip config/certs/ca.zip -d config/certs;
        fi;
        if [ ! -f certs/certs.zip ]; then
          echo "Creating certs";
          echo -ne \
          "instances:\n"\
          "  - name: elasticsearch\n"\
          "    dns:\n"\
          "      - elasticsearch\n"\
          "      - localhost\n"\
          "    ip:\n"\
          "      - 127.0.0.1\n"\
          > config/certs/instances.yml;
          bin/elasticsearch-certutil cert --silent --pem -out config/certs/certs.zip --in config/certs/instances.yml --ca-cert config/certs/ca/ca.crt --ca-key config/certs/ca/ca.key;
          unzip config/certs/certs.zip -d config/certs;
        fi;
        echo "Setting file permissions"
        chown -R root:root config/certs;
        find . -type d -exec chmod 750 \{\} \;;
        find . -type f -exec chmod 640 \{\} \;;
        echo "Waiting for Elasticsearch availability";
        until curl -s --cacert config/certs/ca/ca.crt https://elasticsearch:9200 | grep -q "missing authentication credentials"; do sleep 30; done;
        echo "Setting kibana_system password";
        until curl -s -X POST --cacert config/certs/ca/ca.crt -u elastic:${ELASTIC_PASSWORD} -H "Content-Type: application/json" https://elasticsearch:9200/_security/user/kibana_system/_password -d "{\"password\":\"${KIBANA_PASSWORD}\"}" | grep -q "^{}"; do sleep 10; done;
        echo "All done!";
      '
    healthcheck:
      test: ["CMD-SHELL", "[ -f config/certs/elasticsearch/elasticsearch.crt ]"]
      interval: 1s
      timeout: 5s
      retries: 120

  elasticsearch:
    depends_on:
      setup:
        condition: service_healthy
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
      - esdata:/usr/share/elasticsearch/data
    ports:
      - ${ES_PORT}:9200
    environment:
      - node.name=elasticsearch
      - cluster.name=${CLUSTER_NAME}
      - cluster.initial_master_nodes=elasticsearch
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - bootstrap.memory_lock=true
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=true
      - xpack.security.http.ssl.key=certs/elasticsearch/elasticsearch.key
      - xpack.security.http.ssl.certificate=certs/elasticsearch/elasticsearch.crt
      - xpack.security.http.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.http.ssl.verification_mode=certificate
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.key=certs/elasticsearch/elasticsearch.key
      - xpack.security.transport.ssl.certificate=certs/elasticsearch/elasticsearch.crt
      - xpack.security.transport.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.transport.ssl.verification_mode=certificate
      - xpack.license.self_generated.type=${LICENSE}
    mem_limit: ${MEM_LIMIT}
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s --cacert config/certs/ca/ca.crt https://localhost:9200 | grep -q 'missing authentication credentials'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120

  kibana:
    depends_on:
      elasticsearch:
        condition: service_healthy
    image: docker.elastic.co/kibana/kibana:${STACK_VERSION}
    volumes:
      - certs:/usr/share/kibana/config/certs
      - kibanadata:/usr/share/kibana/data
    ports:
      - ${KIBANA_PORT}:5601
    environment:
      - SERVERNAME=kibana
      - ELASTICSEARCH_HOSTS=https://elasticsearch:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD}
      - ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES=config/certs/ca/ca.crt
      - ENTERPRISESEARCH_HOST=http://enterprisesearch:${ENTERPRISE_SEARCH_PORT}
    mem_limit: ${MEM_LIMIT}
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s -I http://localhost:5601 | grep -q 'HTTP/1.1 302 Found'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120

  # FSCrawler
  fscrawler:
    image: dadoonet/fscrawler
    container_name: fscrawler
    restart: always
    volumes:
      - .\config:/root/.fscrawler
      - .\logs:/usr/share/fscrawler/logs
      - .\data:/tmp/es:ro
    depends_on:
      elasticsearch:
        condition: service_healthy
    ports:
      - ${FSCRAWLER_PORT}:8080
    command: fscrawler idx --restart --rest

volumes:
  certs:
    driver: local
  esdata:
    driver: local
  kibanadata:
    driver: local

.\.env

STACK_VERSION=8.5.1
KIBANA_PORT=5601
ES_PORT=9200
FSCRAWLER_PORT=8080
FSCRAWLER_VERSION=latest
ELASTIC_PASSWORD=changeme
KIBANA_PASSWORD=changeme
CLUSTER_NAME=elastic
LICENSE=trial

.\start.bat

docker pull docker.elastic.co/elasticsearch/elasticsearch:8.5.1
docker pull docker.elastic.co/kibana/kibana:8.5.1
docker pull dadoonet/fscrawler
docker-compose up -d

fscrawler image in docker

PATH
/usr/local/openjdk-17/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

JAVA_HOME
/usr/local/openjdk-17

LANG
C.UTF-8

JAVA_VERSION
17.0.2

Mounts
/TMP/ES
/host_mnt/c/Users/xxx/fscrawler/data

/USR/SHARE/FSCRAWLER/LOGS
/host_mnt/c/Users/xxx/fscrawler/logs

/ROOT/.FSCRAWLER
/host_mnt/c/Users/xxx/fscrawler/config

Port
8080/tcp
localhost:8080

dadoonet commented 1 year ago

Could you try with:

image: dadoonet/fscrawler:2.10-SNAPSHOT

And

docker pull dadoonet/fscrawler:2.10-SNAPSHOT

EDIT: actually no. You are using the latest build which should be good.

dadoonet commented 1 year ago

This looks weird to me: /ROOT/.FSCRAWLER. It's all uppercase, whereas FSCrawler expects /root/.fscrawler...

Nurech commented 1 year ago

I'll look into it. But that's how Docker reports it. Maybe there's a way to control the capitalization.

Docker: 2.3.0.5

I will try to upgrade docker.


Nurech commented 1 year ago

After upgrading Docker, I got additional messages:

.\start.bat

8.5.1: Pulling from elasticsearch/elasticsearch
Digest: sha256:d784066422aec9f66ae424f692d2416057e78853ab015915a04530570c955cc8
Status: Image is up to date for docker.elastic.co/elasticsearch/elasticsearch:8.5.1
docker.elastic.co/elasticsearch/elasticsearch:8.5.1
8.5.1: Pulling from kibana/kibana
Digest: sha256:3266a417b69207dab8da9a732d93c11512944f2ec88a9cd169bfbb0d6fd878f5
Status: Image is up to date for docker.elastic.co/kibana/kibana:8.5.1
docker.elastic.co/kibana/kibana:8.5.1
Using default tag: latest
latest: Pulling from dadoonet/fscrawler
Digest: sha256:89ef0cc6abd8825d33e993c2532c14e38841b5f09134609cbb3d3b8adfd79117
Status: Image is up to date for dadoonet/fscrawler:latest
docker.io/dadoonet/fscrawler:latest
time="2022-11-26T13:37:52+02:00" level=warning msg="The \"MEM_LIMIT\" variable is not set. Defaulting to a blank string."
error while interpolating services.elasticsearch.mem_limit: failed to cast to expected type: invalid size: ''

So I added the MEM_LIMIT variable to .env:

STACK_VERSION=8.5.1
KIBANA_PORT=5601
ES_PORT=9200
FSCRAWLER_PORT=8080
FSCRAWLER_VERSION=latest
ELASTIC_PASSWORD=changeme
KIBANA_PASSWORD=changeme
CLUSTER_NAME=elastic
LICENSE=trial
MEM_LIMIT=4000m

Now Elasticsearch is unable to start, and gives little information.

ERROR: Elasticsearch exited unexpectedly

So I had to make sure Docker has more memory than MEM_LIMIT by going into the Docker settings.

Gave Docker 10 GB.
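
If Docker Desktop runs on the WSL 2 backend, the memory cap can also be raised in %UserProfile%\.wslconfig instead of the settings UI (a sketch, assuming that backend):

    # %UserProfile%\.wslconfig -- only applies to the WSL 2 backend
    [wsl2]
    memory=10GB

followed by wsl --shutdown and a restart of Docker Desktop.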

Now everything fires up, but the problem persists.

[+] Running 5/5
 - Network fscrawler_default            Created                                                                    0.1s
 - Container fscrawler-setup-1          Healthy                                                                    2.4s
 - Container fscrawler-elasticsearch-1  Healthy                                                                   33.8s
 - Container fscrawler                  Started                                                                   34.3s
 - Container fscrawler-kibana-1         Started                                                                   34.2s

fscrawler docker image

2022-11-26 13:52:43 Exception in thread "main" java.util.NoSuchElementException
2022-11-26 13:52:43     at java.base/java.util.Scanner.throwFor(Scanner.java:937)
2022-11-26 13:52:43     at java.base/java.util.Scanner.next(Scanner.java:1478)

Nurech commented 1 year ago

I swapped out _settings.yaml (I had some typos there) and then it started booting up.

Now fscrawler is running, but it prints other messages. I suspect these are warnings rather than errors.

2022-11-26 19:10:01 SLF4J: No SLF4J providers were found.
2022-11-26 19:10:01 SLF4J: Defaulting to no-operation (NOP) logger implementation
2022-11-26 19:10:01 SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
2022-11-26 19:10:01 SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
2022-11-26 19:10:01 SLF4J: Ignoring binding found at [jar:file:/usr/share/fscrawler/lib/log4j-slf4j-impl-2.19.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2022-11-26 19:10:01 SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.

After that nothing happens; I'm not sure how to debug this.
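
The container output can at least be tailed from the host while it sits there (assuming the container name fscrawler from the compose file):

    docker logs -f fscrawler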

Edit:

Going manually into the container terminal, I am able to start fscrawler indexing:

# bash bin/fscrawler idx --restart --rest
17:23:46,664 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [145.3mb/2.4gb=5.69%], RAM [3gb/9.9gb=30.09%], Swap [1023.9mb/1023.9mb=100.0%].
17:23:46,674 INFO  [f.console] No job specified. Here is the list of existing jobs:
17:23:46,697 INFO  [f.console] [1] - idx
17:23:46,699 INFO  [f.console] Choose your job [1-1]...
1
17:24:03,773 WARN  [f.p.e.c.f.c.FsCrawlerCli] `url` is not set. Please define it. Falling back to default: [/tmp/es].
17:24:03,851 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
17:24:03,853 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
17:24:04,215 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/share/fscrawler/lib/log4j-slf4j-impl-2.19.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
17:24:05,751 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.5.1
17:24:05,777 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
17:24:06,023 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.5.1
17:24:06,134 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [idx] for [/tmp/es] every [15m]
17:24:06,404 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.

Docker Compose doesn't pass the command properly?
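
For reference, the manual workaround is to open a shell in the running container and launch the job by hand (assuming the container name fscrawler from the compose file):

    docker exec -it fscrawler bash
    bash bin/fscrawler idx --restart --rest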

Nurech commented 1 year ago

I am not sure how to get the Docker Compose command to work, but I am leaving this job config here in case anyone else has the same issue.

---
name: "idx"
fs:
  url: "/tmp/es"
  indexed_chars: 100%
  lang_detect: true
  continue_on_error: true
  update_rate: "5s"
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "https://elasticsearch:9200"
  username: "elastic"
  password: "changeme"
  ssl_verification: false
rest :
  url: "http://fscrawler:8080"

# Use this to start the job manually from the Docker container terminal:
# fscrawler idx --rest --restart --debug