Nurech opened this issue 1 year ago
Realized the error was in ${PWD}, and job_name should have been idx.

What kind of error was that for PWD?

Maybe I was wrong. I assumed it was just a syntax issue, but I guess not.
fscrawler | Exception in thread "main" java.util.NoSuchElementException
fscrawler | at java.base/java.util.Scanner.throwFor(Scanner.java:937)
fscrawler | at java.base/java.util.Scanner.next(Scanner.java:1478)
fscrawler | at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:254)
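For context: the stack trace above is FSCrawler's interactive fallback dying. When the CLI cannot use the job settings it was given (here, presumably because the YAML below does not parse), it falls back to the "Choose your job" prompt visible later in this thread, and Scanner.next() throws NoSuchElementException because a detached container has no stdin to read from. A quick way to surface the prompt or the underlying parsing error instead of the bare stack trace (standard Docker Compose usage; service and job names as in the files below):

# Run the same command attached, with a TTY and stdin, instead of detached:
docker-compose run --rm fscrawler fscrawler idx --restart --rest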
.\config\idx\_settings.yaml
---
name: "idx"
fs:
  url: "/tmp/es"
  indexed_chars: 100
  lang_detect: true
  continue_on_error: true
  index_folders: true
  update_rate: "5s"
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "https://elasticsearch:9200"
  username: "elastic"
  password: "changeme"
  ssl_verification: true
rest :
  url: "http://fscrawler:8080")
.\docker-compose.yml
---
version: "2.2"
services:
  setup:
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
    user: "0"
    command: >
      bash -c '
        if [ x${ELASTIC_PASSWORD} == x ]; then
          echo "Set the ELASTIC_PASSWORD environment variable in the .env file";
          exit 1;
        elif [ x${KIBANA_PASSWORD} == x ]; then
          echo "Set the KIBANA_PASSWORD environment variable in the .env file";
          exit 1;
        fi;
        if [ ! -f certs/ca.zip ]; then
          echo "Creating CA";
          bin/elasticsearch-certutil ca --silent --pem -out config/certs/ca.zip;
          unzip config/certs/ca.zip -d config/certs;
        fi;
        if [ ! -f certs/certs.zip ]; then
          echo "Creating certs";
          echo -ne \
          "instances:\n"\
          "  - name: elasticsearch\n"\
          "    dns:\n"\
          "      - elasticsearch\n"\
          "      - localhost\n"\
          "    ip:\n"\
          "      - 127.0.0.1\n"\
          > config/certs/instances.yml;
          bin/elasticsearch-certutil cert --silent --pem -out config/certs/certs.zip --in config/certs/instances.yml --ca-cert config/certs/ca/ca.crt --ca-key config/certs/ca/ca.key;
          unzip config/certs/certs.zip -d config/certs;
        fi;
        echo "Setting file permissions"
        chown -R root:root config/certs;
        find . -type d -exec chmod 750 \{\} \;;
        find . -type f -exec chmod 640 \{\} \;;
        echo "Waiting for Elasticsearch availability";
        until curl -s --cacert config/certs/ca/ca.crt https://elasticsearch:9200 | grep -q "missing authentication credentials"; do sleep 30; done;
        echo "Setting kibana_system password";
        until curl -s -X POST --cacert config/certs/ca/ca.crt -u elastic:${ELASTIC_PASSWORD} -H "Content-Type: application/json" https://elasticsearch:9200/_security/user/kibana_system/_password -d "{\"password\":\"${KIBANA_PASSWORD}\"}" | grep -q "^{}"; do sleep 10; done;
        echo "All done!";
      '
    healthcheck:
      test: ["CMD-SHELL", "[ -f config/certs/elasticsearch/elasticsearch.crt ]"]
      interval: 1s
      timeout: 5s
      retries: 120

  elasticsearch:
    depends_on:
      setup:
        condition: service_healthy
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
      - esdata:/usr/share/elasticsearch/data
    ports:
      - ${ES_PORT}:9200
    environment:
      - node.name=elasticsearch
      - cluster.name=${CLUSTER_NAME}
      - cluster.initial_master_nodes=elasticsearch
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - bootstrap.memory_lock=true
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=true
      - xpack.security.http.ssl.key=certs/elasticsearch/elasticsearch.key
      - xpack.security.http.ssl.certificate=certs/elasticsearch/elasticsearch.crt
      - xpack.security.http.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.http.ssl.verification_mode=certificate
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.key=certs/elasticsearch/elasticsearch.key
      - xpack.security.transport.ssl.certificate=certs/elasticsearch/elasticsearch.crt
      - xpack.security.transport.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.transport.ssl.verification_mode=certificate
      - xpack.license.self_generated.type=${LICENSE}
    mem_limit: ${MEM_LIMIT}
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s --cacert config/certs/ca/ca.crt https://localhost:9200 | grep -q 'missing authentication credentials'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120

  kibana:
    depends_on:
      elasticsearch:
        condition: service_healthy
    image: docker.elastic.co/kibana/kibana:${STACK_VERSION}
    volumes:
      - certs:/usr/share/kibana/config/certs
      - kibanadata:/usr/share/kibana/data
    ports:
      - ${KIBANA_PORT}:5601
    environment:
      - SERVERNAME=kibana
      - ELASTICSEARCH_HOSTS=https://elasticsearch:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD}
      - ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES=config/certs/ca/ca.crt
      - ENTERPRISESEARCH_HOST=http://enterprisesearch:${ENTERPRISE_SEARCH_PORT}
    mem_limit: ${MEM_LIMIT}
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s -I http://localhost:5601 | grep -q 'HTTP/1.1 302 Found'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120

  # FSCrawler
  fscrawler:
    image: dadoonet/fscrawler
    container_name: fscrawler
    restart: always
    volumes:
      - .\config:/root/.fscrawler
      - .\logs:/usr/share/fscrawler/logs
      - .\data:/tmp/es:ro
    depends_on:
      elasticsearch:
        condition: service_healthy
    ports:
      - ${FSCRAWLER_PORT}:8080
    command: fscrawler idx --restart --rest

volumes:
  certs:
    driver: local
  esdata:
    driver: local
  kibanadata:
    driver: local
.\.env
STACK_VERSION=8.5.1
KIBANA_PORT=5601
ES_PORT=9200
FSCRAWLER_PORT=8080
FSCRAWLER_VERSION=latest
ELASTIC_PASSWORD=changeme
KIBANA_PASSWORD=changeme
CLUSTER_NAME=elastic
LICENSE=trial
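One small observation on the file above: FSCRAWLER_VERSION is defined but never referenced, because the compose file uses the bare dadoonet/fscrawler image with no tag. If the variable is meant to pin the image, the service line would need to interpolate it, e.g.:

# in docker-compose.yml, under the fscrawler service:
image: dadoonet/fscrawler:${FSCRAWLER_VERSION}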
.\start.bat
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.5.1
docker pull docker.elastic.co/kibana/kibana:8.5.1
docker pull dadoonet/fscrawler
docker-compose up -d
fscrawler image in Docker (environment, mounts, and ports as reported by Docker Desktop):

Environment:
  PATH=/usr/local/openjdk-17/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
  JAVA_HOME=/usr/local/openjdk-17
  LANG=C.UTF-8
  JAVA_VERSION=17.0.2

Mounts:
  /TMP/ES                   <- /host_mnt/c/Users/xxx/fscrawler/data
  /USR/SHARE/FSCRAWLER/LOGS <- /host_mnt/c/Users/xxx/fscrawler/logs
  /ROOT/.FSCRAWLER          <- /host_mnt/c/Users/xxx/fscrawler/config

Port:
  8080/tcp -> localhost:8080
Could you try with:
image: dadoonet/fscrawler:2.10-SNAPSHOT
And
docker pull dadoonet/fscrawler:2.10-SNAPSHOT
EDIT: actually no. You are using the latest build which should be good.
This looks weird to me: /ROOT/.FSCRAWLER. It is all uppercase, where FSCrawler expects /root/.fscrawler...
I'll look into it. But that's how Docker reports it. Maybe there's a way to control the capitalization.
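One thing worth trying (an untested guess): write the bind mounts with forward slashes, which Compose also accepts on Windows, and check whether the container then reports the lowercase paths FSCrawler expects:

# fscrawler service volumes, using ./ instead of .\ paths:
volumes:
  - ./config:/root/.fscrawler
  - ./logs:/usr/share/fscrawler/logs
  - ./data:/tmp/es:ro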
Docker Desktop: 2.3.0.5. I will try to upgrade Docker.
After upgrading Docker I got additional messages:
.\start.bat
8.5.1: Pulling from elasticsearch/elasticsearch
Digest: sha256:d784066422aec9f66ae424f692d2416057e78853ab015915a04530570c955cc8
Status: Image is up to date for docker.elastic.co/elasticsearch/elasticsearch:8.5.1
docker.elastic.co/elasticsearch/elasticsearch:8.5.1
8.5.1: Pulling from kibana/kibana
Digest: sha256:3266a417b69207dab8da9a732d93c11512944f2ec88a9cd169bfbb0d6fd878f5
Status: Image is up to date for docker.elastic.co/kibana/kibana:8.5.1
docker.elastic.co/kibana/kibana:8.5.1
Using default tag: latest
latest: Pulling from dadoonet/fscrawler
Digest: sha256:89ef0cc6abd8825d33e993c2532c14e38841b5f09134609cbb3d3b8adfd79117
Status: Image is up to date for dadoonet/fscrawler:latest
docker.io/dadoonet/fscrawler:latest
time="2022-11-26T13:37:52+02:00" level=warning msg="The \"MEM_LIMIT\" variable is not set. Defaulting to a blank string."
error while interpolating services.elasticsearch.mem_limit: failed to cast to expected type: invalid size: ''
So I added a MEM_LIMIT variable to .env:
STACK_VERSION=8.5.1
KIBANA_PORT=5601
ES_PORT=9200
FSCRAWLER_PORT=8080
FSCRAWLER_VERSION=latest
ELASTIC_PASSWORD=changeme
KIBANA_PASSWORD=changeme
CLUSTER_NAME=elastic
LICENSE=trial
MEM_LIMIT=4000m
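An alternative is to give the variable a fallback directly in the compose file, so a missing .env entry cannot break interpolation; Compose supports the standard ${VAR:-default} syntax:

# in docker-compose.yml, under the elasticsearch and kibana services:
mem_limit: ${MEM_LIMIT:-4g}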
Now Elasticsearch is unable to start, with little information:
ERROR: Elasticsearch exited unexpectedly
So I had to make sure Docker has more memory than MEM_LIMIT by going into the Docker settings. I gave Docker 10 GB.
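Rather than guessing, the memory actually available to the Docker VM can be read from the daemon before choosing a MEM_LIMIT:

# Total memory visible to the Docker daemon, in bytes:
docker info --format '{{.MemTotal}}'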
Now everything fires up, but the problem persists.
[+] Running 5/5
- Network fscrawler_default Created 0.1s
- Container fscrawler-setup-1 Healthy 2.4s
- Container fscrawler-elasticsearch-1 Healthy 33.8s
- Container fscrawler Started 34.3s
- Container fscrawler-kibana-1 Started 34.2s
fscrawler docker image logs:
2022-11-26 13:52:43 Exception in thread "main" java.util.NoSuchElementException
2022-11-26 13:52:43 at java.base/java.util.Scanner.throwFor(Scanner.java:937)
2022-11-26 13:52:43 at java.base/java.util.Scanner.next(Scanner.java:1478)
I swapped out _settings.yaml (I had some typos there) and then it started booting up.
Now fscrawler is running, but printing other messages. I suspect these are warnings rather than errors.
2022-11-26 19:10:01 SLF4J: No SLF4J providers were found.
2022-11-26 19:10:01 SLF4J: Defaulting to no-operation (NOP) logger implementation
2022-11-26 19:10:01 SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
2022-11-26 19:10:01 SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
2022-11-26 19:10:01 SLF4J: Ignoring binding found at [jar:file:/usr/share/fscrawler/lib/log4j-slf4j-impl-2.19.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2022-11-26 19:10:01 SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
After that nothing happens; I am not sure how to debug this.
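A few standard ways to dig further when the container goes quiet (plain Docker commands, plus the --debug flag noted at the bottom of this report):

# Follow the container's stdout/stderr:
docker logs -f fscrawler

# Or open a shell inside the container and run the job with debug logging:
docker exec -it fscrawler bash
bin/fscrawler idx --restart --rest --debug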
Edit: going manually into the container terminal, I am able to start the fscrawler indexing:
# bash bin/fscrawler idx --restart --rest
17:23:46,664 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [145.3mb/2.4gb=5.69%], RAM [3gb/9.9gb=30.09%], Swap [1023.9mb/1023.9mb=100.0%].
17:23:46,674 INFO [f.console] No job specified. Here is the list of existing jobs:
17:23:46,697 INFO [f.console] [1] - idx
17:23:46,699 INFO [f.console] Choose your job [1-1]...
1
17:24:03,773 WARN [f.p.e.c.f.c.FsCrawlerCli] `url` is not set. Please define it. Falling back to default: [/tmp/es].
17:24:03,851 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
17:24:03,853 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
17:24:04,215 WARN [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/share/fscrawler/lib/log4j-slf4j-impl-2.19.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
17:24:05,751 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.5.1
17:24:05,777 WARN [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
17:24:06,023 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.5.1
17:24:06,134 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [idx] for [/tmp/es] every [15m]
17:24:06,404 INFO [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
Does Docker Compose not pass the command properly? I am not sure how to get the Compose command: entry to work (see the note after the job file below). But I am leaving this job settings file here in case anyone else runs into the same issue.
---
name: "idx"
fs:
  url: "/tmp/es"
  indexed_chars: 100%
  lang_detect: true
  continue_on_error: true
  update_rate: "5s"
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "https://elasticsearch:9200"
  username: "elastic"
  password: "changeme"
  ssl_verification: false
rest:
  url: "http://fscrawler:8080"

# use this to start the job from the Docker terminal manually:
# fscrawler idx --rest --restart --debug
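On the Compose command: question above, one plausible explanation (untested) is that the command is passed fine but stdin is not: if anything makes the CLI fall back to its job-selection prompt, a detached container has no stdin and Scanner dies exactly as in the original stack trace. Compose can keep stdin open for the service:

# Hypothetical workaround in docker-compose.yml, under the fscrawler service:
# keep stdin open and allocate a TTY so an interactive prompt cannot crash it.
stdin_open: true
tty: true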
Describe the bug
I am following the standard setup guide, but fail to get fscrawler started. I call docker-compose up -d from "masters". Elastic and Kibana fire up. I have created a folder "masters" on the Desktop, where "masters" has the following layout: .env and docker-compose.yaml (everything else is standard).
JAVA_HOME: C:\Program Files\Java\jdk-11.0.11
Job Settings
Logs
Expected behavior
Expecting fscrawler to work and index the masters/data folder.
Versions: