shouari opened this issue 6 months ago
Where exactly did you put the job settings?
Here is the structure I used (the _settings.yaml is inside the documents_search folder):
.
├── config
│   └── documents_search
│       └── _settings.yaml
├── data
│   └── (files to index)
├── logs
│   └── (empty so far)
├── docker-compose.yml
└── .env
You need to change this line:
command: fscrawler doc_idx --restart --rest
To
command: fscrawler documents_search --restart --rest
Also note that you might have to change the name setting
name: "doc_idx"
To
name: "documents_search"
I made the changes above, but I still get the same error:
2024-03-26 16:06:59 20:06:59,762 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist. Exiting as we are in silent mode or no input available
Here is the debug output:
2024-03-26 16:28:12 20:28:12,416 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
2024-03-26 16:28:12 20:28:12,416 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
2024-03-26 16:28:12 20:28:12,417 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [documents_search]...
2024-03-26 16:28:12 20:28:12,419 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [documents_search]...
2024-03-26 16:28:12 20:28:12,421 DEBUG [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist.
2024-03-26 16:28:12 20:28:12,421 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist. Exiting as we are in silent mode or no input available.
Could you share the full logs and switch to trace mode?
How do I start trace mode? Here are all the logs I could find:
2024-03-26 16:47:22 20:47:22,025 WARN [f.p.e.c.f.c.FsCrawlerCli] --debug option has been deprecated. Use FS_JAVA_OPTS="-DLOG_LEVEL=debug" instead.
2024-03-26 16:47:22 20:47:22,118 INFO [f.console] [FSCrawler ASCII-art startup banner: 2.10-SNAPSHOT. "You know, for Files!" / Made from France with Love / Source: https://github.com/dadoonet/fscrawler/ / Documentation: https://fscrawler.readthedocs.io/]
2024-03-26 16:47:22 20:47:22,142 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [94mb/1.9gb=4.81%], RAM [165.1mb/7.6gb=2.11%], Swap [1.7gb/2gb=87.61%].
2024-03-26 16:47:22 20:47:22,144 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
2024-03-26 16:47:22 20:47:22,157 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
2024-03-26 16:47:22 20:47:22,157 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [documents_search]...
2024-03-26 16:47:22 20:47:22,159 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [documents_search]...
2024-03-26 16:47:22 20:47:22,160 DEBUG [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist.
2024-03-26 16:47:22 20:47:22,160 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist. Exiting as we are in silent mode or no input available.
Maybe try this:
command: fscrawler documents_search --trace --restart --rest
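By the way, --trace is deprecated, as the warning at the top of your logs shows; the supported way is to set the level through FS_JAVA_OPTS on the fscrawler service, something like:

environment:
  - FS_JAVA_OPTS=-DLOG_LEVEL=trace

Either form should produce the same trace output for this test.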
Here is the output for the trace command:
2024-03-27 08:56:31 12:56:31,820 WARN [f.p.e.c.f.c.FsCrawlerCli] --trace option has been deprecated. Use FS_JAVA_OPTS="-DLOG_LEVEL=trace" instead.
2024-03-27 08:56:31 12:56:31,852 INFO [f.console] [FSCrawler ASCII-art startup banner: 2.10-SNAPSHOT. "You know, for Files!" / Made from France with Love / Source: https://github.com/dadoonet/fscrawler/ / Documentation: https://fscrawler.readthedocs.io/]
2024-03-27 08:56:31 12:56:31,868 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [93.8mb/1.9gb=4.8%], RAM [874.1mb/7.6gb=11.18%], Swap [1.9gb/2gb=99.46%].
2024-03-27 08:56:31 12:56:31,870 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
2024-03-27 08:56:31 12:56:31,871 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
2024-03-27 08:56:31 12:56:31,871 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [documents_search]...
2024-03-27 08:56:31 12:56:31,872 TRACE [f.p.e.c.f.f.MetaFileHandler] Removing file _status.json from /root/.fscrawler/documents_search if exists
2024-03-27 08:56:31 12:56:31,872 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [documents_search]...
2024-03-27 08:56:31 12:56:31,873 TRACE [f.p.e.c.f.f.MetaFileHandler] Reading file _settings.yaml from /root/.fscrawler/documents_search
2024-03-27 08:56:31 12:56:31,874 TRACE [f.p.e.c.f.f.MetaFileHandler] Reading file _settings.json from /root/.fscrawler/documents_search
2024-03-27 08:56:31 12:56:31,874 DEBUG [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist.
2024-03-27 08:56:31 12:56:31,874 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist. Exiting as we are in silent mode or no input available.
Could you share again:
- the ./config/documents_search/_settings.yaml file
- the ./docker-compose.yml file
Thanks
Sure, here they are:
_settings.yaml
---
name: "documents_search"
fs:
  url: "C:\Users\shouari\Documents\Documentation_es\data"
  indexed_chars: 100%
  lang_detect: true
  continue_on_error: true
  ocr:
    language: "eng+fra"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "https://elasticsearch:9200"
  username: "elastic"
  password: "a123456"
  ssl_verification: false
rest:
  url: "http://fscrawler:8080"
docker-compose.yml
version: "2.2"

services:
  setup:
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
    user: "0"
    command: >
      bash -c '
        if [ x${ELASTIC_PASSWORD} == x ]; then
          echo "Set the ELASTIC_PASSWORD environment variable in the .env file";
          exit 1;
        elif [ x${KIBANA_PASSWORD} == x ]; then
          echo "Set the KIBANA_PASSWORD environment variable in the .env file";
          exit 1;
        fi;
        if [ ! -f certs/ca.zip ]; then
          echo "Creating CA";
          bin/elasticsearch-certutil ca --silent --pem -out config/certs/ca.zip;
          unzip config/certs/ca.zip -d config/certs;
        fi;
        if [ ! -f certs/certs.zip ]; then
          echo "Creating certs";
          echo -ne \
          "instances:\n"\
          "  - name: elasticsearch\n"\
          "    dns:\n"\
          "      - elasticsearch\n"\
          "      - localhost\n"\
          "    ip:\n"\
          "      - 127.0.0.1\n"\
          > config/certs/instances.yml;
          bin/elasticsearch-certutil cert --silent --pem -out config/certs/certs.zip --in config/certs/instances.yml --ca-cert config/certs/ca/ca.crt --ca-key config/certs/ca/ca.key;
          unzip config/certs/certs.zip -d config/certs;
        fi;
        echo "Setting file permissions"
        chown -R root:root config/certs;
        find . -type d -exec chmod 750 \{\} \;;
        find . -type f -exec chmod 640 \{\} \;;
        echo "Waiting for Elasticsearch availability";
        until curl -s --cacert config/certs/ca/ca.crt https://elasticsearch:9200 | grep -q "missing authentication credentials"; do sleep 30; done;
        echo "Setting kibana_system password";
        until curl -s -X POST --cacert config/certs/ca/ca.crt -u elastic:${ELASTIC_PASSWORD} -H "Content-Type: application/json" https://elasticsearch:9200/_security/user/kibana_system/_password -d "{\"password\":\"${KIBANA_PASSWORD}\"}" | grep -q "^{}"; do sleep 10; done;
        echo "All done!";
      '
    healthcheck:
      test: ["CMD-SHELL", "[ -f config/certs/elasticsearch/elasticsearch.crt ]"]
      interval: 1s
      timeout: 5s
      retries: 120

  elasticsearch:
    depends_on:
      setup:
        condition: service_healthy
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
      - esdata:/usr/share/elasticsearch/data
    ports:
      - ${ES_PORT}:9200
    environment:
      - node.name=elasticsearch
      - cluster.name=${CLUSTER_NAME}
      - cluster.initial_master_nodes=elasticsearch
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - bootstrap.memory_lock=true
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=true
      - xpack.security.http.ssl.key=certs/elasticsearch/elasticsearch.key
      - xpack.security.http.ssl.certificate=certs/elasticsearch/elasticsearch.crt
      - xpack.security.http.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.http.ssl.verification_mode=certificate
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.key=certs/elasticsearch/elasticsearch.key
      - xpack.security.transport.ssl.certificate=certs/elasticsearch/elasticsearch.crt
      - xpack.security.transport.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.transport.ssl.verification_mode=certificate
      - xpack.license.self_generated.type=${LICENSE}
    mem_limit: ${MEM_LIMIT}
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s --cacert config/certs/ca/ca.crt https://localhost:9200 | grep -q 'missing authentication credentials'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120

  kibana:
    depends_on:
      elasticsearch:
        condition: service_healthy
    image: docker.elastic.co/kibana/kibana:${STACK_VERSION}
    volumes:
      - certs:/usr/share/kibana/config/certs
      - kibanadata:/usr/share/kibana/data
    ports:
      - ${KIBANA_PORT}:5601
    environment:
      - SERVERNAME=kibana
      - ELASTICSEARCH_HOSTS=https://elasticsearch:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD}
      - ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES=config/certs/ca/ca.crt
      - ENTERPRISESEARCH_HOST=http://localhost:${ENTERPRISE_SEARCH_PORT}
    mem_limit: ${MEM_LIMIT}
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s -I http://localhost:5601 | grep -q 'HTTP/1.1 302 Found'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120

  # FSCrawler
  fscrawler:
    image: dadoonet/fscrawler:$FSCRAWLER_VERSION
    container_name: fscrawler
    restart: always
    volumes:
      - ../../test-documents/src/main/resources/documents/:/tmp/es:ro
      - ${PWD}/config:/root/.fscrawler
      - ${PWD}/logs:/usr/share/fscrawler/logs
      - ${PWD}/external:/usr/share/fscrawler/external
    depends_on:
      elasticsearch:
        condition: service_healthy
    ports:
      - ${FSCRAWLER_PORT}:8080
    command: fscrawler documents_search --trace --restart --rest

volumes:
  certs:
    driver: local
  # enterprisesearchdata:
  #   driver: local
  esdata:
    driver: local
  kibanadata:
    driver: local
Try this:
---
name: "documents_search"
fs:
  url: "/tmp/es"
  indexed_chars: 100%
  lang_detect: true
  continue_on_error: true
  ocr:
    language: "eng+fra"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "https://elasticsearch:9200"
  username: "elastic"
  password: "a123456"
  ssl_verification: false
rest:
  url: "http://fscrawler:8080"
And change this volume to mount your document folder instead:
- ../../test-documents/src/main/resources/documents/:/tmp/es:ro
If it does not work, please inspect your container and check that /root/.fscrawler has the documents_search dir, which contains the _settings.yaml file.
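As a sketch of how to check (it assumes the container_name: fscrawler from your compose file, and that the image lets docker compose run override the command, as your command: line suggests):

docker compose run --rm fscrawler ls -l /root/.fscrawler/documents_search
# _settings.yaml should be listed here

Also note that ${PWD} in the volume mapping is expanded by the shell that launches docker compose, so it has to be run from the project folder; on Windows shells, PWD may not be set as an environment variable at all, which would leave the config mount pointing at nothing.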
I still face the same issue. I'll have a look at /root/.fscrawler as suggested and check for the documents_search folder. If it does not exist, what might be the cause, in your view?
While running FSCrawler via docker compose, I face this error.
2024-03-25 17:47:19 21:47:19,717 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [doc_idx] does not exist. Exiting as we are in silent mode or no input available.
Here is _settings, and here is the fscrawler section of the docker-compose file. Can you please help with this?