dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

FSCrawler on Docker compose #1843

Open shouari opened 6 months ago

shouari commented 6 months ago

While running FSCrawler via docker compose, I get this error:

2024-03-25 17:47:19 21:47:19,717 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [doc_idx] does not exist. Exiting as we are in silent mode or no input available.

Here is my _settings.yaml:

name: "doc_idx"
fs:
  indexed_chars: 100%
  lang_detect: true
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "https://elasticsearch:9200"
  username: "elastic"
  password: "a123456"
  ssl_verification: false
rest:
  url: "http://fscrawler:8080"

And here is the fscrawler section of the docker-compose file:

fscrawler:
    image: dadoonet/fscrawler:$FSCRAWLER_VERSION
    container_name: fscrawler
    restart: always
    volumes:
      - ../../test-documents/src/main/resources/documents/:/tmp/es:ro
      - ${PWD}/config:/root/.fscrawler
      - ${PWD}/logs:/usr/share/fscrawler/logs
      - ${PWD}/external:/usr/share/fscrawler/external
    depends_on:
      elasticsearch:
         condition: service_healthy
    ports: 
      - ${FSCRAWLER_PORT}:8080
    command: fscrawler doc_idx --restart --rest

Can you please help with this?

dadoonet commented 6 months ago

Where exactly did you put the job settings?

shouari commented 6 months ago

Here is the structure I used (the _settings.yaml is inside the documents_search folder):

.
├── config
│   └── documents_search
│       └── _settings.yaml
├── data
│   └── (files to index)
├── logs
│   └── (empty folder so far)
├── docker-compose.yml
└── .env

dadoonet commented 6 months ago

You need to change this line:

command: fscrawler doc_idx --restart --rest

To

command: fscrawler documents_search --restart --rest

Also note that you might have to change the name setting:

name: "doc_idx"

To

name: "documents_search"
shouari commented 6 months ago

I did the mods above, but still the same error:

2024-03-26 16:06:59 20:06:59,762 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist. Exiting as we are in silent mode or no input available.

This is the debug output:

2024-03-26 16:28:12 20:28:12,416 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
2024-03-26 16:28:12 20:28:12,416 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
2024-03-26 16:28:12 20:28:12,417 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [documents_search]...
2024-03-26 16:28:12 20:28:12,419 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [documents_search]...
2024-03-26 16:28:12 20:28:12,421 DEBUG [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist.
2024-03-26 16:28:12 20:28:12,421 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist. Exiting as we are in silent mode or no input available.
dadoonet commented 6 months ago

Could you share the full logs and switch to trace mode?

shouari commented 6 months ago

How do I start trace mode? Here are all the logs I could find:

2024-03-26 16:47:22 20:47:22,025 WARN  [f.p.e.c.f.c.FsCrawlerCli] --debug option has been deprecated. Use FS_JAVA_OPTS="-DLOG_LEVEL=debug" instead.
2024-03-26 16:47:22 20:47:22,118 INFO  [f.console] ,----------------------------------------------------------------------------------------------------.
2024-03-26 16:47:22 |       ,---,.  .--.--.     ,----..                                     ,--,           2.10-SNAPSHOT |
2024-03-26 16:47:22 |     ,'  .' | /  /    '.  /   /   \                                  ,--.'|                         |
2024-03-26 16:47:22 |   ,---.'   ||  :  /`. / |   :     :  __  ,-.                   .---.|  | :               __  ,-.   |
2024-03-26 16:47:22 |   |   |   .';  |  |--`  .   |  ;. /,' ,'/ /|                  /. ./|:  : '             ,' ,'/ /|   |
2024-03-26 16:47:22 |   :   :  :  |  :  ;_    .   ; /--` '  | |' | ,--.--.       .-'-. ' ||  ' |      ,---.  '  | |' |   |
2024-03-26 16:47:22 |   :   |  |-, \  \    `. ;   | ;    |  |   ,'/       \     /___/ \: |'  | |     /     \ |  |   ,'   |
2024-03-26 16:47:22 |   |   :  ;/|  `----.   \|   : |    '  :  / .--.  .-. | .-'.. '   ' .|  | :    /    /  |'  :  /     |
2024-03-26 16:47:22 |   |   |   .'  __ \  \  |.   | '___ |  | '   \__\/: . ./___/ \:     ''  : |__ .    ' / ||  | '      |
2024-03-26 16:47:22 |   '   :  '   /  /`--'  /'   ; : .'|;  : |   ," .--.; |.   \  ' .\   |  | '.'|'   ;   /|;  : |      |
2024-03-26 16:47:22 |   |   |  |  '--'.     / '   | '/  :|  , ;  /  /  ,.  | \   \   ' \ |;  :    ;'   |  / ||  , ;      |
2024-03-26 16:47:22 |   |   :  \    `--'---'  |   :    /  ---'  ;  :   .'   \ \   \  |--" |  ,   / |   :    | ---'       |
2024-03-26 16:47:22 |   |   | ,'               \   \ .'         |  ,     .-./  \   \ |     ---`-'   \   \  /             |
2024-03-26 16:47:22 |   `----'                  `---`            `--`---'       '---"                `----'              |
2024-03-26 16:47:22 +----------------------------------------------------------------------------------------------------+
2024-03-26 16:47:22 |                                        You know, for Files!                                        |
2024-03-26 16:47:22 |                                     Made from France with Love                                     |
2024-03-26 16:47:22 |                           Source: https://github.com/dadoonet/fscrawler/                           |
2024-03-26 16:47:22 |                          Documentation: https://fscrawler.readthedocs.io/                          |
2024-03-26 16:47:22 `----------------------------------------------------------------------------------------------------'
2024-03-26 16:47:22 
2024-03-26 16:47:22 20:47:22,142 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [94mb/1.9gb=4.81%], RAM [165.1mb/7.6gb=2.11%], Swap [1.7gb/2gb=87.61%].
2024-03-26 16:47:22 20:47:22,144 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
2024-03-26 16:47:22 20:47:22,157 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
2024-03-26 16:47:22 20:47:22,157 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [documents_search]...
2024-03-26 16:47:22 20:47:22,159 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [documents_search]...
2024-03-26 16:47:22 20:47:22,160 DEBUG [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist.
2024-03-26 16:47:22 20:47:22,160 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist. Exiting as we are in silent mode or no input available.
dadoonet commented 6 months ago

Maybe try this:

command: fscrawler documents_search --trace --restart --rest
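
Note that, per the deprecation warning in the logs above, the command-line log flags are deprecated in favour of FS_JAVA_OPTS; an equivalent in the compose file would be:

    environment:
      - FS_JAVA_OPTS=-DLOG_LEVEL=trace
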
shouari commented 6 months ago

Here is the output for the trace command:

2024-03-27 08:56:31 12:56:31,820 WARN  [f.p.e.c.f.c.FsCrawlerCli] --trace option has been deprecated. Use FS_JAVA_OPTS="-DLOG_LEVEL=trace" instead.
2024-03-27 08:56:31 12:56:31,852 INFO  [f.console] ,----------------------------------------------------------------------------------------------------.
2024-03-27 08:56:31 |       ,---,.  .--.--.     ,----..                                     ,--,           2.10-SNAPSHOT |
2024-03-27 08:56:31 |     ,'  .' | /  /    '.  /   /   \                                  ,--.'|                         |
2024-03-27 08:56:31 |   ,---.'   ||  :  /`. / |   :     :  __  ,-.                   .---.|  | :               __  ,-.   |
2024-03-27 08:56:31 |   |   |   .';  |  |--`  .   |  ;. /,' ,'/ /|                  /. ./|:  : '             ,' ,'/ /|   |
2024-03-27 08:56:31 |   :   :  :  |  :  ;_    .   ; /--` '  | |' | ,--.--.       .-'-. ' ||  ' |      ,---.  '  | |' |   |
2024-03-27 08:56:31 |   :   |  |-, \  \    `. ;   | ;    |  |   ,'/       \     /___/ \: |'  | |     /     \ |  |   ,'   |
2024-03-27 08:56:31 |   |   :  ;/|  `----.   \|   : |    '  :  / .--.  .-. | .-'.. '   ' .|  | :    /    /  |'  :  /     |
2024-03-27 08:56:31 |   |   |   .'  __ \  \  |.   | '___ |  | '   \__\/: . ./___/ \:     ''  : |__ .    ' / ||  | '      |
2024-03-27 08:56:31 |   '   :  '   /  /`--'  /'   ; : .'|;  : |   ," .--.; |.   \  ' .\   |  | '.'|'   ;   /|;  : |      |
2024-03-27 08:56:31 |   |   |  |  '--'.     / '   | '/  :|  , ;  /  /  ,.  | \   \   ' \ |;  :    ;'   |  / ||  , ;      |
2024-03-27 08:56:31 |   |   :  \    `--'---'  |   :    /  ---'  ;  :   .'   \ \   \  |--" |  ,   / |   :    | ---'       |
2024-03-27 08:56:31 |   |   | ,'               \   \ .'         |  ,     .-./  \   \ |     ---`-'   \   \  /             |
2024-03-27 08:56:31 |   `----'                  `---`            `--`---'       '---"                `----'              |
2024-03-27 08:56:31 +----------------------------------------------------------------------------------------------------+
2024-03-27 08:56:31 |                                        You know, for Files!                                        |
2024-03-27 08:56:31 |                                     Made from France with Love                                     |
2024-03-27 08:56:31 |                           Source: https://github.com/dadoonet/fscrawler/                           |
2024-03-27 08:56:31 |                          Documentation: https://fscrawler.readthedocs.io/                          |
2024-03-27 08:56:31 `----------------------------------------------------------------------------------------------------'
2024-03-27 08:56:31 
2024-03-27 08:56:31 12:56:31,868 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [93.8mb/1.9gb=4.8%], RAM [874.1mb/7.6gb=11.18%], Swap [1.9gb/2gb=99.46%].
2024-03-27 08:56:31 12:56:31,870 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
2024-03-27 08:56:31 12:56:31,871 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
2024-03-27 08:56:31 12:56:31,871 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [documents_search]...
2024-03-27 08:56:31 12:56:31,872 TRACE [f.p.e.c.f.f.MetaFileHandler] Removing file _status.json from /root/.fscrawler/documents_search if exists
2024-03-27 08:56:31 12:56:31,872 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [documents_search]...
2024-03-27 08:56:31 12:56:31,873 TRACE [f.p.e.c.f.f.MetaFileHandler] Reading file _settings.yaml from /root/.fscrawler/documents_search
2024-03-27 08:56:31 12:56:31,874 TRACE [f.p.e.c.f.f.MetaFileHandler] Reading file _settings.json from /root/.fscrawler/documents_search
2024-03-27 08:56:31 12:56:31,874 DEBUG [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist.
2024-03-27 08:56:31 12:56:31,874 ERROR [f.p.e.c.f.c.FsCrawlerCli] job [documents_search] does not exist. Exiting as we are in silent mode or no input available.
dadoonet commented 5 months ago

Could you share again your _settings.yaml and your docker-compose file?

Thanks

shouari commented 5 months ago

Sure, here they are:

_settings.yaml

---
name: "documents_search"
fs:
  url: "C:\Users\shouari\Documents\Documentation_es\data"
  indexed_chars: 100%
  lang_detect: true
  continue_on_error: true
  ocr:
    language: "eng+fra"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "https://elasticsearch:9200"
  username: "elastic"
  password: "a123456"
  ssl_verification: false
rest:
  url: "http://fscrawler:8080"

docker-compose.yaml

version: "2.2"
services:

  setup:
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
    user: "0"
    command: >
      bash -c '
        if [ x${ELASTIC_PASSWORD} == x ]; then
          echo "Set the ELASTIC_PASSWORD environment variable in the .env file";
          exit 1;
        elif [ x${KIBANA_PASSWORD} == x ]; then
          echo "Set the KIBANA_PASSWORD environment variable in the .env file";
          exit 1;
        fi;
        if [ ! -f certs/ca.zip ]; then
          echo "Creating CA";
          bin/elasticsearch-certutil ca --silent --pem -out config/certs/ca.zip;
          unzip config/certs/ca.zip -d config/certs;
        fi;
        if [ ! -f certs/certs.zip ]; then
          echo "Creating certs";
          echo -ne \
          "instances:\n"\
          "  - name: elasticsearch\n"\
          "    dns:\n"\
          "      - elasticsearch\n"\
          "      - localhost\n"\
          "    ip:\n"\
          "      - 127.0.0.1\n"\
          > config/certs/instances.yml;
          bin/elasticsearch-certutil cert --silent --pem -out config/certs/certs.zip --in config/certs/instances.yml --ca-cert config/certs/ca/ca.crt --ca-key config/certs/ca/ca.key;
          unzip config/certs/certs.zip -d config/certs;
        fi;
        echo "Setting file permissions"
        chown -R root:root config/certs;
        find . -type d -exec chmod 750 \{\} \;;
        find . -type f -exec chmod 640 \{\} \;;
        echo "Waiting for Elasticsearch availability";
        until curl -s --cacert config/certs/ca/ca.crt https://elasticsearch:9200 | grep -q "missing authentication credentials"; do sleep 30; done;
        echo "Setting kibana_system password";
        until curl -s -X POST --cacert config/certs/ca/ca.crt -u elastic:${ELASTIC_PASSWORD} -H "Content-Type: application/json" https://elasticsearch:9200/_security/user/kibana_system/_password -d "{\"password\":\"${KIBANA_PASSWORD}\"}" | grep -q "^{}"; do sleep 10; done;
        echo "All done!";
      '
    healthcheck:
      test: ["CMD-SHELL", "[ -f config/certs/elasticsearch/elasticsearch.crt ]"]
      interval: 1s
      timeout: 5s
      retries: 120

  elasticsearch:
    depends_on:
      setup:
        condition: service_healthy
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
      - esdata:/usr/share/elasticsearch/data
    ports:
      - ${ES_PORT}:9200
    environment:
      - node.name=elasticsearch
      - cluster.name=${CLUSTER_NAME}
      - cluster.initial_master_nodes=elasticsearch
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - bootstrap.memory_lock=true
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=true
      - xpack.security.http.ssl.key=certs/elasticsearch/elasticsearch.key
      - xpack.security.http.ssl.certificate=certs/elasticsearch/elasticsearch.crt
      - xpack.security.http.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.http.ssl.verification_mode=certificate
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.key=certs/elasticsearch/elasticsearch.key
      - xpack.security.transport.ssl.certificate=certs/elasticsearch/elasticsearch.crt
      - xpack.security.transport.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.transport.ssl.verification_mode=certificate
      - xpack.license.self_generated.type=${LICENSE}
    mem_limit: ${MEM_LIMIT}
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s --cacert config/certs/ca/ca.crt https://localhost:9200 | grep -q 'missing authentication credentials'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120

  kibana:
    depends_on:
      elasticsearch:
        condition: service_healthy
    image: docker.elastic.co/kibana/kibana:${STACK_VERSION}
    volumes:
      - certs:/usr/share/kibana/config/certs
      - kibanadata:/usr/share/kibana/data
    ports:
      - ${KIBANA_PORT}:5601
    environment:
      - SERVERNAME=kibana
      - ELASTICSEARCH_HOSTS=https://elasticsearch:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD}
      - ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES=config/certs/ca/ca.crt
      - ENTERPRISESEARCH_HOST=http://localhost:${ENTERPRISE_SEARCH_PORT}
    mem_limit: ${MEM_LIMIT}
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s -I http://localhost:5601 | grep -q 'HTTP/1.1 302 Found'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120

  # FSCrawler
  fscrawler:
    image: dadoonet/fscrawler:$FSCRAWLER_VERSION
    container_name: fscrawler
    restart: always
    volumes:
      - ../../test-documents/src/main/resources/documents/:/tmp/es:ro
      - ${PWD}/config:/root/.fscrawler
      - ${PWD}/logs:/usr/share/fscrawler/logs
      - ${PWD}/external:/usr/share/fscrawler/external
    depends_on:
      elasticsearch:
         condition: service_healthy
    ports: 
      - ${FSCRAWLER_PORT}:8080
    command: fscrawler documents_search --trace --restart --rest
volumes:
  certs:
    driver: local
  # enterprisesearchdata:
  #   driver: local
  esdata:
    driver: local
  kibanadata:
    driver: local

dadoonet commented 5 months ago

Try this:

---
name: "documents_search"
fs:
  url: "/tmp/es"
  indexed_chars: 100%
  lang_detect: true
  continue_on_error: true
  ocr:
    language: "eng+fra"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "https://elasticsearch:9200"
  username: "elastic"
  password: "a123456"
  ssl_verification: false
rest:
  url: "http://fscrawler:8080"

And change this volume to mount your document folder instead:

- ../../test-documents/src/main/resources/documents/:/tmp/es:ro
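
For example, assuming the documents to index live in the data folder from your directory tree, that line could become:

      - ${PWD}/data:/tmp/es:ro

so the host folder shows up at /tmp/es, matching fs.url above. A plausible reading of the trace output: in a double-quoted YAML string, backslashes are escape characters, so url: "C:\Users\..." is not valid YAML; that would explain FSCrawler trying _settings.yaml, falling back to _settings.json, and then reporting that the job does not exist. In any case, a Windows host path is not visible inside the Linux container.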

If it does not work, please inspect your container and check that /root/.fscrawler has the documents_search dir, which contains the _settings.yaml file.
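
A quick way to check, using the container name from the compose file:

docker exec fscrawler ls /root/.fscrawler/documents_search
docker exec fscrawler cat /root/.fscrawler/documents_search/_settings.yaml

If the directory is missing, the first thing to suspect is the ${PWD}/config bind mount, for example docker compose being run from a directory other than the one that contains config.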

shouari commented 5 months ago

I still face the same issue.

I'll have a look at /root/.fscrawler as suggested and check for the documents_search folder.

If it does not exist, what might be the cause according to you?