dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Closing FS crawler / FS crawler thread is still running #1093

Open ian-cameron opened 3 years ago

ian-cameron commented 3 years ago

Describe the bug

When crawling a large directory, FS crawler sometimes stops with no error or stop message. No particular file seems to cause it; the job simply stops after a few hours, at a different place each time.

Job Settings

fs:
  url: "/mnt/projects/24"
  update_rate: "3h"
  indexed_chars: "-1"
  includes:
  - "*/*.pdf"
  - "*/*.doc"
  - "*/*.docx"
  - "*/*.ppt"
  - "*/*.pptx"
  - "*/*.odf"
  - "*/*.rtf"
  - "*/*.msg"
  - "*/*.pst"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: false
  lang_detect: false
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  ignore_above: 200mb
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: true

Logs

...
12:30:52,260 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [mydrive]
12:30:52,261 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is still running

Expected behavior

Normally, when it completes a crawl, the log ends with:

20:26:45,830 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
20:26:45,919 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [mydrive] stopped
20:26:45,920 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [mydrive] stopped

Versions:

Thank you

ian-cameron commented 3 years ago

For further information: with the same settings and versions on Windows, I have not experienced the issue. I have tried Windows 10 and a Windows Server 2012 VM. I'm using OpenJDK 15.0.2. On Linux, the directory is mounted with cifs-utils. My workaround for now is to run fscrawler on Windows when crawling a Windows share.

I will try to gather more helpful information and add it as a comment.

dadoonet commented 3 years ago

I'm seeing a similar behavior from time to time. Not sure what is hanging in that case. I wonder if it's related to OCR...
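
One way to test the OCR hypothesis, assuming the rest of the job settings stay exactly as posted above, would be to rerun the same crawl with OCR turned off and check whether it still stops partway. Only `enabled` changes in this fragment; it is a diagnostic sketch, not a confirmed fix:

```yaml
fs:
  ocr:
    language: "eng"
    # Changed from true for this test run only, to rule OCR in or out
    enabled: false
    pdf_strategy: "ocr_and_text"
```

If the crawl then finishes with the usual "FS crawler [mydrive] stopped" lines, OCR becomes the more likely suspect; if it still stops silently, the cause is probably elsewhere, for example in the CIFS mount.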