dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

FSCrawler not indexing all the files #690

Open Ganesh-96 opened 5 years ago

Ganesh-96 commented 5 years ago

We indexed 2 million documents into Elasticsearch using FSCrawler, but the document count in Elasticsearch doesn't match the number of files in the share path. Is there a way to identify which files were not indexed?
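
One rough way to check this, assuming the default FSCrawler mapping (which stores the absolute path in a path.real field) and an index named after the job, is to pull the indexed paths out of Elasticsearch and diff them against the share path. A minimal sketch:

```python
# Sketch: diff files on disk against documents indexed by FSCrawler.
# Assumes the default FSCrawler mapping (path.real holds the absolute path)
# and that the index name matches the FSCrawler job name.
import os
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

ES_URL = "http://localhost:9200"    # adjust for your cluster
INDEX = "my_fscrawler_job"          # adjust to your job/index name
SHARE_PATH = r"\\servername\folder" # the crawled share

es = Elasticsearch(ES_URL)

# Collect every path.real value stored in the index.
indexed = set()
body = {"_source": ["path.real"], "query": {"match_all": {}}}
for hit in scan(es, index=INDEX, query=body):
    path = hit["_source"].get("path", {}).get("real")
    if path:
        indexed.add(os.path.normcase(path))

# Collect every file under the share path.
on_disk = set()
for root, _dirs, files in os.walk(SHARE_PATH):
    for name in files:
        on_disk.add(os.path.normcase(os.path.join(root, name)))

missing = on_disk - indexed
print(f"{len(missing)} files on disk but not in the index")
for path in sorted(missing):
    print(path)
```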

tballison commented 5 years ago

Good question. I don't know yet. It would require making all of that run in separate threads and having a timeout for each thread.

Sadly, no. That won't be robust against an infinite loop. You can't kill a thread, you can only ask it to stop politely and hope for the best. The only way to "timeout" an infinite loop is to kill the process.

Happy to discuss if you have questions... See https://issues.apache.org/jira/browse/TIKA-456
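
For what it's worth, the "kill the process" approach discussed above (and in the linked TIKA issue) can be illustrated with a minimal Python sketch, not FSCrawler or Tika code, that runs the parse in a child process and terminates it on timeout:

```python
# Sketch: run a potentially-hanging parse in a child process so it can be
# killed on timeout. Threads cannot be forcibly stopped; processes can.
import multiprocessing as mp

def parse_file(path, queue):
    # Placeholder for the real parsing work (e.g. handing the file to Tika).
    data = open(path, "rb").read()  # pretend this might loop forever
    queue.put(len(data))

def parse_with_timeout(path, timeout_seconds=60):
    queue = mp.Queue()
    worker = mp.Process(target=parse_file, args=(path, queue))
    worker.start()
    worker.join(timeout_seconds)
    if worker.is_alive():
        worker.terminate()      # the only reliable "timeout" for a stuck parse
        worker.join()
        raise TimeoutError(f"Parsing {path} exceeded {timeout_seconds}s")
    return queue.get()

if __name__ == "__main__":
    print(parse_with_timeout("some_document.pdf", timeout_seconds=60))
```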

tballison commented 5 years ago

We are facing similar issues too; fscrawler is getting stuck while indexing some documents that are around 4 GB in size.

Tika really doesn't work well with files of this size. Tika was originally designed to be streaming, but some file formats simply don't allow that. The best solution is the one you've already come to, which is to uncompress/unpack large container files: gz, zip, etc. as well as, e.g. pst/mbox...
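
As an illustration of that workaround (this is not an FSCrawler feature), a pre-processing pass could unpack the big containers into a staging directory and point the crawler at that instead. A rough sketch, handling only zip and gz:

```python
# Sketch: unpack large container files (zip, gz) into a staging directory
# before pointing FSCrawler at it, so Tika never has to open the containers.
import gzip
import os
import shutil
import zipfile

SOURCE_DIR = r"\\servername\folder"    # original share
STAGING_DIR = r"D:\fscrawler-staging"  # directory FSCrawler should crawl

for root, _dirs, files in os.walk(SOURCE_DIR):
    for name in files:
        src = os.path.join(root, name)
        rel = os.path.relpath(root, SOURCE_DIR)
        dest_dir = os.path.join(STAGING_DIR, rel)
        os.makedirs(dest_dir, exist_ok=True)
        if name.lower().endswith(".zip"):
            with zipfile.ZipFile(src) as zf:
                zf.extractall(os.path.join(dest_dir, name + ".extracted"))
        elif name.lower().endswith(".gz"):
            with gzip.open(src, "rb") as fin, \
                 open(os.path.join(dest_dir, name[:-3]), "wb") as fout:
                shutil.copyfileobj(fin, fout)
        else:
            shutil.copy2(src, dest_dir)
```

Formats like pst/mbox would need dedicated tools, but the idea is the same: hand FSCrawler the unpacked pieces rather than one multi-gigabyte container.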

Ganesh-96 commented 5 years ago

It'd be great if you could share the document that makes that happen in a new issue, so I can look at it.

The main problem with sharing the document is that I can't really identify which document is generating these issues. If I run the job in debug mode it just creates a huge log file, which makes it very hard to find the issue.

dadoonet commented 5 years ago

@Ganesh2409 If you run it in debug mode, I think that close to the WARN line:

20:17:44,580 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling \\servername\folder: integer overflow

you should also have a stack trace. It would help if you could share it.

Maybe you could just enable debug logging for the FsParserAbstract class. See https://fscrawler.readthedocs.io/en/latest/admin/logger.html?highlight=logger
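
For reference, and assuming the log4j2 configuration described on that page, a per-class logger entry keeps the output much smaller than full debug mode. A sketch (the logger name is expanded from the f.p.e.c.f.FsParserAbstract abbreviation in the WARN line above; the appender name depends on your configuration):

```xml
<!-- Sketch: debug output for the crawler class only, added to the Loggers
     section of the log4j2 config file referenced in the docs above. -->
<Loggers>
    <Logger name="fr.pilato.elasticsearch.crawler.fs.FsParserAbstract"
            level="debug" additivity="false">
        <AppenderRef ref="Console"/> <!-- "Console" is an assumed appender name -->
    </Logger>
    <!-- keep the existing Root logger as shipped -->
</Loggers>
```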

dineshrana87 commented 3 years ago

Sir, searching Hindi image and PDF documents does not work for me. I have also set the language as below:

ocr:
  language: "eng+hi"
  enabled: true
  path: "C:/Program Files/Tesseract-OCR"
  data_path: "C:/Program Files/Tesseract-OCR/tessdata"
  pdf_strategy: "ocr_and_text"
follow_symlinks: false

Kindly tell me how to fix this. Thanks and Regards, Dinesh Rana, India

sahin52 commented 2 years ago

@dineshrana87 You may have to change the language to "eng+hin", since the ISO 639-2 code for Hindi is "hin". You can look it up here: https://www.loc.gov/standards/iso639-2/php/code_list.php
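
Putting that together, the relevant part of the job settings might then look like this (a sketch assuming FSCrawler's YAML job settings with the OCR options nested under fs.ocr; adjust the paths for your Tesseract install):

```yaml
# Sketch of the corrected settings, assuming the fs.ocr layout of the
# FSCrawler job settings file.
fs:
  follow_symlinks: false
  ocr:
    enabled: true
    language: "eng+hin"   # ISO 639-2 code for Hindi is "hin", not "hi"
    path: "C:/Program Files/Tesseract-OCR"
    data_path: "C:/Program Files/Tesseract-OCR/tessdata"
    pdf_strategy: "ocr_and_text"
```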