Good question. I don't know yet. It would require running all of that in separate threads and having a timeout for each thread.
Sadly, no. That won't be robust against an infinite loop. You can't kill a thread, you can only ask it to stop politely and hope for the best. The only way to "timeout" an infinite loop is to kill the process.
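For illustration only, here is a minimal sketch of the "kill the process" approach in Java. The parser entry point (`com.example.ParserMain`), `app.jar`, the file path, and the 60-second timeout are all made-up values, and this is not how fscrawler is implemented today:

```java
import java.util.concurrent.TimeUnit;

public class ParseWithTimeout {
    public static void main(String[] args) throws Exception {
        // Run the (hypothetical) parser as a child JVM so that it can be killed.
        Process parser = new ProcessBuilder("java", "-cp", "app.jar",
                "com.example.ParserMain", "/path/to/huge-file.pdf")
                .inheritIO()
                .start();

        // Wait up to 60 seconds, then kill the whole child process if it is still running.
        if (!parser.waitFor(60, TimeUnit.SECONDS)) {
            parser.destroyForcibly();
            System.err.println("Parsing timed out; child process killed.");
        }
    }
}
```

The point is that `destroyForcibly()` terminates the whole child JVM even if the parser is stuck in an infinite loop, whereas interrupting a thread only works if the code cooperates.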
Happy to discuss if you have questions... See https://issues.apache.org/jira/browse/TIKA-456
We are facing similar issues as well: fscrawler gets stuck while indexing some documents that are around 4 GB in size.
Tika really doesn't work well with files of this size. Tika was originally designed to be streaming, but some file formats simply don't allow that. The best solution is the one you've already come to, which is to uncompress/unpack large container files: gz, zip, etc. as well as, e.g. pst/mbox...
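As a rough illustration of the unpack-first approach, here is a minimal sketch that extracts a zip into a separate directory before pointing fscrawler at it. The paths are placeholders, and other container formats (gz, pst, mbox) would need their own handling:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnpackZip {
    public static void main(String[] args) throws Exception {
        Path zip = Path.of("/data/inbox/archive.zip");   // placeholder paths
        Path outDir = Path.of("/data/crawl-me");

        try (ZipInputStream zis = new ZipInputStream(Files.newInputStream(zip))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                Path target = outDir.resolve(entry.getName()).normalize();
                // Basic zip-slip guard: skip entries that escape the output directory.
                if (!target.startsWith(outDir)) {
                    continue;
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    Files.copy(zis, target);
                }
            }
        }
    }
}
```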
It would be great if you could share the document that makes that happen in a new issue, so I can look at it.
The main problem with sharing the document is that I can't really identify which document is causing these issues. If I run the job in debug mode, it just creates a huge log file, which makes it very hard to find the problem.
@Ganesh2409 If you run it in debug mode, I think that close to the WARN line:

```
20:17:44,580 WARN [f.p.e.c.f.FsParserAbstract] Error while crawling \\servername\folder: integer overflow
```

you should also have a stack trace. It would help if you could share it.
Maybe you could just debug the FsParserAbstract class. See https://fscrawler.readthedocs.io/en/latest/admin/logger.html?highlight=logger
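If the full debug log gets too large, one option (assuming your install uses the standard log4j2 configuration described at the link above) is to enable debug only for the parser class rather than for the whole crawler, roughly like this; the `Console` appender name is just an example and should match whatever your config file already defines:

```xml
<!-- Sketch of a log4j2 Loggers section; adjust to the config shipped with your fscrawler version. -->
<Loggers>
  <!-- Debug only the FS parser instead of everything. -->
  <Logger name="fr.pilato.elasticsearch.crawler.fs.FsParserAbstract" level="debug"/>
  <Root level="warn">
    <AppenderRef ref="Console"/>
  </Root>
</Loggers>
```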
Sir, it does not search Hindi images and PDF documents. I have also set the language like below:

```yaml
ocr:
  language: "eng+hi"
  enabled: true
  path: "C:/Program Files/Tesseract-OCR"
  data_path: "C:/Program Files/Tesseract-OCR/tessdata"
  pdf_strategy: "ocr_and_text"
follow_symlinks: false
```

Kindly tell me.

Thanks and Regards,
Dinesh Rana
India
You may have to change the language to "eng+hin", since the code for Hindi is hin. You can look it up here: https://www.loc.gov/standards/iso639-2/php/code_list.php @dineshrana87
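For example, keeping the rest of your settings as they are, the ocr section would then look like this:

```yaml
ocr:
  language: "eng+hin"
  enabled: true
  path: "C:/Program Files/Tesseract-OCR"
  data_path: "C:/Program Files/Tesseract-OCR/tessdata"
  pdf_strategy: "ocr_and_text"
```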
We indexed 2 million documents into Elasticsearch using fscrawler, but the file count in Elasticsearch doesn't match the number of files in the share path. Is there a way to identify which files were not indexed?