dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

FSCrawler not indexing all the files #690

Open Ganesh-96 opened 5 years ago

Ganesh-96 commented 5 years ago

We indexed 2 million documents into Elasticsearch using FSCrawler, but the file count in Elasticsearch doesn't match the number of files in the share path. Is there a way to identify which files were not indexed?

dadoonet commented 5 years ago

There is no easy way to do this. But I believe that some documents are not indexed because they have been rejected, so you should see that in the FSCrawler logs.

Ganesh-96 commented 5 years ago

I did not find any file names in the log files, so I tried debug mode as well, but the log file is around 12 GB. Is there any sort of pattern for the rejected files/error messages in the debug logs?

dadoonet commented 5 years ago

You should see a WARN or ERROR message, I think. It should not really be part of the DEBUG level unless you configured it to ignore errors?

Ganesh-96 commented 5 years ago

Yes, in the settings file I have set "continue_on_error" to true. I do see some error messages in the logs but couldn't tell which files these errors relate to.

11:02:18,821 WARN  [o.a.p.p.f.PDTrueTypeFont] Could not read embedded TTF for font IOFIMH+Arial java.io.IOException: Error:TTF.loca unknown offset format.
11:02:18,839 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSName{TT19}]
11:02:19,273 WARN  [o.a.p.p.f.PDTrueTypeFont] Could not read embedded TTF for font AIRSWE+Arial,Bold
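
For reference, this is roughly how that setting sits in the job's _settings.yaml (a minimal sketch assuming the YAML settings format; the job name and share path below are placeholders, not our real values):

```yaml
# Sketch of ~/.fscrawler/my_job/_settings.yaml (name and path are placeholders)
name: "my_job"
fs:
  url: "//servername/folder"   # the share path being crawled
  continue_on_error: true      # keep crawling even when a file fails to index
```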

dadoonet commented 5 years ago

Ha, I see... Something I should maybe fix if it's not already fixed in the latest SNAPSHOT. I can't recall off the top of my head.

Because of the continue_on_error setting, it probably does not WARN about that. I need to check the code and see if I can be smarter here.

Ganesh-96 commented 5 years ago

Okay, we are currently using version 2.6. Is there a way to tie the error messages to the actual object (file/folder) name?

dadoonet commented 5 years ago

I think that #675 fixes that. If you download the latest 2.7-SNAPSHOT, that should be part of it. See https://fscrawler.readthedocs.io/en/latest/installation.html

Ganesh-96 commented 5 years ago

Okay, I will see if we can upgrade to the latest version. I have one other query: as mentioned in the documentation, I manually downloaded the jai-imageio-core-1.3.0.jar and jai-imageio-jpeg2000-1.3.0.jar files and added them to the lib directory, but I keep getting the same warning every time: J2KImageReader not loaded.

09:20:22,316 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.

dadoonet commented 5 years ago

Could you share a file that produces that warning?

Ganesh-96 commented 5 years ago

This message comes up the moment any index job is started.

dadoonet commented 5 years ago

Ha right. I'll try it then.

Ganesh-96 commented 5 years ago

Okay, then I will wait for an update from you. One quick question: with 2.7, does it print the path of the file along with the name? There are cases where multiple files have the same name but are in different paths.

dadoonet commented 5 years ago

From what I recall it's the full path. But give it a try. I'd love to get feedback for 2.7

Ganesh-96 commented 5 years ago

Sure, will give it a try.

Ganesh-96 commented 5 years ago

I did a test run using the 2.7 snapshot version. Now I can see the error along with the file name in the log file, but it shows only the file name, not the path.

dadoonet commented 5 years ago

Thanks. I will change that if possible. Stay tuned.

dadoonet commented 5 years ago

@Ganesh2409 I merged #694 which will give you the full filename. It should be available in the oss Sonatype snapshot repository in less than one hour if you'd like to try it.

Ganesh-96 commented 5 years ago

All the error messages (WARN) with file names that I have seen in the log files are related to parsing errors. I can see these files in the indexes, though. All these files have no content in the indexes, but the file properties are indexed. I also see some ERROR messages in the logs, but these errors don't mention the file names. Error Messages.txt

dadoonet commented 5 years ago

All these files have no content in the indexes, but the file properties are indexed.

That's the effect of continue_on_error.

I also see some ERROR messages in the logs, but these errors don't mention the file names.

Hmmm... You don't have any other messages than that?

I mean between those 2 lines:

Line 111196: 23:03:40,800 ERROR [o.a.p.c.PDFStreamEngine] Operator cm has too few operands: [COSInt{0}, COSInt{0}]
Line 307748: 04:04:56,817 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSFloat{6.48}]

Ganesh-96 commented 5 years ago

All the others are WARN and INFO messages only. Below are a couple of lines from just before the ERROR messages.

04:04:56,744 WARN  [o.a.p.p.f.PDTrueTypeFont] Using fallback font 'TimesNewRomanPSMT' for 'LRMSER+63shhiibsqwrsad'
04:04:56,756 WARN  [o.a.p.p.f.PDTrueTypeFont] Using fallback font 'TimesNewRomanPSMT' for 'FORKEV+63shhiibsqwrsad'
04:04:56,817 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSFloat{6.48}]
04:04:56,817 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSFloat{2903.48}]

dadoonet commented 5 years ago

Would it be possible to share a document that reproduces this, even if you remove most of its content?

Ganesh-96 commented 5 years ago

The problem is that I am not sure which document these errors are coming from. Depending on the document and the data it contains, I can try and see whether I can share it or not.

dadoonet commented 5 years ago

Ha, I see... The only way at this time would be to activate the --debug option and see... Not ideal, for sure...

Ganesh-96 commented 5 years ago

I checked with my team and we cannot share any of the files, as they contain sensitive information.

Ganesh-96 commented 5 years ago

Is there any way we can ignore the warnings during parsing and still index the content?

dadoonet commented 5 years ago

I believe the warning means that Tika is not able to extract the content. So I'd assume that it's not possible.

Ganesh-96 commented 5 years ago

Oh OK, so now we have come across a new scenario altogether (content not being indexed), but our original issue is still the same: the errors we have seen relate to content parsing only, and those files are present in the index. So we still have the main issue of files missing from the index.

Ganesh-96 commented 5 years ago

Is it possible to print the file names for the ERROR messages as well? Right now I can see the file names for WARN messages only.

dadoonet commented 5 years ago

It could probably be possible, but I'd need to be able to reproduce the problem if I want to fix it. Without a document that generates this error, it's hard to guess where I should put the code. Especially since the error seems to be printed by Tika code and not by FSCrawler code, I'm unsure I can catch something which is not thrown. I think I could test whether the content is null and add a warn_on_null_content option, maybe...

Ganesh-96 commented 5 years ago

Unfortunately I cannot share the documents.

Ganesh-96 commented 5 years ago

One more issue we are seeing is that a couple of jobs are getting stuck for a couple of days. There is no change in the document count in the indexes and no logs are getting printed. Probably not a bug, but it would be helpful to know.

Ganesh-96 commented 5 years ago

Any inputs on the above issue? Currently the job has been stuck for more than a day. The file where it got stuck is 2 GB with a gz extension. This issue occurs only when "indexed_chars": "-1" is set.
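
For reference, this is how we currently enable full-content extraction in the job's _settings.yaml (a hedged sketch; the job name and share path are placeholders):

```yaml
# Sketch of the current job settings (name and path are placeholders)
name: "my_job"
fs:
  url: "//servername/folder"
  indexed_chars: "-1"   # -1 removes the extracted-characters limit, so large files are fully parsed
```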

Bhanuji95 commented 5 years ago

We are facing similar issues as well; FSCrawler is getting stuck while indexing some documents which are around 4 GB in size.

dadoonet commented 5 years ago

Any inputs on the above issue? Currently the job has been stuck for more than a day. The file where it got stuck is 2 GB with a gz extension. This issue occurs only when "indexed_chars": "-1" is set.

@Ganesh2409 How much memory did you assign to FSCrawler? It will probably require a lot of memory to unzip and parse every single piece of content. Ideally you should unzip the files in your directory and let FSCrawler index the smaller files. One of the features I may implement would be to unzip files in a tmp dir, index that content, then remove the dir... An optional setting of course, like unzip_above: 100mb for example. WDYT? Would that help? It requires a bit of thinking, introducing new settings like an fscrawler_tmp dir... Probably not that quick to implement. Another workaround would be to exclude big files with the ignore_above setting, as sketched below.
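
A hedged sketch of that last workaround in _settings.yaml (the 100mb threshold and names are purely illustrative):

```yaml
# Sketch: skip very large files instead of trying to extract them fully
name: "my_job"
fs:
  url: "//servername/folder"
  indexed_chars: "-1"
  ignore_above: "100mb"   # files larger than this are not indexed at all
```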

@Bhanuji95 What kind of file is it?

Ganesh-96 commented 5 years ago

The server has 32 GB of memory and it is only used for FSCrawler. I haven't configured any memory specifically for FSCrawler.

Ganesh-96 commented 5 years ago

Can we have some timeout setting to skip the current file and continue with indexing if there are no updates or the file can't be indexed, instead of waiting for it to finish? Currently I have to stop the job, as it is in a hung state.

Bhanuji95 commented 5 years ago

It is a .7z file

dadoonet commented 5 years ago

@Ganesh2409 Read https://fscrawler.readthedocs.io/en/latest/admin/jvm-settings.html and give FSCrawler much more memory, like 16 GB maybe. I'd be happy to hear if things get better.

Can we have some timeout setting to skip the current file and continue with indexing if there are no updates or the file can't be indexed, instead of waiting for it to finish? Currently I have to stop the job, as it is in a hung state.

Good question. I don't know yet. It would require making all of that run in separate threads and having a timeout for each thread. That's something I have in mind for the future (running in an async mode), but it's not there yet.

Would you mind opening a separate feature request like "Add extraction timeout" or something like that?

dadoonet commented 5 years ago

@Bhanuji95 So the same answer I gave in https://github.com/dadoonet/fscrawler/issues/690#issuecomment-471174852 applies.

dadoonet commented 5 years ago

One of the features I may implement would be to unzip files in a tmp dir, index that content, then remove the dir... An optional setting of course, like unzip_above: 100mb for example.

Hmmm. I looked at the Tika source code and it seems that Tika is actually using a tmp dir to extract data.

See https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/io/TemporaryResources.java

@tballison Could you confirm that?

Ganesh-96 commented 5 years ago

@Ganesh2409 Read https://fscrawler.readthedocs.io/en/latest/admin/jvm-settings.html and give FSCrawler much more memory, like 16 GB maybe. I'd be happy to hear if things get better.

Sure, I will give it a try.

Can we have some timeout setting to skip the current file and continue with indexing if there are no updates or the file can't be indexed, instead of waiting for it to finish? Currently I have to stop the job, as it is in a hung state.

Good question. I don't know yet. It would require making all of that run in separate threads and having a timeout for each thread. That's something I have in mind for the future (running in an async mode), but it's not there yet.

Would you mind opening a separate feature request like "Add extraction timeout" or something like that?

Sure, I can do this, but I haven't done it before.

Ganesh-96 commented 5 years ago

I can see the file properties when there are parsing errors, but for large files it is getting stuck. So if a file's content can't be indexed, can we get just the file properties indexed?

Ganesh-96 commented 5 years ago

For the folder index, we are getting only the path details. Are there any options to get the last modified date as well, like we get in the files index?

dadoonet commented 5 years ago

@Ganesh2409 It does not exist. I don't think I'd like to support it, as the way I'm designing the next version will remove the folder index altogether.

Ganesh-96 commented 5 years ago

So there won't be any information about the indexed folders in the future release, or will we have those details in the files index?

Ganesh-96 commented 5 years ago

Got a new error while trying to index the full content ("indexed_chars": "-1") of the files.

20:17:44,580 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling \\servername\folder: integer overflow

and it got stopped even though continue_on_error is set to true.

dadoonet commented 5 years ago

it got stopped

You mean that the FSCrawler process exited?

Ganesh-96 commented 5 years ago

Yes.

dadoonet commented 5 years ago

It would be great if you could share the document that makes that happen in a new issue, so I can look at it.

tballison commented 5 years ago

Hmmm. I looked at the Tika source code and it seems that Tika is actually using a tmp dir to extract data.

Yes, various parsers create tmp files quite often.