Ganesh-96 opened 5 years ago
There is no easy way to do this. But I believe that some documents are not indexed because they have been rejected. So you should see that in the FSCrawler logs.
I did not find any file names in the log files, so I tried debug mode as well, but the log file is around 12 GB. Is there any sort of pattern for the rejected files/error messages in the debug logs?
You should see a `WARN` or `ERROR` message, I think. It should not really be part of the `DEBUG` level, unless you asked to ignore errors?
Yes, in the settings file I have set `continue_on_error` to `true`. I do see some error messages in the logs but couldn't tell which files these errors relate to.
```
11:02:18,821 WARN [o.a.p.p.f.PDTrueTypeFont] Could not read embedded TTF for font IOFIMH+Arial java.io.IOException: Error:TTF.loca unknown offset format.
11:02:18,839 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSName{TT19}]
11:02:19,273 WARN [o.a.p.p.f.PDTrueTypeFont] Could not read embedded TTF for font AIRSWE+Arial,Bold
```
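(For reference, the setting mentioned above lives in the job settings file. A minimal sketch only, with the job name and path made up:)

```json
{
  "name": "my_job",
  "fs": {
    "url": "/path/to/share",
    "continue_on_error": true
  }
}
```

With `continue_on_error` set to `true`, FSCrawler keeps crawling after a document fails instead of aborting the job, which is also why rejected files can slip by without an obvious log entry.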
Ha, I see... Something I should maybe fix, if not already fixed in the latest SNAPSHOT. Can't recall off the top of my head.
Because of the `continue_on_error` setting, it probably does not WARN about that. I need to check the code and see if I can be smarter here.
Okay, we are currently using version 2.6. Is there a way to tie the actual object (file/folder) name to the error messages?
I think that #675 fixes that. If you download the latest 2.7-SNAPSHOT, that should be part of it. See https://fscrawler.readthedocs.io/en/latest/installation.html
Okay, I will try to see if we can upgrade to the latest version. I have one other query. As mentioned in the documentation, I have manually downloaded the jai-imageio-core-1.3.0.jar and jai-imageio-jpeg2000-1.3.0.jar files and added them to the lib directory. But I keep getting the same warning every time:
> J2KImageReader not loaded.
09:20:22,316 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
Could you share a file that produces that warning?
This message comes up the moment any index job is started.
Ha right. I'll try it then.
Okay, then I will wait for an update from you. One quick question: with 2.7, does it print the path of the file along with the name? There are cases where multiple files have the same name but are in different paths.
From what I recall, it's the full path. But give it a try. I'd love to get feedback on 2.7.
sure, will give it a try..
I did a test run using the 2.7 snapshot version. Now I can see the error along with the file name in the log file. However, it shows only the file name, not the path.
Thanks. I will change that if possible. Stay tuned.
@Ganesh2409 I merged #694 which will give you the full filename. It should be available in the oss Sonatype snapshot repository in less than one hour if you'd like to try it.
All the error messages (WARN) with file names that I have seen in the log files are related to parsing errors. I can see these files in the indexes, though. All these files have no content in the indexes, but the file properties are indexed. I also see some ERROR messages in the logs, but these errors don't mention the file names. Error Messages.txt
> All these files have no content in the indexes, but the file properties are indexed.
That's the effect of `continue_on_error`.
> I see some ERROR messages in the logs, but these errors don't mention the file names.
Hmmm... You don't have any other messages than that?
I mean between those 2 lines:
Line 111196: 23:03:40,800 ERROR [o.a.p.c.PDFStreamEngine] Operator cm has too few operands: [COSInt{0}, COSInt{0}]
Line 307748: 04:04:56,817 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSFloat{6.48}]
All other messages are WARN and INFO only. Below are a couple of lines before the ERROR messages:
```
04:04:56,744 WARN [o.a.p.p.f.PDTrueTypeFont] Using fallback font 'TimesNewRomanPSMT' for 'LRMSER+63shhiibsqwrsad'
04:04:56,756 WARN [o.a.p.p.f.PDTrueTypeFont] Using fallback font 'TimesNewRomanPSMT' for 'FORKEV+63shhiibsqwrsad'
04:04:56,817 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSFloat{6.48}]
04:04:56,817 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSFloat{2903.48}]
```
Would it be possible to share a document that reproduces it, even if you remove most of its content?
The problem is I am not sure which document these errors are coming from. Based on the documents and the data they contain, I can try and see whether I can share a document or not.
Ha, I see... The only way at this time would be to activate the `--debug` option and see... Not ideal, for sure...
I checked with my team and we cannot share any of the files as they contain sensitive information.
Is there any way we can ignore the warnings during parsing and index the content?
I believe the warning means that Tika is not able to extract the content. So I'd assume that it's not possible.
Oh okay, so now we have come across a new scenario altogether (content not being indexed), but our original issue is still the same: the errors we have seen are related to content parsing only, yet the file is present in the index. So we still have the main issue.
Is it possible to print the file names for the ERROR messages as well? Right now I can see the file names for WARN messages only.
It could probably be possible, but I'd need to be able to reproduce the problem if I want to fix it. Without a document which generates this error, it is hard to guess where I should put the code. Specifically, the error seems to be printed by Tika code and not by FSCrawler code, so I'm unsure I can catch something which is not thrown.
I think I could test if `content` is `null` and add a `warn_on_null_content` option, maybe...
Unfortunately I cannot share the documents.
One more issue we are seeing: a couple of jobs are getting stuck for a couple of days. There is no change in the document count in the indexes, and no logs are getting printed. Probably not an issue, but it would be helpful to know.
Any input on the above issue? Currently the job has been stuck for more than a day. The file where it got stuck is 2 GB with a .gz extension. This issue occurs only when `"indexed_chars": "-1"` is set.
We are facing similar issues too: FSCrawler is getting stuck while indexing some documents which are around 4 GB in size.
> Any input on the above issue? Currently the job has been stuck for more than a day. The file where it got stuck is 2 GB with a .gz extension. This issue occurs only when `"indexed_chars": "-1"` is set.
@Ganesh2409 How much memory did you assign to FSCrawler? I mean, it will probably require a lot of memory to unzip and parse every single piece of content.
Ideally you should unzip the files in your directory and let FSCrawler index smaller files.
One of the features I could maybe implement would be to unzip files into a tmp dir, index that content, then remove the dir... An optional setting of course, like `unzip_above: 100mb` for example.
WDYT? Would that help? It requires a bit of thinking, introducing new settings like an fscrawler_tmp dir... Probably not that quick to implement.
Another workaround would be to exclude big files with the `ignore_above` setting.
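(A sketch of that workaround in the same JSON settings format used earlier in this thread; the size threshold is just an example:)

```json
{
  "fs": {
    "ignore_above": "512mb"
  }
}
```

Files larger than the threshold are skipped entirely rather than handed to the parser, so a huge archive can no longer hang the job.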
@Bhanuji95 What kind of file is it?
The server has 32 GB of memory and is only used for FSCrawler. I haven't configured any memory specifically for FSCrawler.
Can we have some timeout setting to skip the current file and continue with indexing if there are no updates or it is not able to index that file, instead of waiting for it to finish? Currently I have to stop the job as it is in a hung state.
It is a .7z file
@Ganesh2409 Read https://fscrawler.readthedocs.io/en/latest/admin/jvm-settings.html and assign much more memory to FSCrawler, like 16 GB maybe. I'll be happy to hear if this is getting better.
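(Per the linked page, the heap is set through the `FS_JAVA_OPTS` environment variable; a sketch with an illustrative 16 GB heap and a hypothetical job name:)

```sh
# Allocate a 16 GB heap to the FSCrawler JVM before starting the job
FS_JAVA_OPTS="-Xms16g -Xmx16g" bin/fscrawler my_job
```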
> Can we have some timeout setting to skip the current file and continue with indexing if there are no updates or it is not able to index that file, instead of waiting for it to finish? Currently I have to stop the job as it is in a hung state.
Good question. I don't know yet. It would require making all of that run in separate threads and having a timeout for each thread. That's something I have in mind for the future (running in an async mode), but it's not there yet.
Would you mind opening a separate feature request like "Add extraction timeout" or something like that?
@Bhanuji95 so the same answer I gave https://github.com/dadoonet/fscrawler/issues/690#issuecomment-471174852 applies.
> One of the features I could maybe implement would be to unzip files into a tmp dir, index that content, then remove the dir... An optional setting of course, like `unzip_above: 100mb` for example.
Hmmm. I looked at the Tika source code and it seems that Tika actually uses a tmp dir to extract data.
@tballison Could you confirm that?
> @Ganesh2409 Read https://fscrawler.readthedocs.io/en/latest/admin/jvm-settings.html and assign much more memory to FSCrawler, like 16 GB maybe. I'll be happy to hear if this is getting better.
Sure, I will give it a try.
> Can we have some timeout setting to skip the current file and continue with indexing if there are no updates or it is not able to index that file, instead of waiting for it to finish? Currently I have to stop the job as it is in a hung state.

> Good question. I don't know yet. It would require making all of that run in separate threads and having a timeout for each thread. That's something I have in mind for the future (running in an async mode), but it's not there yet. Would you mind opening a separate feature request like "Add extraction timeout" or something like that?
Sure, I can do this, but I haven't done it before.
I can see the file properties when there are some parsing errors, but for large files it is getting stuck. So if a file's content can't be indexed, can we get the file properties indexed alone?
For folder indexes, we are getting only the path details in the indexes. Are there any options to get the last modified date as well, as we get in the files index?
@Ganesh2409 It does not exist. I don't think I'd like to support it, as the way I'm designing the next version will remove the folder index altogether.
So there won't be any information about the folders indexed in the future release, or will we have those details in the files index?
Got a new error while trying to index the full content (`"indexed_chars": "-1"`) of the files.
20:17:44,580 WARN [f.p.e.c.f.FsParserAbstract] Error while crawling \\servername\folder: integer overflow
and it stopped even though `continue_on_error` is set to `true`.
> it got stopped
You mean that FSCrawler process exited?
Yes..
It'd be great if you could share the document that makes that happen in a new issue, so I can look at it.
> Hmmm. I looked at the Tika source code and it seems that Tika actually uses a tmp dir to extract data.
Yes, various parsers create tmp files quite often.
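(If those tmp files become a disk-space concern while extracting multi-GB archives, the JVM's tmp directory can be redirected via the standard `java.io.tmpdir` system property, again through `FS_JAVA_OPTS`; the path and job name below are hypothetical:)

```sh
# Point JVM temp files (including Tika's extraction scratch space) at a bigger volume
FS_JAVA_OPTS="-Djava.io.tmpdir=/data/fscrawler_tmp" bin/fscrawler my_job
```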
We indexed 2 million documents into Elasticsearch using FSCrawler, but the file count in Elasticsearch doesn't match the number of files in the share path. Is there a way to identify which files were not indexed?
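(One way to approach this outside FSCrawler, as a sketch rather than a built-in feature: export the indexed paths from Elasticsearch, for example by scrolling over the `path.real` field of each document, walk the share for the on-disk paths, and diff the two sets. The helper below covers only the comparison step; `on_disk` and `in_index` stand in for those two exports:)

```python
def find_unindexed(local_paths, indexed_paths):
    """Return paths that exist on disk but are missing from the index, sorted."""
    return sorted(set(local_paths) - set(indexed_paths))

# Tiny illustration with made-up UNC paths
on_disk = [r"\\server\share\a.pdf", r"\\server\share\b.pdf", r"\\server\share\c.pdf"]
in_index = [r"\\server\share\a.pdf", r"\\server\share\c.pdf"]
missing = find_unindexed(on_disk, in_index)  # the files that were never indexed
```

For 2 million documents you would page through the index with the scroll or search_after APIs rather than a single search, but the set difference at the end is the same.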