dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0
1.34k stars 297 forks

Error "ArithmeticException: integer overflow" while Crawling #890

Open Neel-Gagan opened 4 years ago

Neel-Gagan commented 4 years ago

FSCrawler 2.6, Elasticsearch 6.8, Kibana 6.8

While crawling .mdb files of around 500 MB, I am getting the error:

Error while crawling C:\Data: integer overflow
10:25:13,992 WARN [f.p.e.c.f.FsParserAbstract] Full stacktrace
java.lang.ArithmeticException: integer overflow

1. Is the integer overflow error related to the data within the file, or is there a way to bypass it?
2. One more issue while crawling: I am getting the error "Please set stored:true field on [file.filename]".

What needs to be done to resolve these errors?

dadoonet commented 4 years ago

@Neel-Gagan Could you use 2.7-SNAPSHOT? Could you run the job again on just this file (a folder containing only this file) with the --debug option, so we have a chance to see the full stacktrace?

  1. I believe it's something sent by Tika. There's the continue_on_error setting which should just skip the file, but maybe I'm not catching the right thing. What is happening when you get this error? Is FSCrawler stopping?
  2. The mapping is incorrect. file.filename field should be a stored field.
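For reference, the relevant part of the mapping should mark file.filename as stored, as in the fragment below (a minimal sketch: only this one field is shown, and the exact layout depends on your Elasticsearch version; FSCrawler normally generates the full mapping itself).

```json
{
  "mappings": {
    "properties": {
      "file": {
        "properties": {
          "filename": {
            "type": "keyword",
            "store": true
          }
        }
      }
    }
  }
}
```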
Neel-Gagan commented 4 years ago

1. Yes, the crawler stops after this error.

2. The mapping seems correct, since the error appears midway through the crawl, after a few files have already been indexed.

dadoonet commented 4 years ago
  1. Are you using the continue_on_error setting?
  2. No. I don't think it is. Could you share it?
Neel-Gagan commented 4 years ago

1. continue_on_error="true" is present in FSCrawler's _settings.json file.

2. I have many folders and I am crawling from the root folder. After crawling a few of them, I get the error: Please set stored:true field on [file.filename].
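Note that continue_on_error is a boolean, not a string, and lives under the fs section of the job's _settings.json. A minimal sketch (the job name and url here are placeholders, not taken from this thread):

```json
{
  "name": "f_mi",
  "fs": {
    "url": "C:\\Data",
    "continue_on_error": true
  }
}
```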

dadoonet commented 4 years ago
  1. I need you to use the latest SNAPSHOT, run again with --debug and share the full stacktrace.
  2. I know how FSCrawler works behind the scenes, and why this message only appears after some documents have already been crawled. You need this setting in your mapping; check it. If you have it, please share the current mapping so I can double-check. If you don't, delete the index, install the latest FSCrawler version, create the job file again, and start FSCrawler again.
Neel-Gagan commented 4 years ago

The crawling is being performed on existing indices moved from another system. The index is up, and in the mapping the stored flag on file.filename is set. I can't figure out why this error appears midway through crawling.

Neel-Gagan commented 4 years ago

On a fresh installation of FSCrawler 2.7, with a new job f_mi, I got the error below while crawling a database file of 1.8 GB. Here is the trace:

{
  "file" : {
    "extension" : "accdb",
    "content_type" : "application/x-msaccess",
    "created" : "2018-07-09T09:22:54.911+0000",
    "last_modified" : "2018-08-01T08:47:11.035+0000",
    "last_accessed" : "2020-04-09T12:49:17.016+0000",
    "indexing_date" : "2020-06-09T08:02:00.993+0000",
    "filesize" : 1274957824,
    "filename" : "Database4.accdb",
    "url" : "file://F:\\Test Data\\Database4.accdb"
  },
  "path" : {
    "root" : "a9f7b81422814a76439be45c7e2281",
    "virtual" : "/Test Data/Database4.accdb",
    "real" : "F:\\Test Data\\Database4.accdb"
  }
}
13:33:07,150 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling F:\Test Data: integer overflow
13:33:07,150 WARN  [f.p.e.c.f.FsParserAbstract] Full stacktrace
java.lang.ArithmeticException: integer overflow
    at java.lang.Math.multiplyExact(Unknown Source) ~[?:1.8.0_171]
    at org.apache.lucene.util.UnicodeUtil.maxUTF8Length(UnicodeUtil.java:618) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
    at org.apache.lucene.util.BytesRef.<init>(BytesRef.java:84) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
    at org.elasticsearch.common.bytes.BytesArray.<init>(BytesArray.java:32) ~[elasticsearch-6.6.0.jar:6.6.0]
    at org.elasticsearch.action.index.IndexRequest.source(IndexRequest.java:357) ~[elasticsearch-6.6.0.jar:6.6.0]
    at fr.pilato.elasticsearch.crawler.fs.client.v6.ElasticsearchClientV6.index(ElasticsearchClientV6.java:375) ~[fscrawler-elasticsearch-client-v6-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.esIndex(FsParserAbstract.java:577) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:479) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:267) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_171]
13:33:07,154 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
13:33:07,201 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [f_mi]
13:33:07,201 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
13:33:07,201 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV6] Closing Elasticsearch client manager
13:33:07,201 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
13:33:07,201 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [f_mi] stopped
13:33:07,201 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [f_mi]
13:33:07,205 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
13:33:07,205 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV6] Closing Elasticsearch client manager
13:33:07,205 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
13:33:07,205 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [f_mi] stopped
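The overflow itself comes from the top frames of the trace: Lucene's UnicodeUtil.maxUTF8Length multiplies the UTF-16 length of the source by 3 (the maximum UTF-8 bytes per character) via Math.multiplyExact, so any extracted content longer than Integer.MAX_VALUE / 3 (roughly 715 million characters) throws exactly this ArithmeticException. A minimal sketch mirroring that arithmetic (the class name and constant here are illustrative, not the actual Lucene implementation):

```java
public class OverflowDemo {
    // Lucene's worst-case bound: up to 3 UTF-8 bytes per UTF-16 char
    static final int MAX_UTF8_BYTES_PER_CHAR = 3;

    // Mirrors the multiplication done by UnicodeUtil.maxUTF8Length:
    // Math.multiplyExact throws ArithmeticException("integer overflow")
    // instead of silently wrapping around.
    static int maxUtf8Length(int utf16Length) {
        return Math.multiplyExact(utf16Length, MAX_UTF8_BYTES_PER_CHAR);
    }

    public static void main(String[] args) {
        // A normal-sized document is fine:
        System.out.println(maxUtf8Length(1_000_000)); // prints 3000000

        // Anything past Integer.MAX_VALUE / 3 (~715M chars) overflows,
        // which is plausible for text extracted from a multi-GB database file:
        try {
            maxUtf8Length(800_000_000);
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage()); // prints "integer overflow"
        }
    }
}
```

This suggests the failure is triggered by the sheer size of the content Tika hands back, not by anything malformed inside the file.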
Neel-Gagan commented 4 years ago

Mapping is incorrect: please set stored: true on field

Neel-Gagan commented 4 years ago

A gentle reminder regarding the raised issue.

dadoonet commented 4 years ago

Mapping is incorrect: please set stored: true on field

That's the other story we are tracking with #937. Let's not mix the problems.

This one is very interesting:

java.lang.ArithmeticException: integer overflow
    at java.lang.Math.multiplyExact(Unknown Source) ~[?:1.8.0_171]
    at org.apache.lucene.util.UnicodeUtil.maxUTF8Length(UnicodeUtil.java:618) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
    at org.apache.lucene.util.BytesRef.<init>(BytesRef.java:84) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
    at org.elasticsearch.common.bytes.BytesArray.<init>(BytesArray.java:32) ~[elasticsearch-6.6.0.jar:6.6.0]
    at org.elasticsearch.action.index.IndexRequest.source(IndexRequest.java:357) ~[elasticsearch-6.6.0.jar:6.6.0]
    at fr.pilato.elasticsearch.crawler.fs.client.v6.ElasticsearchClientV6.index(ElasticsearchClientV6.java:375) ~[fscrawler-elasticsearch-client-v6-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.esIndex(FsParserAbstract.java:577) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:479) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:267) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_171]

But it was with elasticsearch 6.6.0 client. I'd like you to upgrade Elasticsearch to the latest 6.8.10 and use the latest SNAPSHOT build for FSCrawler - es6.

Just to see if that problem goes away after upgrading the ES client and Lucene. Otherwise, that will be something I need to report to the Lucene project, and maybe to @nknize.