Neel-Gagan opened this issue 5 years ago
How large? What is the error message? Did you change the Java heap settings?
Yes, I did change the heap settings (-Xms15g, -Xmx15g). The error states: The request entity is too large. The file has more than 100,000 characters, but beyond that the rest of the content is not crawled.
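For context, flags like -Xms/-Xmx normally go into Elasticsearch's config/jvm.options, while FSCrawler's own heap can be passed through the FS_JAVA_OPTS environment variable if your launch script supports it. A rough sketch only; the values and the job name are illustrative assumptions, not the poster's real configuration:

# config/jvm.options (Elasticsearch heap; illustrative values)
-Xms15g
-Xmx15g

# FSCrawler's own heap on Windows (illustrative; assumes FS_JAVA_OPTS is honoured)
set FS_JAVA_OPTS=-Xms4g -Xmx4g
bin\fscrawler.bat job_name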
Could you share your FSCrawler settings for this job?
{
"name" : "size",
"fs" : {
"url" : "C:\Test",
"update_rate" : "1m",
"excludes" : [ "~*"],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : false,
"add_as_inner_object" : false,
"store_source" : false,
"index_content" : true,
"attributes_support" : true,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false,
"continue_on_error" : true,
"indexded_chars" : "100%",
"pdf_ocr" : true,
"ocr" : {
"language" : "eng"
}
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9201,
"scheme" : "HTTP"
} ],
"bulk_size" : 100,
"flush_interval" : "5s"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}
Could you add bulk_size: 10mb in the elasticsearch settings? See https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html
After making the changes I get the error below: Got a hard failure when executing the bulk request.
The stack trace from the log files is given for reference.
An existing connection was forcibly closed by the remote host
11:38:25,659 WARN [f.p.e.c.f.FsParser] Full stacktrace
java.io.IOException: An existing connection was forcibly closed by the remote host
    at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:728) ~[elasticsearch-rest-client-6.3.2.jar:6.3.2]
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235) ~[elasticsearch-rest-client-6.3.2.jar:6.3.2]
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:198) ~[elasticsearch-rest-client-6.3.2.jar:6.3.2]
    at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:522) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
    at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:508) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
    at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:404) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
    at fr.pilato.elasticsearch.crawler.fs.FsParser.getFileDirectory(FsParser.java:356) ~[fscrawler-core-2.5.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:307) ~[fscrawler-core-2.5.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:290) ~[fscrawler-core-2.5.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:290) ~[fscrawler-core-2.5.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:290) ~[fscrawler-core-2.5.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParser.run(FsParser.java:167) [fscrawler-core-2.5.jar:?]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_171]
Caused by: java.io.IOException: An existing connection was forcibly closed by the remote host
    at sun.nio.ch.SocketDispatcher.read0(Native Method) ~[?:1.8.0_171]
    at sun.nio.ch.SocketDispatcher.read(Unknown Source) ~[?:1.8.0_171]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source) ~[?:1.8.0_171]
    at sun.nio.ch.IOUtil.read(Unknown Source) ~[?:1.8.0_171]
    at sun.nio.ch.SocketChannelImpl.read(Unknown Source) ~[?:1.8.0_171]
    at org.apache.http.impl.nio.reactor.SessionInputBufferImpl.fill(SessionInputBufferImpl.java:204) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.codecs.AbstractMessageParser.fillBuffer(AbstractMessageParser.java:136) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:241) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) ~[httpasyncclient-4.1.2.jar:4.1.2]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) ~[httpasyncclient-4.1.2.jar:4.1.2]
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) ~[httpcore-nio-4.4.5.jar:4.4.5]
    ... 1 more
11:38:25,659 INFO [f.p.e.c.f.FsParser] FS crawler is stopping after 1 run
11:38:25,675 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [d_bulk_data_search]
11:38:25,675 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
11:38:25,675 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
11:38:25,675 WARN [f.p.e.c.f.c.ElasticsearchClientManager] Got a hard failure when executing the bulk request
java.io.IOException: An existing connection was forcibly closed by the remote host
    at sun.nio.ch.SocketDispatcher.read0(Native Method) ~[?:1.8.0_171]
    at sun.nio.ch.SocketDispatcher.read(Unknown Source) ~[?:1.8.0_171]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source) ~[?:1.8.0_171]
    at sun.nio.ch.IOUtil.read(Unknown Source) ~[?:1.8.0_171]
    at sun.nio.ch.SocketChannelImpl.read(Unknown Source) ~[?:1.8.0_171]
    at org.apache.http.impl.nio.reactor.SessionInputBufferImpl.fill(SessionInputBufferImpl.java:204) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.codecs.AbstractMessageParser.fillBuffer(AbstractMessageParser.java:136) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:241) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) [httpasyncclient-4.1.2.jar:4.1.2]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) [httpasyncclient-4.1.2.jar:4.1.2]
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) [httpcore-nio-4.4.5.jar:4.4.5]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_171]
11:38:26,706 WARN [f.p.e.c.f.c.ElasticsearchClientManager] Got a hard failure when executing the bulk request
java.net.ConnectException: Connection refused: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_171]
    at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) ~[?:1.8.0_171]
    at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:171) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:145) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192) [httpasyncclient-4.1.2.jar:4.1.2]
    at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64) [httpasyncclient-4.1.2.jar:4.1.2]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_171]
11:38:26,706 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
11:38:26,737 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
11:38:26,737 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [d_bulk_data_search] stopped
11:38:26,737 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [d_bulk_data_search]
11:38:26,737 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
11:38:26,737 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
11:38:26,737 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
11:38:26,737 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
11:38:26,737 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [d_bulk_data_search] stopped
I'm confused by this fscrawler-core-2.5.jar, because you said you are using 2.6. Could you check that you are really using 2.6?
I initially used FSCrawler 2.6, then rolled back to FSCrawler 2.5, thinking that a stable version would not give the error, but I am getting the same error in both versions.
2.6 is stable, what do you mean? 2.7-SNAPSHOT is under development. I'd use that one, but sadly you can't, because it's not compatible with Elasticsearch versions before 6.7/6.8.
I am getting the same error in both versions. Are there any settings changes I need to make? Even "indexed_chars" : "-1" gives the same error: Got a hard failure when executing the bulk request.
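For reference, a rough sketch of the forms indexed_chars can take according to the FSCrawler documentation (the numbers are purely illustrative): a fixed character count, a percentage of the file size, or -1 to remove the limit entirely.

"indexed_chars" : 20000      (a fixed number of characters)
"indexed_chars" : "100%"     (a percentage of the file size)
"indexed_chars" : "-1"       (no limit: extract everything)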
What are your fscrawler settings now?
{
"name" : "d_bulk_data_search",
"fs" : {
"url" : "C:\\Test",
"update_rate" : "1m",
"excludes" : [ "~*"],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : false,
"add_as_inner_object" : false,
"store_source" : false,
"index_content" : true,
"attributes_support" : true,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false,
"continue_on_error" : true,
"indexded_chars" : "100%",
"pdf_ocr" : true,
"ocr" : {
"language" : "eng"
}
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9201,
"scheme" : "HTTP"
} ],
"bulk_size" : 100,
"flush_interval" : "5s"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}
Could you please format your code to make it easier to read? Use the <> icon or markdown format. Thanks.
Could you add "byte_size" : "10mb", please? The initial answer I gave about bulk_size was wrong as I meant byte_size instead. See documentation here: https://fscrawler.readthedocs.io/en/fscrawler-2.6/admin/fs/elasticsearch.html#bulk-settings
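For clarity, a minimal sketch of what the elasticsearch section looks like with byte_size added; the host, port, and other values are simply copied from the settings shared above.

"elasticsearch" : {
  "nodes" : [ {
    "host" : "127.0.0.1",
    "port" : 9201,
    "scheme" : "HTTP"
  } ],
  "bulk_size" : 100,
  "byte_size" : "10mb",
  "flush_interval" : "5s"
}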
Including "byte_size" : "10mb" in _settings.json gives the error: ElasticsearchException [type=illegal_argument_exception, reason=rejecting mapping to [d_bulk_data_search] as the final mapping would have more than 1 type].
Hmmm. That does not make sense to me. Are you using 2.6?
I am using 2.5.
Please use 2.6. It might work with 2.5 though.
Here you probably have indexed data already and this is conflicting. If you can, it's better to start from scratch and remove the existing index.
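A hedged example of removing the existing job index before restarting FSCrawler, assuming the job name d_bulk_data_search and the node address 127.0.0.1:9201 from the settings shared above:

curl -X DELETE "http://127.0.0.1:9201/d_bulk_data_search"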
I started from the beginning using FSCrawler 2.6.
1) On including "indexed_chars" : "100%" and "byte_size" : "10mb" in _settings.json, I get the error: The request entity is too large.
2) On including "indexed_chars" : "-1" and "byte_size" : "10mb" in _settings.json, I get the error: ElasticsearchException [type=illegal_argument_exception, reason=rejecting mapping to [bulk_data_search] as the final mapping would have more than 1 type].
Ok. It doesn't make sense to me but I'll try to reproduce when I have some spare cycles.
In your last example there is a typo in the settings file:
"indexded_chars" : "100%"
Should be:
"indexed_chars" : "100%"
Could you try again with FSCrawler 2.6? Please remove the existing index before launching FSCrawler.
With the corrected "indexed_chars" : "100%" in _settings.json and FSCrawler 2.6, I am getting the issue when I increase the character limit beyond one lakh (100,000). Up to one lakh it works fine, but at one lakh + 1 characters I again get the error: Request entity is too large.
Could you please share exactly all the steps, all the logs, and all the FSCrawler settings you have? I have a hard time following otherwise, so please share as many details as you can.
Also, could you run FSCrawler with the --debug option?
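For example, from the FSCrawler installation directory on Windows this would look roughly as follows; the job name is an assumption based on the earlier settings and logs.

bin\fscrawler.bat d_bulk_data_search --debug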
I have updated to FSCrawler 2.6, removed the index, and below are the settings I am running with. I have used both of these settings:
Case 1) "indexded_chars" : "100%" in _settings.json, but still getting the error: Request entity is too large.
Case 2) "indexded_chars" : "-1" in _settings.json, but getting the error: ElasticsearchException [type=illegal_argument_exception, reason=rejecting mapping to [bulk_data] as the final mapping would have more than 1 type].
1) _settings.json
{
"name" : "bulk_data",
"fs" : {
"url" : "C:\\Test",
"update_rate" : "1m",
"excludes" : [ "~*"],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : false,
"add_as_inner_object" : false,
"store_source" : false,
"index_content" : true,
"attributes_support" : true,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false,
"continue_on_error" : true,
"indexded_chars" : "100%",
"pdf_ocr" : true,
"ocr" : {
"language" : "eng"
}
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"bulk_size" : 100,
"flush_interval" : "5s"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}
You didn't add "byte_size" : "10mb". Did you remove both indices or only one? There's also a folder index.
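A hedged sketch of removing a leftover folder index as well; FSCrawler typically names it after the job with a _folder suffix, so with the job name bulk_data from the settings above and the node on port 9200 this would look roughly like:

curl -X DELETE "http://127.0.0.1:9200/bulk_data_folder"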
I had removed all the indices. Even with "byte_size" : "10mb" added, it gives the same error.
Could you run GET /_cat/indices?v? And which error do you get?
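If it helps, the same request can be issued from the command line; the host and port are assumptions based on the settings above.

curl -X GET "http://127.0.0.1:9200/_cat/indices?v"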
Attached is the output of GET /_cat/indices?v. Isn't there some limitation of one lakh (100,000) characters in the crawler while indexing?
health status index                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .monitoring-es-6-2019.06.08     FImr8Vz-S-auPbGBOxN9uw 1   0   95033      50           32.5mb     32.5mb
green  open   .monitoring-kibana-6-2019.06.08 ywVWbouMTTyOe-e834E4NQ 1   0   8635       0            1.7mb      1.7mb
green  open   .monitoring-es-6-2019.06.07     1mfpADfRQCS_JbKVsjX1YQ 1   0   41993      23           14.2mb     14.2mb
green  open   .monitoring-kibana-6-2019.06.10 Va-fHNX6QduHhAJVzdm8ow 1   0   1724       0            946.4kb    946.4kb
green  open   .monitoring-kibana-6-2019.06.09 LjlJi8cPT3mYbHgUtaZ94g 1   0   8634       0            1.7mb      1.7mb
green  open   .monitoring-es-6-2019.06.09     yoONcm07SMO3XzDwZFxRtw 1   0   112124     54           39.2mb     39.2mb
yellow open   bulk_datat                      1DuBbStSearGdtES2MNkw  5   1   0          0            1.2kb      1.2kb
green  open   .monitoring-kibana-6-2019.06.07 XYo-ci3TQWiUIt13jUpopA 1   0   4687       0            1.1mb      1.1mb
green  open   .kibana                         gRXkAAMFSDWYOjMgip2pEg 1   0   1          0            4kb        4kb
yellow open   size_folder                     UD9NvxtBTS63G_4ixsLZvA 5   1   0          0            1.2kb      1.2kb
green  open   .monitoring-es-6-2019.06.10     9zmZrBE-RvuaL1ovTFnoWw 1   0   26988      232          20.2mb     20.2mb
What is the heap size of Elasticsearch?
10 GB heap size
I have updated Elasticsearch to 6.8.0 and FSCrawler to 2.6. When I run the crawler on a folder separately, all contents are indexed. The settings are as follows:
{
"name" : "test",
"fs" : {
"url" : "D:\\ELK\\test_data",
"update_rate" : "15m",
"excludes" : [ "*/~*" ],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : true,
"add_as_inner_object" : false,
"store_source" : false,
"index_content" : true,
"attributes_support" : false,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false,
"continue_on_error" : false,
"indexed_chars" : "-1",
"pdf_ocr" : true,
"ocr" : {
"language" : "eng"
}
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"bulk_size" : 100,
"flush_interval" : "5s",
"byte_size" : "10mb"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}
But when I put this folder inside a sub-folder, Elasticsearch gets stopped, giving the error: Got a hard failure when executing the bulk request.
I tried many combinations of settings in _settings.json for:
"bulk_size" : 100,
"flush_interval" : "5s",
"byte_size" : "10mb"
I even tried byte_size : "1mb", bulk_size : "50", and bulk_size : "10". What exactly are the settings I should use so that the contents inside the sub-folder get indexed?
Did you solve the issue at the end? Sorry for not following up on this.
FSCrawler version: 2.6, Elasticsearch version: 6.4.3
I am not able to crawl files which are larger in size. I made changes to "indexed_chars" in the _settings.json file. Is there any other way to let the crawler crawl all of the contents of a file?