Neel-Gagan opened this issue 5 years ago
How large? What is the error message? Did you change the Java heap settings?
Yes, I did change the heap settings (-Xms15g, -Xmx15g). The error states: The request entity is too large. The file has more than 100,000 characters, but beyond that the rest of the content is not crawled.
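For context, flags like -Xms/-Xmx normally go into Elasticsearch's config/jvm.options, while FSCrawler's own heap can be passed through the FS_JAVA_OPTS environment variable if your launch script supports it. A rough sketch only; the values and the job name are illustrative assumptions, not the poster's real configuration:

# config/jvm.options (Elasticsearch heap; illustrative values)
-Xms15g
-Xmx15g

# FSCrawler's own heap on Windows (illustrative; assumes FS_JAVA_OPTS is honoured)
set FS_JAVA_OPTS=-Xms4g -Xmx4g
bin\fscrawler.bat job_name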
Could you share your FSCrawler settings for this job?
{
"name" : "size",
"fs" : {
"url" : "C:\Test",
"update_rate" : "1m",
"excludes" : [ "~*"],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : false,
"add_as_inner_object" : false,
"store_source" : false,
"index_content" : true,
"attributes_support" : true,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false,
"continue_on_error" : true,
"indexded_chars" : "100%",
"pdf_ocr" : true,
"ocr" : {
"language" : "eng"
}
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9201,
"scheme" : "HTTP"
} ],
"bulk_size" : 100,
"flush_interval" : "5s"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}
Could you add bulk_size: 10mb in the elasticsearch settings? See https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html
After making the changes I get the error below: Got a hard failure when executing the bulk request.
The stack trace from the log files is given for reference.
An existing connection was forcibly closed by the remote host
11:38:25,659 WARN [f.p.e.c.f.FsParser] Full stacktrace
java.io.IOException: An existing connection was forcibly closed by the remote host
    at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:728) ~[elasticsearch-rest-client-6.3.2.jar:6.3.2]
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235) ~[elasticsearch-rest-client-6.3.2.jar:6.3.2]
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:198) ~[elasticsearch-rest-client-6.3.2.jar:6.3.2]
    at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:522) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
    at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:508) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
    at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:404) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
    at fr.pilato.elasticsearch.crawler.fs.FsParser.getFileDirectory(FsParser.java:356) ~[fscrawler-core-2.5.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:307) ~[fscrawler-core-2.5.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:290) ~[fscrawler-core-2.5.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:290) ~[fscrawler-core-2.5.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:290) ~[fscrawler-core-2.5.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParser.run(FsParser.java:167) [fscrawler-core-2.5.jar:?]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_171]
Caused by: java.io.IOException: An existing connection was forcibly closed by the remote host
    at sun.nio.ch.SocketDispatcher.read0(Native Method) ~[?:1.8.0_171]
    at sun.nio.ch.SocketDispatcher.read(Unknown Source) ~[?:1.8.0_171]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source) ~[?:1.8.0_171]
    at sun.nio.ch.IOUtil.read(Unknown Source) ~[?:1.8.0_171]
    at sun.nio.ch.SocketChannelImpl.read(Unknown Source) ~[?:1.8.0_171]
    at org.apache.http.impl.nio.reactor.SessionInputBufferImpl.fill(SessionInputBufferImpl.java:204) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.codecs.AbstractMessageParser.fillBuffer(AbstractMessageParser.java:136) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:241) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) ~[httpasyncclient-4.1.2.jar:4.1.2]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) ~[httpasyncclient-4.1.2.jar:4.1.2]
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) ~[httpcore-nio-4.4.5.jar:4.4.5]
    ... 1 more
11:38:25,659 INFO [f.p.e.c.f.FsParser] FS crawler is stopping after 1 run
11:38:25,675 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [d_bulk_data_search]
11:38:25,675 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
11:38:25,675 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
11:38:25,675 WARN [f.p.e.c.f.c.ElasticsearchClientManager] Got a hard failure when executing the bulk request
java.io.IOException: An existing connection was forcibly closed by the remote host
    at sun.nio.ch.SocketDispatcher.read0(Native Method) ~[?:1.8.0_171]
    at sun.nio.ch.SocketDispatcher.read(Unknown Source) ~[?:1.8.0_171]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source) ~[?:1.8.0_171]
    at sun.nio.ch.IOUtil.read(Unknown Source) ~[?:1.8.0_171]
    at sun.nio.ch.SocketChannelImpl.read(Unknown Source) ~[?:1.8.0_171]
    at org.apache.http.impl.nio.reactor.SessionInputBufferImpl.fill(SessionInputBufferImpl.java:204) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.codecs.AbstractMessageParser.fillBuffer(AbstractMessageParser.java:136) ~[httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:241) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) [httpasyncclient-4.1.2.jar:4.1.2]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) [httpasyncclient-4.1.2.jar:4.1.2]
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) [httpcore-nio-4.4.5.jar:4.4.5]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_171]
11:38:26,706 WARN [f.p.e.c.f.c.ElasticsearchClientManager] Got a hard failure when executing the bulk request
java.net.ConnectException: Connection refused: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_171]
    at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) ~[?:1.8.0_171]
    at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:171) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:145) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348) [httpcore-nio-4.4.5.jar:4.4.5]
    at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192) [httpasyncclient-4.1.2.jar:4.1.2]
    at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64) [httpasyncclient-4.1.2.jar:4.1.2]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_171]
11:38:26,706 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
11:38:26,737 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
11:38:26,737 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [d_bulk_data_search] stopped
11:38:26,737 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [d_bulk_data_search]
11:38:26,737 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
11:38:26,737 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
11:38:26,737 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
11:38:26,737 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
11:38:26,737 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [d_bulk_data_search] stopped
I'm confused by this fscrawler-core-2.5.jar, because you said you are using 2.6. Could you check that you are really using 2.6?
I initially used FSCrawler 2.6, then rolled back to FSCrawler 2.5, thinking that a stable version would not give the error, but I am getting the same error in both versions.
2.6 is stable, what do you mean? 2.7-SNAPSHOT is under development. I'd use that one, but sadly you can't, because it's not compatible with Elasticsearch versions before 6.7/6.8.
I am getting the same error in both versions. Are there any settings changes I need to make? Even "indexed_chars" : "-1" gives the same error: Got a hard failure when executing the bulk request.
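For reference, a rough sketch of the forms indexed_chars can take according to the FSCrawler documentation (the numbers are purely illustrative): a fixed character count, a percentage of the file size, or -1 to remove the limit entirely.

"indexed_chars" : 20000      (a fixed number of characters)
"indexed_chars" : "100%"     (a percentage of the file size)
"indexed_chars" : "-1"       (no limit: extract everything)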
What are your fscrawler settings now?
{
"name" : "d_bulk_data_search",
"fs" : {
"url" : "C:\\Test",
"update_rate" : "1m",
"excludes" : [ "~*"],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : false,
"add_as_inner_object" : false,
"store_source" : false,
"index_content" : true,
"attributes_support" : true,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false,
"continue_on_error" : true,
"indexded_chars" : "100%",
"pdf_ocr" : true,
"ocr" : {
"language" : "eng"
}
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9201,
"scheme" : "HTTP"
} ],
"bulk_size" : 100,
"flush_interval" : "5s"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}
Could you please format your code to make it easier to read? Use the <> icon or markdown format. Thanks.
Could you add "byte_size" : "10mb", please? The initial answer I gave about bulk_size was wrong as I meant byte_size instead. See documentation here: https://fscrawler.readthedocs.io/en/fscrawler-2.6/admin/fs/elasticsearch.html#bulk-settings
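For clarity, a minimal sketch of what the elasticsearch section looks like with byte_size added; the host, port, and other values are simply copied from the settings shared above.

"elasticsearch" : {
  "nodes" : [ {
    "host" : "127.0.0.1",
    "port" : 9201,
    "scheme" : "HTTP"
  } ],
  "bulk_size" : 100,
  "byte_size" : "10mb",
  "flush_interval" : "5s"
}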
Including "byte_size" : "10mb" in _settings.json gives the error: ElasticsearchException [type=illegal_argument_exception, reason=rejecting mapping to [d_bulk_data_search] as the final mapping would have more than 1 type].
Hmmm. That does not make sense to me. Are you using 2.6?
I am using 2.5.
Please use 2.6. It might work with 2.5 though.
Here you probably have indexed data already and this is conflicting. If you can, it's better to start from scratch and remove the existing index.
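A hedged example of removing the existing job index before restarting FSCrawler, assuming the job name d_bulk_data_search and the node address 127.0.0.1:9201 from the settings shared above:

curl -X DELETE "http://127.0.0.1:9201/d_bulk_data_search"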
I started from the beginning using FSCrawler 2.6.
1) On including "indexed_chars" : "100%" and "byte_size" : "10mb" in _settings.json, I get the error: The request entity is too large.
2) On including "indexed_chars" : "-1" and "byte_size" : "10mb" in _settings.json, I get the error: ElasticsearchException [type=illegal_argument_exception, reason=rejecting mapping to [bulk_data_search] as the final mapping would have more than 1 type].
Ok. It doesn't make sense to me but I'll try to reproduce when I have some spare cycles.
In your last example there is a typo in the settings file:
"indexded_chars" : "100%"
Should be:
"indexed_chars" : "100%"
Could you try again with FSCrawler 2.6? Please remove the existing index before launching FSCrawler.
With the corrected "indexed_chars" : "100%" in _settings.json and FSCrawler 2.6, I am getting the issue when I increase the character limit beyond one lakh (100,000). Up to one lakh it works fine, but at one lakh + 1 characters I again get the error: Request entity is too large.
Could you please share exactly all the steps, all the logs, and all the FSCrawler settings you have? I have a hard time following otherwise, so please share as many details as you can.
Also, could you run FSCrawler with the --debug option?
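For example, from the FSCrawler installation directory on Windows this would look roughly as follows; the job name is an assumption based on the earlier settings and logs.

bin\fscrawler.bat d_bulk_data_search --debug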
I have updated to FSCrawler 2.6, removed the index, and below are the settings I am running with. I have used both of these settings:
Case 1) "indexded_chars" : "100%" in _settings.json, but still getting the error: Request entity is too large.
Case 2) "indexded_chars" : "-1" in _settings.json, but getting the error: ElasticsearchException [type=illegal_argument_exception, reason=rejecting mapping to [bulk_data] as the final mapping would have more than 1 type].
1) _settings.json
{
"name" : "bulk_data",
"fs" : {
"url" : "C:\\Test",
"update_rate" : "1m",
"excludes" : [ "~*"],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : false,
"add_as_inner_object" : false,
"store_source" : false,
"index_content" : true,
"attributes_support" : true,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false,
"continue_on_error" : true,
"indexded_chars" : "100%",
"pdf_ocr" : true,
"ocr" : {
"language" : "eng"
}
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"bulk_size" : 100,
"flush_interval" : "5s"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}
You didn't add "byte_size" : "10mb". Did you remove both indices or only one? There's also a folder index.
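A hedged sketch of removing a leftover folder index as well; FSCrawler typically names it after the job with a _folder suffix, so with the job name bulk_data from the settings above and the node on port 9200 this would look roughly like:

curl -X DELETE "http://127.0.0.1:9200/bulk_data_folder"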
I had removed all the indices. Even with "byte_size" : "10mb" added, it gives the same error.
Could you run GET /_cat/indices?v? And which error do you get?
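If it helps, the same request can be issued from the command line; the host and port are assumptions based on the settings above.

curl -X GET "http://127.0.0.1:9200/_cat/indices?v"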
Attached is the output of GET /_cat/indices?v. Isn't there some limitation of one lakh (100,000) characters in the crawler while indexing?
health status index                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .monitoring-es-6-2019.06.08     FImr8Vz-S-auPbGBOxN9uw 1   0   95033      50           32.5mb     32.5mb
green  open   .monitoring-kibana-6-2019.06.08 ywVWbouMTTyOe-e834E4NQ 1   0   8635       0            1.7mb      1.7mb
green  open   .monitoring-es-6-2019.06.07     1mfpADfRQCS_JbKVsjX1YQ 1   0   41993      23           14.2mb     14.2mb
green  open   .monitoring-kibana-6-2019.06.10 Va-fHNX6QduHhAJVzdm8ow 1   0   1724       0            946.4kb    946.4kb
green  open   .monitoring-kibana-6-2019.06.09 LjlJi8cPT3mYbHgUtaZ94g 1   0   8634       0            1.7mb      1.7mb
green  open   .monitoring-es-6-2019.06.09     yoONcm07SMO3XzDwZFxRtw 1   0   112124     54           39.2mb     39.2mb
yellow open   bulk_datat                      1DuBbStSearGdtES2MNkw  5   1   0          0            1.2kb      1.2kb
green  open   .monitoring-kibana-6-2019.06.07 XYo-ci3TQWiUIt13jUpopA 1   0   4687       0            1.1mb      1.1mb
green  open   .kibana                         gRXkAAMFSDWYOjMgip2pEg 1   0   1          0            4kb        4kb
yellow open   size_folder                     UD9NvxtBTS63G_4ixsLZvA 5   1   0          0            1.2kb      1.2kb
green  open   .monitoring-es-6-2019.06.10     9zmZrBE-RvuaL1ovTFnoWw 1   0   26988      232          20.2mb     20.2mb
What is the heap size of Elasticsearch?
10 GB heap size
I have updated Elasticsearch to 6.8.0 and FSCrawler to 2.6. When I run the crawler on a folder separately, all contents are indexed. The settings are as follows:
{
"name" : "test",
"fs" : {
"url" : "D:\\ELK\\test_data",
"update_rate" : "15m",
"excludes" : [ "*/~*" ],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : true,
"add_as_inner_object" : false,
"store_source" : false,
"index_content" : true,
"attributes_support" : false,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false,
"continue_on_error" : false,
"indexed_chars" : "-1",
"pdf_ocr" : true,
"ocr" : {
"language" : "eng"
}
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"bulk_size" : 100,
"flush_interval" : "5s",
"byte_size" : "10mb"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}
But when I put this folder inside a sub-folder, Elasticsearch gets stopped, giving the error: Got a hard failure when executing the bulk request.
I tried many combinations of settings in _settings.json for:
"bulk_size" : 100,
"flush_interval" : "5s",
"byte_size" : "10mb"
I even tried byte_size : "1mb", bulk_size : "50", and bulk_size : "10". What exactly are the settings I should use so that the contents inside the sub-folder get indexed?
Did you solve the issue at the end? Sorry for not following up on this.
FSCrawler version: 2.6, Elasticsearch version: 6.4.3
I am not able to crawl files which are larger in size. I made changes to "indexed_chars" in the _settings.json file. Is there any other way to let the crawler crawl all of the contents of a file?