dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Can FSCrawler support text files in a non-UTF-8 encoding (Shift-JIS)? #400

Closed 710255930500 closed 7 years ago

710255930500 commented 7 years ago

Hi. I'm running FSCrawler on Windows Server 2012 R2 (Japanese version, default encoding MS932). When FSCrawler parses a text file in a non-UTF-8 encoding (Shift-JIS (MS932), and so on) through the Apache Tika library, Tika selects EmptyParser and reports the content type as application/octet-stream. So FSCrawler cannot extract the text content of the file.

I added the JVM options "-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8" and tried again. Tika App (tika-app-1.15.jar) can detect TxtParser and selects Content-Encoding Shift-JIS.

Can FSCrawler handle text files in a non-UTF-8 encoding? Is there a setting for this? Thank you.
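
For context, here is a minimal Java sketch (not FSCrawler or Tika code) of why the encoding matters: bytes written as MS932 are generally not valid UTF-8, so a reader that assumes UTF-8 cannot recover the text. The class name `ShiftJisDemo` and the helper `strictDecode` are made up for this illustration; `windows-31j` is the Java charset name for MS932, and the sample string is an assumption, not the content of the attached file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ShiftJisDemo {
    /**
     * Decodes the bytes strictly with the given charset.
     * Returns null when the bytes are not valid in that charset.
     */
    static String strictDecode(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Charset ms932 = Charset.forName("windows-31j");
        // Three kanji written with Unicode escapes so the source file encoding does not matter.
        String text = "\u65e5\u672c\u8a9e";
        byte[] bytes = text.getBytes(ms932);

        // The MS932 bytes are rejected by a strict UTF-8 decoder.
        System.out.println(strictDecode(bytes, StandardCharsets.UTF_8) == null); // true
        // Decoding with the charset that produced them restores the text.
        System.out.println(text.equals(strictDecode(bytes, ms932))); // true
    }
}
```

This only shows the decoding side. It does not explain why Tika chose EmptyParser for the file in the log below.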

14:13:48,957 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [1/doc.json] already exists
14:13:48,957 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [1/folder.json] already exists
14:13:48,957 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [1/_settings.json] already exists
14:13:48,957 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/doc.json] already exists
14:13:48,957 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/folder.json] already exists
14:13:48,957 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/_settings.json] already exists
14:13:48,957 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/doc.json] already exists
14:13:48,957 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/folder.json] already exists
14:13:48,957 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/_settings.json] already exists
14:13:48,957 DEBUG [f.p.e.c.f.FsCrawler] Cleaning existing status for job [tmp_es_2]...
14:13:48,957 DEBUG [f.p.e.c.f.FsCrawler] Starting job [tmp_es_2]...
14:13:49,082 TRACE [f.p.e.c.f.FsCrawler] settings used for this crawler: [{
  "name" : "tmp_es_2",
  "fs" : {
    "url" : "D:\\tmp\\ipk\\003\\001\\",
    "update_rate" : "15m",
    "includes" : [ "*.txt" ],
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : false,
    "lang_detect" : true,
    "continue_on_error" : true,
    "pdf_ocr" : false
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "10.83.159.166",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "index" : "tmp_es_2",
    "type" : "doc",
    "bulk_size" : 100,
    "flush_interval" : "5s"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}]
14:13:49,082 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
14:13:49,348 DEBUG [f.p.e.c.f.c.ElasticsearchClient] findVersion()
14:13:49,426 TRACE [f.p.e.c.f.c.ElasticsearchClient] get server response: {name=Betty Brant Leeds, cluster_name=elasticsearch, cluster_uuid=mPLQXllyS5-Kf7GJ5qRR7w, version={number=2.4.1, build_hash=c67dc32e24162035d18d6fe1e952c4cbcbe79d16, build_timestamp=2016-09-27T18:57:55Z, build_snapshot=false, lucene_version=5.5.2}, tagline=You Know, for Search}
14:13:49,426 DEBUG [f.p.e.c.f.c.ElasticsearchClient] findVersion() -> [2.4.1]
14:13:49,426 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch < 5, so we use [fields] as fields option
14:13:49,426 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch < 5, so we can't use ingest node feature
14:13:49,426 DEBUG [f.p.e.c.f.c.BulkProcessor] Creating a bulk processor with size [100], flush [5s], pipeline [null]
14:13:49,442 DEBUG [f.p.e.c.f.c.ElasticsearchClient] findVersion()
14:13:49,442 TRACE [f.p.e.c.f.c.ElasticsearchClient] get server response: {name=Betty Brant Leeds, cluster_name=elasticsearch, cluster_uuid=mPLQXllyS5-Kf7GJ5qRR7w, version={number=2.4.1, build_hash=c67dc32e24162035d18d6fe1e952c4cbcbe79d16, build_timestamp=2016-09-27T18:57:55Z, build_snapshot=false, lucene_version=5.5.2}, tagline=You Know, for Search}
14:13:49,442 DEBUG [f.p.e.c.f.c.ElasticsearchClient] findVersion() -> [2.4.1]
14:13:49,442 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] FS crawler connected to an elasticsearch [2.4.1] node.
14:13:49,442 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [tmp_es_2]
14:13:49,442 TRACE [f.p.e.c.f.c.ElasticsearchClient] index settings: [{
  "settings": {
    "analysis": {
      "analyzer": {
        "fscrawler_path": {
          "tokenizer": "fscrawler_path"
        }
      },
      "tokenizer": {
        "fscrawler_path": {
          "type": "path_hierarchy"
        }
      }
    }
  }
}
]
14:13:49,442 TRACE [f.p.e.c.f.c.ElasticsearchClient] index already exists. Ignoring error...
14:13:49,442 DEBUG [f.p.e.c.f.c.ElasticsearchClient] is existing type [tmp_es_2]/[doc]
14:13:49,457 TRACE [f.p.e.c.f.c.ElasticsearchClient] get index metadata response: {tmp_es_2={aliases={}, mappings={doc={properties={attachment={type=binary}, attributes={properties={group={type=string, index=not_analyzed}, owner={type=string, index=not_analyzed}}}, content={type=string}, file={properties={checksum={type=string, index=not_analyzed}, content_type={type=string, index=not_analyzed}, extension={type=string, index=not_analyzed}, filename={type=string, index=not_analyzed}, filesize={type=long}, indexed_chars={type=long}, indexing_date={type=date, format=dateOptionalTime}, last_modified={type=date, format=dateOptionalTime}, url={type=string, index=no}}}, meta={properties={author={type=string}, date={type=date, format=dateOptionalTime}, keywords={type=string}, language={type=string, index=not_analyzed}, raw={properties={Content-Encoding={type=string}, Content-Type={type=string}, X-Parsed-By={type=string}}}, title={type=string}}}, path={properties={encoded={type=string, index=not_analyzed}, real={type=string, index=not_analyzed, fields={tree={type=string, analyzer=fscrawler_path}}}, root={type=string, index=not_analyzed}, virtual={type=string, index=not_analyzed, fields={tree={type=string, analyzer=fscrawler_path}}}}}}}}, settings={index={creation_date=1498735137548, analysis={analyzer={fscrawler_path={tokenizer=fscrawler_path}}, tokenizer={fscrawler_path={type=path_hierarchy}}}, number_of_shards=5, number_of_replicas=1, uuid=dD5PEzXKTWS9AqQQpM6mxQ, version={created=2040199}}}, warmers={}}}
14:13:49,457 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Mapping [tmp_es_2]/[doc] already exists.
14:13:49,457 DEBUG [f.p.e.c.f.FsCrawlerImpl] creating fs crawler thread [tmp_es_2] for [D:\tmp\ipk\003\001\] every [15m]
14:13:49,457 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [tmp_es_2] for [D:\tmp\ipk\003\001\] every [15m]
14:13:49,457 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler thread [tmp_es_2] is now running. Run #1...
14:13:49,473 DEBUG [f.p.e.c.f.FsCrawlerImpl] indexing [D:\tmp\ipk\003\001\] content
14:13:49,473 DEBUG [f.p.e.c.f.f.FileAbstractor] Listing local files from D:\tmp\ipk\003\001\
14:13:49,473 TRACE [f.p.e.c.f.u.FsCrawlerUtil] Determining 'group' is skipped for file [D:\tmp\ipk\003\001\20161110_20161017_04.doc] on [windows server 2012]
14:13:49,473 TRACE [f.p.e.c.f.u.FsCrawlerUtil] Determining 'group' is skipped for file [D:\tmp\ipk\003\001\20161110_20161017_shiftjis.txt] on [windows server 2012]
14:13:49,473 DEBUG [f.p.e.c.f.f.FileAbstractor] 2 local files found
14:13:49,473 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_04.doc], includes = [[*.txt]], excludes = [[~*]]
14:13:49,473 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_04.doc], excludes = [[~*]]
14:13:49,473 TRACE [f.p.e.c.f.u.FsCrawlerUtil] regex is [~.*?]
14:13:49,473 TRACE [f.p.e.c.f.u.FsCrawlerUtil] does not match any exclude pattern
14:13:49,473 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_04.doc], includes = [[*.txt]]
14:13:49,473 TRACE [f.p.e.c.f.u.FsCrawlerUtil] regex is [.*?.txt]
14:13:49,473 TRACE [f.p.e.c.f.u.FsCrawlerUtil] does not match any include pattern
14:13:49,473 DEBUG [f.p.e.c.f.FsCrawlerImpl] [20161110_20161017_04.doc] can be indexed: [false]
14:13:49,473 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - ignored file/dir: 20161110_20161017_04.doc
14:13:49,473 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_shiftjis.txt], includes = [[*.txt]], excludes = [[~*]]
14:13:49,473 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_shiftjis.txt], excludes = [[~*]]
14:13:49,473 TRACE [f.p.e.c.f.u.FsCrawlerUtil] regex is [~.*?]
14:13:49,473 TRACE [f.p.e.c.f.u.FsCrawlerUtil] does not match any exclude pattern
14:13:49,473 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_shiftjis.txt], includes = [[*.txt]]
14:13:49,473 TRACE [f.p.e.c.f.u.FsCrawlerUtil] regex is [.*?.txt]
14:13:49,473 TRACE [f.p.e.c.f.u.FsCrawlerUtil] does match include regex
14:13:49,473 DEBUG [f.p.e.c.f.FsCrawlerImpl] [20161110_20161017_shiftjis.txt] can be indexed: [true]
14:13:49,473 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - file: 20161110_20161017_shiftjis.txt
14:13:49,473 DEBUG [f.p.e.c.f.FsCrawlerImpl] fetching content from [D:\tmp\ipk\003\001\],[20161110_20161017_shiftjis.txt]
14:13:49,473 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] computeVirtualPathName(D:\tmp\ipk\003\001\, D:\tmp\ipk\003\001\20161110_20161017_shiftjis.txt) = 20161110_20161017_shiftjis.txt
14:13:49,473 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [20161110_20161017_shiftjis.txt]
14:13:49,488 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
14:13:49,754 WARN  [o.a.t.p.i.ImageParser] JBIG2ImageReader not loaded. jbig2 files will be ignored
14:13:49,973 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
14:13:49,973 TRACE [f.p.e.c.f.t.TikaDocParser] Listing all available metadata:
14:13:49,973 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("X-Parsed-By", "org.apache.tika.parser.EmptyParser"));
14:13:49,973 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Content-Type", "application/octet-stream"));
14:13:50,942 TRACE [f.p.e.c.f.t.TikaDocParser] Main detected language: [: NONE (0.000000)]
14:13:50,942 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
14:13:50,957 DEBUG [f.p.e.c.f.FsCrawlerImpl] Indexing in ES tmp_es_2, doc, 3f75d41043ad2daf532e1ae3c9bf7cb
14:13:50,957 TRACE [f.p.e.c.f.FsCrawlerImpl] JSon indexed : {
  "meta" : {
    "raw" : {
      "X-Parsed-By" : "org.apache.tika.parser.EmptyParser",
      "Content-Type" : "application/octet-stream"
    }
  },
  "file" : {
    "extension" : "txt",
    "content_type" : "application/octet-stream",
    "last_modified" : "2017-07-06T05:02:16.332+0000",
    "indexing_date" : "2017-07-06T05:13:49.473+0000",
    "filesize" : 378,
    "filename" : "20161110_20161017_shiftjis.txt",
    "url" : "file://D:\\tmp\\ipk\\003\\001\\20161110_20161017_shiftjis.txt"
  },
  "path" : {
    "root" : "f6ab773b58e15c04779221b211abb81",
    "virtual" : "20161110_20161017_shiftjis.txt",
    "real" : "D:\\tmp\\ipk\\003\\001\\20161110_20161017_shiftjis.txt"
  }
}
14:13:50,973 DEBUG [f.p.e.c.f.c.BulkProcessor] {"index":{"_index":"tmp_es_2","_type":"doc","_id":"3f75d41043ad2daf532e1ae3c9bf7cb"}}
{
  "meta" : {
    "raw" : {
      "X-Parsed-By" : "org.apache.tika.parser.EmptyParser",
      "Content-Type" : "application/octet-stream"
    }
  },
  "file" : {
    "extension" : "txt",
    "content_type" : "application/octet-stream",
    "last_modified" : "2017-07-06T05:02:16.332+0000",
    "indexing_date" : "2017-07-06T05:13:49.473+0000",
    "filesize" : 378,
    "filename" : "20161110_20161017_shiftjis.txt",
    "url" : "file://D:\\tmp\\ipk\\003\\001\\20161110_20161017_shiftjis.txt"
  },
  "path" : {
    "root" : "f6ab773b58e15c04779221b211abb81",
    "virtual" : "20161110_20161017_shiftjis.txt",
    "real" : "D:\\tmp\\ipk\\003\\001\\20161110_20161017_shiftjis.txt"
  }
}
14:13:50,973 DEBUG [f.p.e.c.f.FsCrawlerImpl] Looking for removed files in [D:\tmp\ipk\003\001\]...
14:13:50,973 TRACE [f.p.e.c.f.FsCrawlerImpl] Querying elasticsearch for files in dir [path.root:f6ab773b58e15c04779221b211abb81]
14:13:50,973 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [tmp_es_2]/[doc], request [SearchRequest{query=path.root:f6ab773b58e15c04779221b211abb81, fields=[_source, file.filename], size=10000}]
14:13:50,988 TRACE [f.p.e.c.f.c.ElasticsearchClient] search response: SearchResponse{hits=Hits{hits=[Hit{index=tmp_es_2, type=doc, id=87577342e971d94ae7d5ba5be38656e, version=null, source={meta={raw={X-Parsed-By=org.apache.tika.parser.EmptyParser, Content-Type=application/octet-stream}}, file={extension=txt, content_type=application/octet-stream, last_modified=2017-07-03T05:03:09.678+0000, indexing_date=2017-07-06T04:22:59.065+0000, filesize=2474, filename=20161110_20161017_euc.txt, url=file://D:\tmp\ipk\003\001\20161110_20161017_euc.txt}, path={root=f6ab773b58e15c04779221b211abb81, virtual=20161110_20161017_euc.txt, real=D:\tmp\ipk\003\001\20161110_20161017_euc.txt}}, fields={file.filename=[20161110_20161017_euc.txt]}, highlight=null}, Hit{index=tmp_es_2, type=doc, id=3f75d41043ad2daf532e1ae3c9bf7cb, version=null, source={meta={raw={X-Parsed-By=org.apache.tika.parser.EmptyParser, Content-Type=application/octet-stream}}, file={extension=txt, content_type=application/octet-stream, last_modified=2017-07-03T04:30:00.845+0000, indexing_date=2017-07-06T04:23:00.674+0000, filesize=2474, filename=20161110_20161017_shiftjis.txt, url=file://D:\tmp\ipk\003\001\20161110_20161017_shiftjis.txt}, path={root=f6ab773b58e15c04779221b211abb81, virtual=20161110_20161017_shiftjis.txt, real=D:\tmp\ipk\003\001\20161110_20161017_shiftjis.txt}}, fields={file.filename=[20161110_20161017_shiftjis.txt]}, highlight=null}], total=2}, aggregations=null}
14:13:50,988 TRACE [f.p.e.c.f.FsCrawlerImpl] Response [SearchResponse{hits=Hits{hits=[Hit{index=tmp_es_2, type=doc, id=87577342e971d94ae7d5ba5be38656e, version=null, source={meta={raw={X-Parsed-By=org.apache.tika.parser.EmptyParser, Content-Type=application/octet-stream}}, file={extension=txt, content_type=application/octet-stream, last_modified=2017-07-03T05:03:09.678+0000, indexing_date=2017-07-06T04:22:59.065+0000, filesize=2474, filename=20161110_20161017_euc.txt, url=file://D:\tmp\ipk\003\001\20161110_20161017_euc.txt}, path={root=f6ab773b58e15c04779221b211abb81, virtual=20161110_20161017_euc.txt, real=D:\tmp\ipk\003\001\20161110_20161017_euc.txt}}, fields={file.filename=[20161110_20161017_euc.txt]}, highlight=null}, Hit{index=tmp_es_2, type=doc, id=3f75d41043ad2daf532e1ae3c9bf7cb, version=null, source={meta={raw={X-Parsed-By=org.apache.tika.parser.EmptyParser, Content-Type=application/octet-stream}}, file={extension=txt, content_type=application/octet-stream, last_modified=2017-07-03T04:30:00.845+0000, indexing_date=2017-07-06T04:23:00.674+0000, filesize=2474, filename=20161110_20161017_shiftjis.txt, url=file://D:\tmp\ipk\003\001\20161110_20161017_shiftjis.txt}, path={root=f6ab773b58e15c04779221b211abb81, virtual=20161110_20161017_shiftjis.txt, real=D:\tmp\ipk\003\001\20161110_20161017_shiftjis.txt}}, fields={file.filename=[20161110_20161017_shiftjis.txt]}, highlight=null}], total=2}, aggregations=null}]
14:13:50,988 TRACE [f.p.e.c.f.FsCrawlerImpl] Checking file [20161110_20161017_euc.txt]
14:13:50,988 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_euc.txt], includes = [[*.txt]], excludes = [[~*]]
14:13:50,988 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_euc.txt], excludes = [[~*]]
14:13:50,988 TRACE [f.p.e.c.f.u.FsCrawlerUtil] regex is [~.*?]
14:13:50,988 TRACE [f.p.e.c.f.u.FsCrawlerUtil] does not match any exclude pattern
14:13:50,988 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_euc.txt], includes = [[*.txt]]
14:13:50,988 TRACE [f.p.e.c.f.u.FsCrawlerUtil] regex is [.*?.txt]
14:13:51,004 TRACE [f.p.e.c.f.u.FsCrawlerUtil] does match include regex
14:13:51,004 TRACE [f.p.e.c.f.FsCrawlerImpl] Removing file [20161110_20161017_euc.txt] in elasticsearch
14:13:51,004 DEBUG [f.p.e.c.f.FsCrawlerImpl] Deleting from ES tmp_es_2, doc, 87577342e971d94ae7d5ba5be38656e
14:13:51,004 DEBUG [f.p.e.c.f.c.BulkProcessor] {"delete":{"_index":"tmp_es_2","_type":"doc","_id":"87577342e971d94ae7d5ba5be38656e"}}
14:13:51,004 TRACE [f.p.e.c.f.FsCrawlerImpl] Checking file [20161110_20161017_shiftjis.txt]
14:13:51,004 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_shiftjis.txt], includes = [[*.txt]], excludes = [[~*]]
14:13:51,004 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_shiftjis.txt], excludes = [[~*]]
14:13:51,004 TRACE [f.p.e.c.f.u.FsCrawlerUtil] regex is [~.*?]
14:13:51,004 TRACE [f.p.e.c.f.u.FsCrawlerUtil] does not match any exclude pattern
14:13:51,004 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_shiftjis.txt], includes = [[*.txt]]
14:13:51,004 TRACE [f.p.e.c.f.u.FsCrawlerUtil] regex is [.*?.txt]
14:13:51,004 TRACE [f.p.e.c.f.u.FsCrawlerUtil] does match include regex
14:13:51,004 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler is stopping after 1 run
14:13:51,051 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [tmp_es_2]
14:13:51,051 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
14:13:51,051 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler Rest service stopped
14:13:51,051 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
14:13:51,051 DEBUG [f.p.e.c.f.c.BulkProcessor] Closing BulkProcessor
14:13:51,051 DEBUG [f.p.e.c.f.c.BulkProcessor] BulkProcessor is now closed
14:13:51,051 DEBUG [f.p.e.c.f.c.BulkProcessor] Executing [2] remaining actions
14:13:51,051 DEBUG [f.p.e.c.f.c.BulkProcessor] Going to execute new bulk composed of 2 actions
14:13:51,051 TRACE [f.p.e.c.f.c.ElasticsearchClient] going to send a bulk
14:13:51,051 TRACE [f.p.e.c.f.c.ElasticsearchClient] {"index":{"_index":"tmp_es_2","_type":"doc","_id":"3f75d41043ad2daf532e1ae3c9bf7cb"}}
{
  "meta" : {
    "raw" : {
      "X-Parsed-By" : "org.apache.tika.parser.EmptyParser",
      "Content-Type" : "application/octet-stream"
    }
  },
  "file" : {
    "extension" : "txt",
    "content_type" : "application/octet-stream",
    "last_modified" : "2017-07-06T05:02:16.332+0000",
    "indexing_date" : "2017-07-06T05:13:49.473+0000",
    "filesize" : 378,
    "filename" : "20161110_20161017_shiftjis.txt",
    "url" : "file://D:\\tmp\\ipk\\003\\001\\20161110_20161017_shiftjis.txt"
  },
  "path" : {
    "root" : "f6ab773b58e15c04779221b211abb81",
    "virtual" : "20161110_20161017_shiftjis.txt",
    "real" : "D:\\tmp\\ipk\\003\\001\\20161110_20161017_shiftjis.txt"
  }
}
{"delete":{"_index":"tmp_es_2","_type":"doc","_id":"87577342e971d94ae7d5ba5be38656e"}}

14:13:51,145 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk response: BulkResponse{items=[BulkItemTopLevelResponse{index=BulkItemResponse{failed=false, index='tmp_es_2', type='doc', id='3f75d41043ad2daf532e1ae3c9bf7cb', opType=null, failureMessage='null'}, delete=null}, BulkItemTopLevelResponse{index=null, delete=BulkItemResponse{failed=false, index='tmp_es_2', type='doc', id='87577342e971d94ae7d5ba5be38656e', opType=null, failureMessage='null'}}]}
14:13:51,145 DEBUG [f.p.e.c.f.c.BulkProcessor] Executed bulk composed of 2 actions
14:13:51,145 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
14:13:51,145 DEBUG [f.p.e.c.f.c.ElasticsearchClient] REST client closed
14:13:51,145 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
14:13:51,145 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [tmp_es_2] stopped
14:13:51,145 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [tmp_es_2]
14:13:51,145 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
14:13:51,145 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler Rest service stopped
14:13:51,145 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
14:13:51,145 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
14:13:51,145 DEBUG [f.p.e.c.f.c.ElasticsearchClient] REST client closed
14:13:51,145 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
14:13:51,145 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [tmp_es_2] stopped

20161110_20161017_shiftjis.txt

dadoonet commented 7 years ago

Tika is supposed to detect the encoding automatically using the EncodingDetector.

Not sure why it does not work in this context. Maybe something to report to the Tika project itself?

Maybe @tballison has an idea?

tballison commented 7 years ago

Encoding detection is not perfect, nor is MIME type identification; and given that this file is getting passed to EmptyParser, I'm guessing the MIME identification is where the failure is.

If you're able to share a file publicly, please post on our JIRA. If you can share it personally: tallison [AT] apache [DOT] org.

dadoonet commented 7 years ago

Thanks @tballison. The file is public as he uploaded it at https://github.com/dadoonet/fscrawler/files/1127034/20161110_20161017_shiftjis.txt

tballison commented 7 years ago

https://issues.apache.org/jira/browse/TIKA-2437

tballison commented 7 years ago

Is a file name available in the application, and does FSCrawler pass the file name to Tika? For example:

    // Give Tika the file name as a hint through the metadata.
    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, "testTXT_shiftjis.txt");

    // getXML is a helper from Tika's own test base class (TikaTest).
    System.out.println(getXML("testTXT_shiftjis.txt", metadata).xml);

When you add the file name, Tika correctly parses the file.

dadoonet commented 7 years ago

Interesting. No, I don't do that. Will fix then. Thanks a lot!

tballison commented 7 years ago

Great. Trying to figure out whether something is binary or text is actually kind of hard. If you can generally trust the file suffixes, and if they actually exist, then that is the best route!
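
The "trust the suffix when it exists" idea can be sketched in plain Java as follows. This is a crude stand-in under stated assumptions, not Tika's actual detection logic: the class `SuffixFirstDetector`, its tiny suffix table, and the ASCII-only content check are all hypothetical and exist only to show why a missing file name can turn a Shift-JIS text file into `application/octet-stream`.

```java
import java.util.Locale;
import java.util.Map;

class SuffixFirstDetector {
    // Hypothetical, deliberately tiny suffix table; a real detector knows far more types.
    private static final Map<String, String> BY_SUFFIX = Map.of(
            "txt", "text/plain",
            "doc", "application/msword",
            "pdf", "application/pdf");

    /** Uses the file suffix when one is known, otherwise falls back to a naive content check. */
    static String detect(String fileName, byte[] head) {
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0 && dot < fileName.length() - 1) {
                String suffix = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
                String type = BY_SUFFIX.get(suffix);
                if (type != null) {
                    return type;
                }
            }
        }
        return looksLikeAsciiText(head) ? "text/plain" : "application/octet-stream";
    }

    /** Naive check: every byte is printable ASCII or common whitespace. */
    static boolean looksLikeAsciiText(byte[] head) {
        for (byte b : head) {
            int v = b & 0xFF;
            boolean printable = v >= 0x20 && v < 0x7F;
            boolean whitespace = v == '\t' || v == '\n' || v == '\r';
            if (!printable && !whitespace) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // First bytes of a Shift-JIS encoded string: not ASCII, so the naive check gives up.
        byte[] sjis = {(byte) 0x93, (byte) 0xFA, (byte) 0x96, 0x7B};
        System.out.println(detect(null, sjis));
        System.out.println(detect("20161110_20161017_shiftjis.txt", sjis));
    }
}
```

Without a name, the sketch falls back to the content check and reports a binary type for the Shift-JIS bytes. With the `.txt` name, the suffix decides. The trade-off is the one mentioned above: this only helps when suffixes exist and can be trusted.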