Can FSCrawler support Text file encoding non UTF-8 (Shift-JIS)?

710255930500 commented 7 years ago

Hi. I'm running fscrawler on Windows Server 2012 R2 (Japanese version, default encoding MS932). When fscrawler parses a text file in a non-UTF-8 encoding (Shift-JIS (MS932) and so on) through the Apache Tika library, Tika selects EmptyParser and the content type is application/octet-stream, so fscrawler cannot extract the file's text content.

I added the JVM options "-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8" and tried again. Tika App (tika-app-1.15.jar) can detect TxtParser and selects Content-Encoding Shift-JIS.

Can FSCrawler handle text files in non-UTF-8 encodings? Is there a setting for this? Thank you.
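As background for why the encoding matters, here is a minimal sketch using only the Java standard library. It does not involve FSCrawler or Tika; the class name `ShiftJisDecodeDemo` and the sample string are made up for illustration. It assumes the JVM ships the `Shift_JIS` charset, which is usual for a full JDK but not guaranteed on stripped-down runtimes.

```java
import java.nio.charset.Charset;

public class ShiftJisDecodeDemo {
    // Decode raw bytes with an explicitly named charset rather than the JVM default.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Sample text written with Unicode escapes so the source file encoding does not matter.
        String original = "\u65E5\u672C\u8A9E";
        byte[] sjisBytes = original.getBytes(Charset.forName("Shift_JIS"));

        // Decoding with the charset the bytes were written in restores the text.
        System.out.println(decode(sjisBytes, "Shift_JIS").equals(original));
        // Decoding the same bytes as UTF-8 does not, because they are not valid UTF-8.
        System.out.println(decode(sjisBytes, "UTF-8").equals(original));
    }
}
```

The bytes on disk are the same in both calls; only the charset chosen by the reader differs, which is why a wrong or missing encoding guess leaves the content unusable.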

log.txt:

14:13:48,957 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] Mapping [1/doc.json] already exists
14:13:48,957 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] Mapping [1/folder.json] already exists
14:13:48,957 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] Mapping [1/_settings.json] already exists
14:13:48,957 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/doc.json] already exists
14:13:48,957 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/folder.json] already exists
14:13:48,957 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/_settings.json] already exists
14:13:48,957 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/doc.json] already exists
14:13:48,957 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/folder.json] already exists
14:13:48,957 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/_settings.json] already exists
14:13:48,957 [36mDEBUG[m [f.p.e.c.f.FsCrawler] Cleaning existing status for job [tmp_es_2]...
14:13:48,957 [36mDEBUG[m [f.p.e.c.f.FsCrawler] Starting job [tmp_es_2]...
14:13:49,082 [30mTRACE[m [f.p.e.c.f.FsCrawler] settings used for this crawler: [{
  "name" : "tmp_es_2",
  "fs" : {
    "url" : "D:\\tmp\\ipk\\003\\001\\",
    "update_rate" : "15m",
    "includes" : [ "*.txt" ],
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : false,
    "lang_detect" : true,
    "continue_on_error" : true,
    "pdf_ocr" : false
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "10.83.159.166",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "index" : "tmp_es_2",
    "type" : "doc",
    "bulk_size" : 100,
    "flush_interval" : "5s"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}]
14:13:49,082 [32mINFO [m [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
14:13:49,348 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] findVersion()
14:13:49,426 [30mTRACE[m [f.p.e.c.f.c.ElasticsearchClient] get server response: {name=Betty Brant Leeds, cluster_name=elasticsearch, cluster_uuid=mPLQXllyS5-Kf7GJ5qRR7w, version={number=2.4.1, build_hash=c67dc32e24162035d18d6fe1e952c4cbcbe79d16, build_timestamp=2016-09-27T18:57:55Z, build_snapshot=false, lucene_version=5.5.2}, tagline=You Know, for Search}
14:13:49,426 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] findVersion() -> [2.4.1]
14:13:49,426 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch < 5, so we use [fields] as fields option
14:13:49,426 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch < 5, so we can't use ingest node feature
14:13:49,426 [36mDEBUG[m [f.p.e.c.f.c.BulkProcessor] Creating a bulk processor with size [100], flush [5s], pipeline [null]
14:13:49,442 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] findVersion()
14:13:49,442 [30mTRACE[m [f.p.e.c.f.c.ElasticsearchClient] get server response: {name=Betty Brant Leeds, cluster_name=elasticsearch, cluster_uuid=mPLQXllyS5-Kf7GJ5qRR7w, version={number=2.4.1, build_hash=c67dc32e24162035d18d6fe1e952c4cbcbe79d16, build_timestamp=2016-09-27T18:57:55Z, build_snapshot=false, lucene_version=5.5.2}, tagline=You Know, for Search}
14:13:49,442 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] findVersion() -> [2.4.1]
14:13:49,442 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClientManager] FS crawler connected to an elasticsearch [2.4.1] node.
14:13:49,442 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] create index [tmp_es_2]
14:13:49,442 [30mTRACE[m [f.p.e.c.f.c.ElasticsearchClient] index settings: [{
  "settings": {
    "analysis": {
      "analyzer": {
        "fscrawler_path": {
          "tokenizer": "fscrawler_path"
        }
      },
      "tokenizer": {
        "fscrawler_path": {
          "type": "path_hierarchy"
        }
      }
    }
  }
}
]
14:13:49,442 [30mTRACE[m [f.p.e.c.f.c.ElasticsearchClient] index already exists. Ignoring error...
14:13:49,442 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] is existing type [tmp_es_2]/[doc]
14:13:49,457 [30mTRACE[m [f.p.e.c.f.c.ElasticsearchClient] get index metadata response: {tmp_es_2={aliases={}, mappings={doc={properties={attachment={type=binary}, attributes={properties={group={type=string, index=not_analyzed}, owner={type=string, index=not_analyzed}}}, content={type=string}, file={properties={checksum={type=string, index=not_analyzed}, content_type={type=string, index=not_analyzed}, extension={type=string, index=not_analyzed}, filename={type=string, index=not_analyzed}, filesize={type=long}, indexed_chars={type=long}, indexing_date={type=date, format=dateOptionalTime}, last_modified={type=date, format=dateOptionalTime}, url={type=string, index=no}}}, meta={properties={author={type=string}, date={type=date, format=dateOptionalTime}, keywords={type=string}, language={type=string, index=not_analyzed}, raw={properties={Content-Encoding={type=string}, Content-Type={type=string}, X-Parsed-By={type=string}}}, title={type=string}}}, path={properties={encoded={type=string, index=not_analyzed}, real={type=string, index=not_analyzed, fields={tree={type=string, analyzer=fscrawler_path}}}, root={type=string, index=not_analyzed}, virtual={type=string, index=not_analyzed, fields={tree={type=string, analyzer=fscrawler_path}}}}}}}}, settings={index={creation_date=1498735137548, analysis={analyzer={fscrawler_path={tokenizer=fscrawler_path}}, tokenizer={fscrawler_path={type=path_hierarchy}}}, number_of_shards=5, number_of_replicas=1, uuid=dD5PEzXKTWS9AqQQpM6mxQ, version={created=2040199}}}, warmers={}}}
14:13:49,457 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] Mapping [tmp_es_2]/[doc] already exists.
14:13:49,457 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] creating fs crawler thread [tmp_es_2] for [D:\tmp\ipk\003\001\] every [15m]
14:13:49,457 [32mINFO [m [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [tmp_es_2] for [D:\tmp\ipk\003\001\] every [15m]
14:13:49,457 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] Fs crawler thread [tmp_es_2] is now running. Run #1...
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] indexing [D:\tmp\ipk\003\001\] content
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.f.FileAbstractor] Listing local files from D:\tmp\ipk\003\001\
14:13:49,473 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] Determining 'group' is skipped for file [D:\tmp\ipk\003\001\20161110_20161017_04.doc] on [windows server 2012]
14:13:49,473 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] Determining 'group' is skipped for file [D:\tmp\ipk\003\001\20161110_20161017_shiftjis.txt] on [windows server 2012]
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.f.FileAbstractor] 2 local files found
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_04.doc], includes = [[*.txt]], excludes = [[~*]]
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_04.doc], excludes = [[~*]]
14:13:49,473 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] regex is [~.*?]
14:13:49,473 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] does not match any exclude pattern
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_04.doc], includes = [[*.txt]]
14:13:49,473 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] regex is [.*?.txt]
14:13:49,473 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] does not match any include pattern
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] [20161110_20161017_04.doc] can be indexed: [false]
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl]   - ignored file/dir: 20161110_20161017_04.doc
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_shiftjis.txt], includes = [[*.txt]], excludes = [[~*]]
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_shiftjis.txt], excludes = [[~*]]
14:13:49,473 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] regex is [~.*?]
14:13:49,473 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] does not match any exclude pattern
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_shiftjis.txt], includes = [[*.txt]]
14:13:49,473 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] regex is [.*?.txt]
14:13:49,473 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] does match include regex
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] [20161110_20161017_shiftjis.txt] can be indexed: [true]
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl]   - file: 20161110_20161017_shiftjis.txt
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] fetching content from [D:\tmp\ipk\003\001\],[20161110_20161017_shiftjis.txt]
14:13:49,473 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] computeVirtualPathName(D:\tmp\ipk\003\001\, D:\tmp\ipk\003\001\20161110_20161017_shiftjis.txt) = 20161110_20161017_shiftjis.txt
14:13:49,473 [30mTRACE[m [f.p.e.c.f.t.TikaDocParser] Generating document [20161110_20161017_shiftjis.txt]
14:13:49,488 [30mTRACE[m [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
14:13:49,754 [33mWARN [m [o.a.t.p.i.ImageParser] JBIG2ImageReader not loaded. jbig2 files will be ignored
14:13:49,973 [30mTRACE[m [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
14:13:49,973 [30mTRACE[m [f.p.e.c.f.t.TikaDocParser] Listing all available metadata:
14:13:49,973 [30mTRACE[m [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("X-Parsed-By", "org.apache.tika.parser.EmptyParser"));
14:13:49,973 [30mTRACE[m [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Content-Type", "application/octet-stream"));
14:13:50,942 [30mTRACE[m [f.p.e.c.f.t.TikaDocParser] Main detected language: [: NONE (0.000000)]
14:13:50,942 [30mTRACE[m [f.p.e.c.f.t.TikaDocParser] End document generation
14:13:50,957 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] Indexing in ES tmp_es_2, doc, 3f75d41043ad2daf532e1ae3c9bf7cb
14:13:50,957 [30mTRACE[m [f.p.e.c.f.FsCrawlerImpl] JSon indexed : {
  "meta" : {
    "raw" : {
      "X-Parsed-By" : "org.apache.tika.parser.EmptyParser",
      "Content-Type" : "application/octet-stream"
    }
  },
  "file" : {
    "extension" : "txt",
    "content_type" : "application/octet-stream",
    "last_modified" : "2017-07-06T05:02:16.332+0000",
    "indexing_date" : "2017-07-06T05:13:49.473+0000",
    "filesize" : 378,
    "filename" : "20161110_20161017_shiftjis.txt",
    "url" : "file://D:\\tmp\\ipk\\003\\001\\20161110_20161017_shiftjis.txt"
  },
  "path" : {
    "root" : "f6ab773b58e15c04779221b211abb81",
    "virtual" : "20161110_20161017_shiftjis.txt",
    "real" : "D:\\tmp\\ipk\\003\\001\\20161110_20161017_shiftjis.txt"
  }
}
14:13:50,973 [36mDEBUG[m [f.p.e.c.f.c.BulkProcessor] {"index":{"_index":"tmp_es_2","_type":"doc","_id":"3f75d41043ad2daf532e1ae3c9bf7cb"}}
{
  "meta" : {
    "raw" : {
      "X-Parsed-By" : "org.apache.tika.parser.EmptyParser",
      "Content-Type" : "application/octet-stream"
    }
  },
  "file" : {
    "extension" : "txt",
    "content_type" : "application/octet-stream",
    "last_modified" : "2017-07-06T05:02:16.332+0000",
    "indexing_date" : "2017-07-06T05:13:49.473+0000",
    "filesize" : 378,
    "filename" : "20161110_20161017_shiftjis.txt",
    "url" : "file://D:\\tmp\\ipk\\003\\001\\20161110_20161017_shiftjis.txt"
  },
  "path" : {
    "root" : "f6ab773b58e15c04779221b211abb81",
    "virtual" : "20161110_20161017_shiftjis.txt",
    "real" : "D:\\tmp\\ipk\\003\\001\\20161110_20161017_shiftjis.txt"
  }
}
14:13:50,973 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] Looking for removed files in [D:\tmp\ipk\003\001\]...
14:13:50,973 [30mTRACE[m [f.p.e.c.f.FsCrawlerImpl] Querying elasticsearch for files in dir [path.root:f6ab773b58e15c04779221b211abb81]
14:13:50,973 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] search [tmp_es_2]/[doc], request [SearchRequest{query=path.root:f6ab773b58e15c04779221b211abb81, fields=[_source, file.filename], size=10000}]
14:13:50,988 [30mTRACE[m [f.p.e.c.f.c.ElasticsearchClient] search response: SearchResponse{hits=Hits{hits=[Hit{index=tmp_es_2, type=doc, id=87577342e971d94ae7d5ba5be38656e, version=null, source={meta={raw={X-Parsed-By=org.apache.tika.parser.EmptyParser, Content-Type=application/octet-stream}}, file={extension=txt, content_type=application/octet-stream, last_modified=2017-07-03T05:03:09.678+0000, indexing_date=2017-07-06T04:22:59.065+0000, filesize=2474, filename=20161110_20161017_euc.txt, url=file://D:\tmp\ipk\003\001\20161110_20161017_euc.txt}, path={root=f6ab773b58e15c04779221b211abb81, virtual=20161110_20161017_euc.txt, real=D:\tmp\ipk\003\001\20161110_20161017_euc.txt}}, fields={file.filename=[20161110_20161017_euc.txt]}, highlight=null}, Hit{index=tmp_es_2, type=doc, id=3f75d41043ad2daf532e1ae3c9bf7cb, version=null, source={meta={raw={X-Parsed-By=org.apache.tika.parser.EmptyParser, Content-Type=application/octet-stream}}, file={extension=txt, content_type=application/octet-stream, last_modified=2017-07-03T04:30:00.845+0000, indexing_date=2017-07-06T04:23:00.674+0000, filesize=2474, filename=20161110_20161017_shiftjis.txt, url=file://D:\tmp\ipk\003\001\20161110_20161017_shiftjis.txt}, path={root=f6ab773b58e15c04779221b211abb81, virtual=20161110_20161017_shiftjis.txt, real=D:\tmp\ipk\003\001\20161110_20161017_shiftjis.txt}}, fields={file.filename=[20161110_20161017_shiftjis.txt]}, highlight=null}], total=2}, aggregations=null}
14:13:50,988 [30mTRACE[m [f.p.e.c.f.FsCrawlerImpl] Response [SearchResponse{hits=Hits{hits=[Hit{index=tmp_es_2, type=doc, id=87577342e971d94ae7d5ba5be38656e, version=null, source={meta={raw={X-Parsed-By=org.apache.tika.parser.EmptyParser, Content-Type=application/octet-stream}}, file={extension=txt, content_type=application/octet-stream, last_modified=2017-07-03T05:03:09.678+0000, indexing_date=2017-07-06T04:22:59.065+0000, filesize=2474, filename=20161110_20161017_euc.txt, url=file://D:\tmp\ipk\003\001\20161110_20161017_euc.txt}, path={root=f6ab773b58e15c04779221b211abb81, virtual=20161110_20161017_euc.txt, real=D:\tmp\ipk\003\001\20161110_20161017_euc.txt}}, fields={file.filename=[20161110_20161017_euc.txt]}, highlight=null}, Hit{index=tmp_es_2, type=doc, id=3f75d41043ad2daf532e1ae3c9bf7cb, version=null, source={meta={raw={X-Parsed-By=org.apache.tika.parser.EmptyParser, Content-Type=application/octet-stream}}, file={extension=txt, content_type=application/octet-stream, last_modified=2017-07-03T04:30:00.845+0000, indexing_date=2017-07-06T04:23:00.674+0000, filesize=2474, filename=20161110_20161017_shiftjis.txt, url=file://D:\tmp\ipk\003\001\20161110_20161017_shiftjis.txt}, path={root=f6ab773b58e15c04779221b211abb81, virtual=20161110_20161017_shiftjis.txt, real=D:\tmp\ipk\003\001\20161110_20161017_shiftjis.txt}}, fields={file.filename=[20161110_20161017_shiftjis.txt]}, highlight=null}], total=2}, aggregations=null}]
14:13:50,988 [30mTRACE[m [f.p.e.c.f.FsCrawlerImpl] Checking file [20161110_20161017_euc.txt]
14:13:50,988 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_euc.txt], includes = [[*.txt]], excludes = [[~*]]
14:13:50,988 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_euc.txt], excludes = [[~*]]
14:13:50,988 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] regex is [~.*?]
14:13:50,988 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] does not match any exclude pattern
14:13:50,988 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_euc.txt], includes = [[*.txt]]
14:13:50,988 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] regex is [.*?.txt]
14:13:51,004 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] does match include regex
14:13:51,004 [30mTRACE[m [f.p.e.c.f.FsCrawlerImpl] Removing file [20161110_20161017_euc.txt] in elasticsearch
14:13:51,004 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] Deleting from ES tmp_es_2, doc, 87577342e971d94ae7d5ba5be38656e
14:13:51,004 [36mDEBUG[m [f.p.e.c.f.c.BulkProcessor] {"delete":{"_index":"tmp_es_2","_type":"doc","_id":"87577342e971d94ae7d5ba5be38656e"}}
14:13:51,004 [30mTRACE[m [f.p.e.c.f.FsCrawlerImpl] Checking file [20161110_20161017_shiftjis.txt]
14:13:51,004 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_shiftjis.txt], includes = [[*.txt]], excludes = [[~*]]
14:13:51,004 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_shiftjis.txt], excludes = [[~*]]
14:13:51,004 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] regex is [~.*?]
14:13:51,004 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] does not match any exclude pattern
14:13:51,004 [36mDEBUG[m [f.p.e.c.f.u.FsCrawlerUtil] filename = [20161110_20161017_shiftjis.txt], includes = [[*.txt]]
14:13:51,004 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] regex is [.*?.txt]
14:13:51,004 [30mTRACE[m [f.p.e.c.f.u.FsCrawlerUtil] does match include regex
14:13:51,004 [32mINFO [m [f.p.e.c.f.FsCrawlerImpl] FS crawler is stopping after 1 run
14:13:51,051 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [tmp_es_2]
14:13:51,051 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
14:13:51,051 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] FS crawler Rest service stopped
14:13:51,051 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
14:13:51,051 [36mDEBUG[m [f.p.e.c.f.c.BulkProcessor] Closing BulkProcessor
14:13:51,051 [36mDEBUG[m [f.p.e.c.f.c.BulkProcessor] BulkProcessor is now closed
14:13:51,051 [36mDEBUG[m [f.p.e.c.f.c.BulkProcessor] Executing [2] remaining actions
14:13:51,051 [36mDEBUG[m [f.p.e.c.f.c.BulkProcessor] Going to execute new bulk composed of 2 actions
14:13:51,051 [30mTRACE[m [f.p.e.c.f.c.ElasticsearchClient] going to send a bulk
14:13:51,051 [30mTRACE[m [f.p.e.c.f.c.ElasticsearchClient] {"index":{"_index":"tmp_es_2","_type":"doc","_id":"3f75d41043ad2daf532e1ae3c9bf7cb"}}
{
  "meta" : {
    "raw" : {
      "X-Parsed-By" : "org.apache.tika.parser.EmptyParser",
      "Content-Type" : "application/octet-stream"
    }
  },
  "file" : {
    "extension" : "txt",
    "content_type" : "application/octet-stream",
    "last_modified" : "2017-07-06T05:02:16.332+0000",
    "indexing_date" : "2017-07-06T05:13:49.473+0000",
    "filesize" : 378,
    "filename" : "20161110_20161017_shiftjis.txt",
    "url" : "file://D:\\tmp\\ipk\\003\\001\\20161110_20161017_shiftjis.txt"
  },
  "path" : {
    "root" : "f6ab773b58e15c04779221b211abb81",
    "virtual" : "20161110_20161017_shiftjis.txt",
    "real" : "D:\\tmp\\ipk\\003\\001\\20161110_20161017_shiftjis.txt"
  }
}
{"delete":{"_index":"tmp_es_2","_type":"doc","_id":"87577342e971d94ae7d5ba5be38656e"}}

14:13:51,145 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] bulk response: BulkResponse{items=[BulkItemTopLevelResponse{index=BulkItemResponse{failed=false, index='tmp_es_2', type='doc', id='3f75d41043ad2daf532e1ae3c9bf7cb', opType=null, failureMessage='null'}, delete=null}, BulkItemTopLevelResponse{index=null, delete=BulkItemResponse{failed=false, index='tmp_es_2', type='doc', id='87577342e971d94ae7d5ba5be38656e', opType=null, failureMessage='null'}}]}
14:13:51,145 [36mDEBUG[m [f.p.e.c.f.c.BulkProcessor] Executed bulk composed of 2 actions
14:13:51,145 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
14:13:51,145 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] REST client closed
14:13:51,145 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
14:13:51,145 [32mINFO [m [f.p.e.c.f.FsCrawlerImpl] FS crawler [tmp_es_2] stopped
14:13:51,145 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [tmp_es_2]
14:13:51,145 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
14:13:51,145 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] FS crawler Rest service stopped
14:13:51,145 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
14:13:51,145 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
14:13:51,145 [36mDEBUG[m [f.p.e.c.f.c.ElasticsearchClient] REST client closed
14:13:51,145 [36mDEBUG[m [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
14:13:51,145 [32mINFO [m [f.p.e.c.f.FsCrawlerImpl] FS crawler [tmp_es_2] stopped

20161110_20161017_shiftjis.txt

dadoonet commented 7 years ago

Tika is supposed to detect the encoding automatically using the EncodingDetector.

Not sure why it does not work in this context. Maybe this is something to report to the Tika project itself?

Maybe @tballison has an idea?

tballison commented 7 years ago

Encoding detection is not perfect, nor is mime id; and given that this is getting passed to EmptyParser, I'm guessing that's where the failure is.

If you're able to share a file publicly, please post on our JIRA. If you can share it personally: tallison [AT] apache [DOT] org.

dadoonet commented 7 years ago

Thanks @tballison. The file is public as he uploaded it at https://github.com/dadoonet/fscrawler/files/1127034/20161110_20161017_shiftjis.txt

tballison commented 7 years ago

https://issues.apache.org/jira/browse/TIKA-2437

tballison commented 7 years ago

Does a file name exist in the application, and does fscrawler pass the file name in to Tika? For example:

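    // Runs inside a Tika unit test; getXML is a helper from Tika's TikaTest base class.
    Metadata metadata = new Metadata();
    // Pass the file name as a hint so detection can use the .txt suffix.
    metadata.set(Metadata.RESOURCE_NAME_KEY, "testTXT_shiftjis.txt");

    System.out.println(getXML("testTXT_shiftjis.txt", metadata).xml);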

When you add the filename, Tika correctly parses the file.

dadoonet commented 7 years ago

Interesting. No, I don't do that. I will fix it then. Thanks a lot!

tballison commented 7 years ago

Great. Trying to figure out whether something is binary or text is actually kind of hard. If you can generally trust the file suffixes, and if they actually exist, then that is the best route!
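The idea of using the file suffix as a hint when the bytes alone are inconclusive can be sketched with a toy detector. This is not how Tika works internally: the class `SuffixHintDemo`, its suffix table, and its ASCII-only content check are all hypothetical simplifications made up for this illustration.

```java
import java.util.Locale;

public class SuffixHintDemo {
    // Naive content check (an assumption of this sketch): non-empty data counts
    // as text only if every byte is printable ASCII or common whitespace.
    static boolean looksLikeAsciiText(byte[] head) {
        if (head == null || head.length == 0) {
            return false;
        }
        for (byte b : head) {
            int v = b & 0xFF;
            boolean whitespace = v == '\t' || v == '\n' || v == '\r';
            boolean printable = v >= 0x20 && v < 0x7F;
            if (!whitespace && !printable) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical suffix table; a real detector uses a far larger registry.
    static String typeForSuffix(String suffix) {
        switch (suffix) {
            case "txt":
                return "text/plain";
            case "json":
                return "application/json";
            default:
                return null;
        }
    }

    // Try the content first, then fall back to the file name hint if one is given.
    public static String guessType(String fileName, byte[] head) {
        if (looksLikeAsciiText(head)) {
            return "text/plain";
        }
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0 && dot < fileName.length() - 1) {
                String hinted = typeForSuffix(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
                if (hinted != null) {
                    return hinted;
                }
            }
        }
        // Nothing recognized: report generic binary data.
        return "application/octet-stream";
    }
}
```

With Shift-JIS bytes and no file name, this toy detector falls through to `application/octet-stream`, which mirrors the symptom in the log above; with the same bytes plus a name ending in `.txt`, it returns `text/plain`.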

dadoonet / fscrawler

Can FSCrawler support Text file encoding non UTF-8 (Shift-JIS)? #400