Closed 7 years ago
Tika is supposed to detect the encoding automatically using its EncodingDetector.
I'm not sure why it does not work in this context. Maybe this is something to report to the Tika project itself?
Maybe @tballison has an idea?
Encoding detection is not perfect, nor is MIME type detection; given that this file is getting passed to EmptyParser, I'm guessing the MIME type detection is where the failure is.
If you're able to share a file publicly, please post on our JIRA. If you can share it personally: tallison [AT] apache [DOT] org.
Thanks @tballison. The file is public as he uploaded it at https://github.com/dadoonet/fscrawler/files/1127034/20161110_20161017_shiftjis.txt
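Does the application have the file name, and does fscrawler pass that file name to Tika? For example:
// Metadata is org.apache.tika.metadata.Metadata; the resource name gives the detector a file-suffix hint.
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, "testTXT_shiftjis.txt");
// getXML is a helper from Tika's own test base class (TikaTest), not part of fscrawler.
System.out.println(getXML("testTXT_shiftjis.txt", metadata).xml);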
When you add the filename, Tika correctly parses the file.
Interesting. No, I don't do that. I will fix it then. Thanks a lot!
Great. Trying to figure out whether something is binary or text is actually kind of hard. If you can generally trust the file suffixes, and if they actually exist, then that is the best route!
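As a rough illustration of the "trust the suffix first" idea above, here is a minimal Java sketch. It is not Tika's detection algorithm and not fscrawler's code: the class `SuffixFirstDetector` and its methods are hypothetical, and the content fallback (a strict UTF-8 decode) is a deliberately crude assumption. It only shows why Shift-JIS bytes with no file name can end up as `application/octet-stream`, while the same bytes with a `.txt` name are treated as text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of Tika or fscrawler. */
final class SuffixFirstDetector {

    /** Returns true if the bytes decode in the given charset without any error. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Uses the file suffix when one is available, otherwise falls back to a crude content check. */
    static String guessContentType(String fileName, byte[] data) {
        if (fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".txt")) {
            return "text/plain"; // the suffix is trusted
        }
        if (data.length > 0 && decodesCleanly(data, StandardCharsets.UTF_8)) {
            return "text/plain"; // assumption: clean UTF-8 is treated as text
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        // The three characters U+65E5 U+672C U+8A9E encoded as Shift-JIS.
        byte[] sjis = {(byte) 0x93, (byte) 0xFA, (byte) 0x96, 0x7B, (byte) 0x8C, (byte) 0xEA};
        System.out.println(guessContentType(null, sjis));
        System.out.println(guessContentType("20161110_20161017_shiftjis.txt", sjis));
    }
}
```

A real detector weighs many more signals than this; the point is only that the file name is cheap, strong evidence when it exists.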
Hi. I'm running fscrawler on Windows Server 2012 R2 (Japanese version, default encoding MS932). When fscrawler parses a text file in a non-UTF-8 encoding (Shift-JIS (MS932), and so on) through the Apache Tika library, Tika selects EmptyParser and sets the content type to application/octet-stream. As a result, fscrawler cannot extract the text content of the file.
I added the JVM options "-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8" and tried again. Tika App (tika-app-1.15.jar) selects TxtParser and reports Content-Encoding Shift-JIS.
Can FSCrawler handle text files in non-UTF-8 encodings? Is there a setting for this? Thank you.
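For context, the small Java sketch below shows why the charset used for decoding matters for a Shift-JIS file. It uses only the JDK and says nothing about what fscrawler or Tika do internally; the class name `ShiftJisDecodeDemo` is made up for this example, and it assumes the JDK in use ships the `Shift_JIS` charset (standard full JDKs do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustration only: the same bytes decoded with two different charsets. */
final class ShiftJisDecodeDemo {

    /** Decodes raw bytes with an explicitly named charset instead of the JVM default. */
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Three Japanese characters (U+65E5 U+672C U+8A9E) encoded as Shift-JIS.
        byte[] sjis = {(byte) 0x93, (byte) 0xFA, (byte) 0x96, 0x7B, (byte) 0x8C, (byte) 0xEA};

        String asSjis = decode(sjis, "Shift_JIS");
        String asUtf8 = new String(sjis, StandardCharsets.UTF_8);

        // Decoding with the right charset recovers the original text.
        System.out.println(asSjis.equals("\u65e5\u672c\u8a9e"));
        // Decoding as UTF-8 yields U+FFFD replacement characters instead.
        System.out.println(asUtf8.indexOf('\uFFFD') >= 0);
    }
}
```

Setting `-Dfile.encoding` changes the JVM's default charset, not the bytes in the file, so content in another encoding still has to be detected or named explicitly.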
log.txt
:20161110_20161017_shiftjis.txt