Open aram535 opened 4 years ago
Could you try changing this setting to -1? https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#extracted-characters
Set it to -1: "Failed to extract [-1] characters of text for .... "
I also tried setting the value to 100%: "Failed to extract [25376] characters of text for ... "
Oddly enough it's only some files .... they're all the same permissions, same owner/group (same user as the one running the app), no ACLs. Even SELinux is disabled on this system.
EDIT: Some of the files are quite small, so it's not a size problem:
08:13:15,135 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [9023] characters of text for [/fs/archive/files/33213.txt] -> XML parse error -> The markup in the document following the root element must be well-formed.
08:13:15,135 WARN [f.p.e.c.f.FsParserAbstract] trying to add new file while closing crawler. Document [idx]/[f3211c7cacf16e729944445afff9242b] has been ignored
^C
$ wc -l /fs/archive/files/33213.txt
143 /fs/archive/files/33213.txt
$ wc -c /fs/archive/files/33213.txt
9023 /fs/archive/files/33213.txt
EDIT #2: There might be something weird with the "content" of the file. I think it's interpreting the email headers in the beginning of the file as "XML". I removed all the headers from the file and restarted crawler and the WARN disappeared.
Could you run with the --debug option?
It is Tika that's doing it .... now why is Tika looking at a foo.txt with text inside and seeing XML I have no idea. Doesn't the skip_tika option completely skip using tika to identify a file?
09:03:34,379 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [8690] characters of text for .....
org.apache.tika.exception.TikaException: XML parse error
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:81) ~[tika-parsers-1.24.1.jar:1.24.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.24.1.jar:1.24.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.24.1.jar:1.24.1]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.24.1.jar:1.24.1]
In the meantime I'll download Tika as a standalone and see what it's doing ...
Okay, 100% it is Tika that's doing it ...
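import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;

public class TikaTest {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get(args[0]);
        File f = path.toFile();
        TikaConfig tika = new TikaConfig();
        Metadata metadata = new Metadata();
        // Give the detector the file name as a hint, in addition to the content
        metadata.set(Metadata.RESOURCE_NAME_KEY, f.toString());
        System.out.println("File " + f + " is " + tika.getDetector().detect(TikaInputStream.get(f), metadata));
    }
}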
... 1.txt is the plain email with header ... 2.txt is just the body of the email (removed the top 12 lines)
$ java -cp .:./TikaTest.class:./lib/tika-core-1.24.1.jar TikaTest ./1.txt
File ./1.txt is application/xml
$ java -cp .:./TikaTest.class:./lib/tika-core-1.24.1.jar TikaTest ./2.txt
File ./2.txt is text/plain
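For anyone who wants to repeat the 1.txt versus 2.txt comparison on other affected files, here is a minimal sketch of the header-stripping step. It assumes the header block ends at the first blank line (the usual email convention), which is not necessarily the same as "the top 12 lines" for every file. The class name StripHeaders and the helper body are made up for illustration and are not part of FSCrawler or Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class StripHeaders {
    // Returns the lines after the first blank line, i.e. the message body.
    // Assumption: headers and body are separated by one blank line.
    // If no blank line is found, the input is returned unchanged.
    static List<String> body(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).trim().isEmpty()) {
                return lines.subList(i + 1, lines.size());
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);   // e.g. 1.txt (with headers)
        Path out = Paths.get(args[1]);  // e.g. 2.txt (body only)
        // ISO-8859-1 maps every byte to a character, so reading never fails
        // on files whose real encoding is unknown.
        List<String> lines = Files.readAllLines(in, StandardCharsets.ISO_8859_1);
        Files.write(out, body(lines), StandardCharsets.ISO_8859_1);
    }
}
```

Running it as `java StripHeaders 1.txt 2.txt` and then feeding both files to TikaTest should show whether the header block alone is what changes the detected type.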
We can turn this into a bug report: "skip_tika" is being ignored in the configuration. I'll take the misdetection itself up with the Tika project.
Describe the bug
While scanning text files, the app rejects some of them as badly formed XML. I figured Tika must be misinterpreting the file, so I tried to turn Tika off with the "skip_tika" setting according to #846, but that did not work (I restarted the scan with --restart). The affected file is not fully searchable in Kibana and is not included in the result set. (Tested with a small file sample.)
Job Settings
Logs
Expected behavior
The full content of file should be indexed and searchable.
Versions: