dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

skip_tika option in job configuration is ignored #979

Open aram535 opened 4 years ago

aram535 commented 4 years ago

Describe the bug

While scanning text files, the app rejects a file as invalid, badly formatted XML. I figured it must be Tika misinterpreting the file, so I tried to turn Tika off with the "skip_tika" configuration option per #846, but that did not work (I restarted the scan with --restart). The file is not fully searchable in Kibana and is not included in the result set (tested with a small file sample).

Job Settings

name: "idx"
fs:
  url: "/fs/archive/files/"
  update_rate: "60m"
  includes:
    - "*/*.txt"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: true
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
  skip_tika: true
elasticsearch:
  nodes:
    - url: "http://192.168.1.5:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

Logs

06:05:25,072 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/fs/archive/files/13613.txt]  -> XML parse error -> The markup in the document following the root element must be well-formed.

Expected behavior

The full content of the file should be indexed and searchable.

Versions:

dadoonet commented 4 years ago

Could you try changing this setting to -1? https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#extracted-characters
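
For example, something like this in the job settings (assuming the indexed_chars setting described on that page):

fs:
  indexed_chars: -1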

aram535 commented 4 years ago

Set it to -1: "Failed to extract [-1] characters of text for .... "

I also tried setting the value to 100%: "Failed to extract [25376] characters of text for ... "

Oddly enough it's only some files. They all have the same permissions and the same owner/group (the same user that runs the app), no ACLs, and SELinux is even disabled on this system.

EDIT: Some of the files are quite small, so it's not a size problem:

08:13:15,135 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [9023] characters of text for [/fs/archive/files/33213.txt]  -> XML parse error -> The markup in the document following the root element must be well-formed.
08:13:15,135 WARN  [f.p.e.c.f.FsParserAbstract] trying to add new file while closing crawler. Document [idx]/[f3211c7cacf16e729944445afff9242b] has been ignored
^C
$ wc -l /fs/archive/files/33213.txt
143 /fs/archive/files/33213.txt
$ wc -c /fs/archive/files/33213.txt
9023 /fs/archive/files/33213.txt

EDIT #2: There might be something weird with the "content" of the file. I think it's interpreting the email headers at the beginning of the file as "XML". I removed all the headers from the file, restarted the crawler, and the WARN disappeared.

dadoonet commented 4 years ago

Could you run with the --debug option?

aram535 commented 4 years ago

It is Tika that's doing it. Why Tika looks at a foo.txt containing plain text and sees XML, I have no idea. Doesn't the skip_tika option completely skip using Tika to identify a file?

09:03:34,379 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [8690] characters of text for .....
org.apache.tika.exception.TikaException: XML parse error
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:81) ~[tika-parsers-1.24.1.jar:1.24.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.24.1.jar:1.24.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.24.1.jar:1.24.1]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.24.1.jar:1.24.1]

In the meantime I'll download Tika as a standalone and see what it's doing ...
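
A quick way to check from the command line, assuming the standalone tika-app jar (e.g. tika-app-1.24.1.jar) is downloaded, should be its --detect option:

$ java -jar tika-app-1.24.1.jar --detect ./1.txt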

aram535 commented 4 years ago

Okay, 100% it is Tika that's doing it ...

import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;

public class TikaTest {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get(args[0]);
        File f = path.toFile();

        // Run Tika's type detection the same way AutoDetectParser does:
        // pass the stream plus the file name as a metadata hint.
        TikaConfig tika = new TikaConfig();
        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, f.toString());
        System.out.println("File " + f + " is " + tika.getDetector().detect(TikaInputStream.get(f), metadata));
    }
}

1.txt is the plain email with headers; 2.txt is just the body of the email (the top 12 lines removed):

$ java -cp .:./TikaTest.class:./lib/tika-core-1.24.1.jar TikaTest ./1.txt
File ./1.txt is application/xml
$ java -cp .:./TikaTest.class:./lib/tika-core-1.24.1.jar TikaTest ./2.txt
File ./2.txt is text/plain

We can turn this into a bug report that the "skip_tika" option is being ignored in the configuration. I'll work with the Tika project to figure out where the problem is on their side.
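
For reference, here is a minimal standalone sketch that bypasses auto-detection and forces Tika's plain-text parser, which should sidestep the XML misdetection for these files. It assumes tika-parsers 1.24.1 (and its dependencies) on the classpath, the class name is just illustrative, and it is not how FSCrawler implements skip_tika internally:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;

public class ForceTxtParser {
    public static void main(String[] args) throws Exception {
        // Skip AutoDetectParser entirely and treat the input as plain text,
        // no matter what Tika's type detection says it is.
        TXTParser parser = new TXTParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no character limit
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        System.out.println(handler.toString());
    }
}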