Unable to extract small documents when idol-committer is installed

Norconex / committer-idol

Autonomy IDOL implementation of Norconex Committer.

https://opensource.norconex.com/committers/idol/

Apache License 2.0


Unable to extract small documents when idol-committer is installed #1

Closed: jnsjak closed this issue 5 years ago

jnsjak commented 5 years ago

When we use the idol-committer, it seems unable to extract information from documents smaller than around 8 KB. For small documents, neither metadata nor content is extracted: only the metadata representing the crawler itself is available in the -add.meta files, and the cntnt file is empty. Larger pages are OK.

For us, this happens on version 2.8.0 of the http-collector and above, and only once the idol-committer is installed. It happens regardless of whether the file committer or the idol committer is used as the target. Extraction works correctly on v. 2.7.1.

essiembre commented 5 years ago

It should work with files of all sizes. Can you please share a config to reproduce the issue? Also, have you tried with 2.8.1?

jnsjak commented 5 years ago

Hi.

When I use the attached test.xml config (test.xml.txt) in a clean 2.8.1 collector, content is extracted as expected for both small and larger pages.

When the IDOL committer jar files are copied to the lib folder of the 2.8.1 collector and the job is rerun, no content is extracted for the smaller document in the URL list: the cntnt file is empty and the meta file contains only metadata about the crawler (see 1559155488271000000-add.meta.txt). The larger document is extracted correctly.

I've tried various experiments and settings, but the only pattern I've been able to find is the page size.

essiembre commented 5 years ago

To be sure: you are using the FileSystemCommitter in your config (and not the IDOL Committer), and just adding the IDOL Committer jar causes this issue? Or is it when you also change your config to use IDOLCommitter?

If the first, I would check for duplicate jars under the lib folder (different versions of the same jars).
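
A quick way to spot such duplicates is a small script along these lines. This is only a sketch, not part of any Norconex tooling: it assumes jars follow the usual `<artifact>-<version>.jar` naming with a version that starts with a digit, and the function name `find_duplicate_jars` is made up for illustration. Jars with unusual names (for example, a digit-led segment in the artifact name) may be grouped incorrectly.

```python
import re
from collections import defaultdict
from pathlib import Path

# Heuristic split of "<artifact>-<version>.jar"; the version is assumed
# to begin at the first "-" that is followed by a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicate_jars(lib_dir):
    """Return {artifact: sorted version strings} for every artifact
    that appears in lib_dir under more than one version."""
    versions = defaultdict(set)
    for jar in Path(lib_dir).glob("*.jar"):
        match = JAR_NAME.match(jar.name)
        if match:  # jars without a recognizable version are skipped
            versions[match["artifact"]].add(match["version"])
    return {
        artifact: sorted(found)
        for artifact, found in versions.items()
        if len(found) > 1
    }
```

Calling `find_duplicate_jars("lib")` from the collector's install directory returns a dictionary of artifacts that are present in more than one version; an empty dictionary means no duplicates were detected by this heuristic.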

jnsjak commented 5 years ago

The first. I've now checked all the jar files, and the cause turned out to be duplicate versions of norconex-commons-lang (both 1.13 and 1.15 were present). I've cleaned out the duplicates and that fixed it. Thanks for your support!