Closed jnsjak closed 5 years ago
It should work with files of all sizes. Can you please share a config to reproduce the issue? Also, have you tried with 2.8.1?
Hi.
When I use the attached test.xml config (test.xml.txt) in a clean 2.8.1 collector, content is extracted as expected for both small and larger pages.
When the IDOL committer jar files are copied to the lib folder of the 2.8.1 collector and the job is rerun, no content is extracted for the smaller document in the URL list: the cnt-file is empty and the meta file (1559155488271000000-add.meta.txt) contains only metadata concerning the crawler. The larger document is extracted correctly.
I've tried various experiments and settings, but the only pattern I've been able to find is the page size.
To be sure: you are using the FileSystemCommitter in your config (and not the IDOL Committer), and just adding the IDOL Committer jar causes this issue? Or is it when you also change your config to use IDOLCommitter?
If the first, I would check for duplicate jars under the lib folder (different versions of the same jars).
The first. I've now checked all the jar files, and the cause turned out to be duplicate versions of norconex-commons-lang (both 1.13 and 1.15 were present). I've cleaned out the duplicates and that fixed it. Thanks for your support!
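For anyone hitting the same symptom, the duplicate check described above can be automated. The following is only a minimal sketch, not part of the collector or committer: the function name `find_duplicate_jars` is hypothetical, and it assumes jar files follow the common `name-version.jar` naming pattern, so unusual file names may be missed or grouped incorrectly.

```python
import re
from collections import defaultdict
from pathlib import Path

# Heuristic split of "name-1.2.3.jar" into artifact name and version.
# Assumption: the version is the first dash-separated part that starts with a digit.
JAR_PATTERN = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicate_jars(lib_dir):
    """Return {artifact name: sorted versions} for artifacts present in more than one version."""
    versions = defaultdict(set)
    for jar in Path(lib_dir).glob("*.jar"):
        match = JAR_PATTERN.match(jar.name)
        if match:
            versions[match.group("name")].add(match.group("version"))
    # Keep only artifacts that appear with at least two distinct versions.
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}
```

Calling `find_duplicate_jars("lib")` from the collector's install folder (the path is an assumption about your layout) returns the artifacts that need cleaning up; typically the older version of each is the one to remove.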
When we use the idol-committer, it seems unable to extract information from documents smaller than around 8 KB. For small documents, neither metadata nor content is extracted: only the metadata representing the crawler itself is available in the -add.meta files, and the cnt file is empty. Larger pages are OK.
For us, this happens on version 2.8.0 of the http-collector and above, and only once the idol-committer is installed. It happens regardless of whether the file committer or the idol committer is used as the target. Extraction works correctly on v. 2.7.1.