Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

.cntnt empty and meta tags missing when content is empty or short #600

Closed by SaschaHeyer 5 years ago

SaschaHeyer commented 5 years ago

Hello Pascal,

When crawling pages, on some occasions the .cntnt files are empty and the meta tags are not extracted.

To reproduce the behavior, please have a look at the following files.

It seems that the parsing / extracting is somehow related to the length of the content.

Any suggestions / known issues?

Version used: 2.8.1

Best regards Sascha

essiembre commented 5 years ago

Hello Sascha,

I just tried with the latest snapshot release and it works fine, as you can tell by this attachment: fs-committer-files.zip

So I encourage you to try the latest snapshot, or share your config in case something else is going on that I missed.

SaschaHeyer commented 5 years ago

Hello Pascal,

Thank you for your support. I can confirm the behavior is not caused by Norconex itself. The issue comes from a Committer plugin whose JARs cause a dependency conflict within the lib folder.
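
For anyone hitting similar symptoms, one way to spot this kind of conflict is to look for the same library present in more than one version in the lib folder. The following is only a rough sketch, not something taken from this thread: the function name `find_conflicting_jars`, the `name-version.jar` naming convention, and the folder path are all assumptions, and the file name parsing is a heuristic that can misread unusual artifact names.

```python
import re
from collections import defaultdict
from pathlib import Path

# Heuristic (assumption): jar files are named "<artifact>-<version>.jar",
# where the version starts with a digit, e.g. "commons-lang3-3.8.jar".
JAR_PATTERN = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_conflicting_jars(lib_dir):
    """Return {artifact name: sorted versions} for artifacts found in more than one version."""
    versions = defaultdict(set)
    for jar in Path(lib_dir).glob("*.jar"):
        match = JAR_PATTERN.match(jar.name)
        if match:
            versions[match["name"]].add(match["version"])
    # Keep only artifacts that appear with at least two distinct versions.
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}


if __name__ == "__main__":
    # "lib" is a hypothetical path; point it at the collector's actual lib folder.
    for name, found in find_conflicting_jars("lib").items():
        print(f"{name}: {', '.join(found)}")
```

Any artifact this reports is a candidate for the kind of conflict described above. Typically the older copy gets removed, but which version to keep depends on what the collector and the committer each require.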

Best regards Sascha