Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

StripBetweenTransformer different behavior #66

Closed aleha84 closed 6 years ago

aleha84 commented 6 years ago

I am using the stripBetween transformer in preParseHandlers to delete headers and footers from documents. For most documents this works fine, but on some specific pages the footer is not removed. The markup is identical on all pages.

The Importer version is the latest stable release.

Config

<stripBetween>
    <start><![CDATA[<!-- footer -->]]></start>
    <end><![CDATA[<!-- /footer -->]]></end>
</stripBetween>

Markup

<!-- footer -->
    <div class="page__footer">
            ...
    </div>
<!-- /footer -->

I sent you an email with the config.

essiembre commented 6 years ago

Even though the use case you sent me by email was perfect for reproducing, I could not reproduce your issue. I tried with the exact same config (latest snapshot) and the footer was gone. The only way I could reproduce it was by parsing the file with the Importer directly, since the "Content-Type" field is then not present (it is obtained from HTTP headers) and the "restrictTo" rejects the document.

I am not sure this is related to what you are experiencing, but in either case, I recommend you use document.contentType when referencing the content type, as it is more reliable. It should always be set and will be clean (without the charset info that is sometimes appended).
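To illustrate why a clean content type matters, here is a minimal, standalone Java sketch. It does not use the Importer API: the class and method names are hypothetical, and it assumes, for illustration only, that a restriction pattern has to match the whole field value. Under that assumption, a raw header value with a charset suffix, or a missing field, fails the same check that a clean value passes.

```java
import java.util.regex.Pattern;

final class ContentTypeMatch {

    // Hypothetical helper (not part of Norconex Importer): returns true only
    // when the whole field value matches the regex, ignoring case.
    static boolean matchesWhole(String fieldValue, String regex) {
        if (fieldValue == null) {
            return false; // a missing field can never satisfy the restriction
        }
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                .matcher(fieldValue)
                .matches();
    }

    public static void main(String[] args) {
        String regex = "text/html";
        // Clean value, as document.contentType is described above.
        System.out.println(matchesWhole("text/html", regex));                // true
        // Raw header value with charset info appended.
        System.out.println(matchesWhole("text/html; charset=UTF-8", regex)); // false
        // Field absent, e.g. when a saved file is parsed without HTTP headers.
        System.out.println(matchesWhole(null, regex));                       // false
    }
}
```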

Try with that change and let me know if that solves it.

aleha84 commented 6 years ago

Sorry, but I don't understand the part of your first paragraph about how you could reproduce it. Of course, I will try switching "restrictTo" to "document.contentType". It is strange: the header block is removed in both cases, but the footer is not.

essiembre commented 6 years ago

I could only reproduce it if I removed Content-Type. Did it work for you when you changed it to document.contentType?

aleha84 commented 6 years ago

Changing to <restrictTo caseSensitive="false" field="document.contentType"> did not help. But after updating the Importer to the latest 2.8.0-SNAPSHOT, the footer was gone, so something was fixed in the latest version. How safe is it to use SNAPSHOT versions in production? Are they not fully tested, or do they contain unstable functionality?

essiembre commented 6 years ago

Strange, it works for me with both 2.7.2 and 2.8.0-SNAPSHOT. I compared the code of the two versions of StripBetweenTransformer and could not find significant differences between them (i.e., nothing was fixed). Maybe the problem is caused by something else? Have you tried a minimal config with just the StripBetweenTransformer portion? You can also download the Importer on its own and run it against your saved HTML file to see if that works.
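As a quick sanity check before running the Importer itself, a small standalone program can confirm that the two markers appear verbatim in the saved HTML file and show how much text a strip-between operation would remove. This is only a sketch built on plain java.util.regex: it is not the StripBetweenTransformer implementation, the class name is hypothetical, and it assumes the file is UTF-8 encoded.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

final class FooterCheck {

    // Removes every start...end span, markers included. The markers are quoted
    // so they are treated as literal text; DOTALL lets a span cross line breaks.
    static String stripBetween(String text, String start, String end) {
        Pattern p = Pattern.compile(
                Pattern.quote(start) + ".*?" + Pattern.quote(end),
                Pattern.DOTALL);
        return p.matcher(text).replaceAll("");
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java FooterCheck.java <saved-page.html>");
            return;
        }
        String start = "<!-- footer -->";
        String end = "<!-- /footer -->";
        // Assumption: the saved page is UTF-8 encoded.
        String html = new String(
                Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);

        System.out.println("start marker found: " + html.contains(start));
        System.out.println("end marker found:   " + html.contains(end));
        System.out.println("characters removed: "
                + (html.length() - stripBetween(html, start, end).length()));
    }
}
```

If either marker is reported as missing on a page where the footer survives, the markup on that page differs from the configured start or end text (for example in spacing or letter case), which would point away from the transformer itself.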

Even if snapshots are generally not considered stable, we only release them if they pass all unit tests. Unless you are taking advantage of new features not yet polished, snapshot releases can contain fixes and enhancements to existing features.

aleha84 commented 6 years ago

Version differences, from the production server logs.

Footer ignored.

[non-job]: 2017-10-09 03:00:13 INFO - Version: Norconex HTTP Collector 2.7.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-09 03:00:13 INFO - Version: Norconex Collector Core 1.8.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-09 03:00:13 INFO - Version: Norconex Importer 2.7.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-09 03:00:13 INFO - Version: Norconex JEF 4.1.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-09 03:00:13 INFO - Version: Norconex Committer Core 2.0.6-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-09 03:00:13 INFO - Version: Norconex Committer Elasticsearch 3.0.0-SNAPSHOT (Norconex Inc.)

Footer removed.

[non-job]: 2017-10-17 01:00:04 INFO - Version: Norconex HTTP Collector 2.7.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-17 01:00:04 INFO - Version: Norconex Collector Core 1.8.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-17 01:00:04 INFO - Version: Norconex Importer 2.8.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-17 01:00:04 INFO - Version: Norconex JEF 4.1.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-17 01:00:04 INFO - Version: Norconex Committer Core 2.0.6-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-17 01:00:04 INFO - Version: Norconex Committer Elasticsearch 3.0.0-SNAPSHOT (Norconex Inc.)

The only difference is in the Importer. Locally, I tried the same collector with version 2.7.2, and the footer was ignored. After updating locally to 2.8.0-SNAPSHOT, the footer was removed. Since that helped, I also updated the production version.

I prefer stability, and I do not change library versions without an urgent need.

essiembre commented 6 years ago

Since I cannot reproduce on 2.7.2 and 2.8.0-SNAPSHOT fixes this for you, I will close.

As an alternative, if you want to stick to 2.7.2, you can try ReplaceTransformer or ScriptTransformer in case they work better for you.
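For a regex-based replacement, the footer block could be described with a pattern like the one in the sketch below. This is plain Java, not Importer configuration: the class name is hypothetical, and whether ReplaceTransformer accepts this exact pattern and these inline flags is an assumption to verify against its documentation. Unlike a literal start/end match, the pattern tolerates differences in spacing and letter case inside the comment markers.

```java
import java.util.regex.Pattern;

final class FooterRegex {

    // (?i) ignores case, (?s) lets '.' match line breaks, and \s* tolerates
    // spacing differences inside the HTML comment markers.
    static final Pattern FOOTER = Pattern.compile(
            "(?is)<!--\\s*footer\\s*-->.*?<!--\\s*/footer\\s*-->");

    // Returns the HTML with every footer block (markers included) removed.
    static String removeFooter(String html) {
        return FOOTER.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<body>text<!--footer-->\n"
                + "<div class=\"page__footer\">...</div>\n"
                + "<!-- /FOOTER --></body>";
        System.out.println(removeFooter(html)); // prints <body>text</body>
    }
}
```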