Closed aleha84 closed 6 years ago
Even though the use case you sent me by email was perfect for reproducing, I could not reproduce your issue. I tried with the exact same config (latest snapshot) and the footer was gone. The only way I could reproduce it is by parsing the file with the importer directly, since the "Content-Type" field is then not present (it is obtained from HTTP headers) and the "restrictTo" rejects the document.
I am not sure this is related to what you are experiencing, but in either case, I recommend you use document.contentType when referencing the content type, as it is more reliable. It should always be set and will be clean (without the charset info that is sometimes appended).
Try with that change and let me know if that solves it.
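To illustrate why a raw header value can fail where a clean content type passes, here is a minimal Java sketch. It assumes the restriction is a regular expression that must match the whole field value; that matching rule is an assumption for illustration, not something taken from the importer source, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class ContentTypeMatch {

    // Hypothetical helper: case-insensitive regex match against the
    // entire field value (assumed semantics, not the importer's code).
    static boolean matchesRestriction(String regex, String fieldValue) {
        if (fieldValue == null) {
            return false; // field missing, e.g. no HTTP headers available
        }
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                .matcher(fieldValue)
                .matches();
    }

    public static void main(String[] args) {
        // Clean content type: matches.
        System.out.println(matchesRestriction("text/html", "text/html"));
        // Raw header with charset appended: the full-value match fails.
        System.out.println(matchesRestriction("text/html", "text/html; charset=UTF-8"));
        // Field not present at all: the restriction rejects the document.
        System.out.println(matchesRestriction("text/html", null));
    }
}
```

Under these assumptions, only the first call returns true, which is consistent with the advice to reference a field that is always set and free of charset suffixes.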
Sorry, but I don't understand the part of your first paragraph about how you could reproduce it. Of course, I will try switching "restrictTo" to "document.contentType". Strangely, the header block is removed in both cases, but the footer is not.
I could only reproduce if I remove Content-Type. Did it work for you when you changed it to document.contentType?
Changing to <restrictTo caseSensitive="false" field="document.contentType"> did not help.
But after updating the importer to the latest 2.8.0-SNAPSHOT, the footer was gone. Something was fixed in the latest version.
How safe is it to use SNAPSHOT versions in production? Are they not fully tested, or do they contain unstable functionality?
Strange, it works for me with both 2.7.2 and 2.8.0-SNAPSHOT. I compared the code for the two versions of StripBetweenTransformer and could not find significant differences between the two (e.g., nothing was fixed). Maybe the problem is caused by something else? Have you tried a minimal config with just the StripBetweenTransformer portion? You can also download the importer on its own and try to run it against your saved HTML file to see if that works.
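As a rough complement to testing with the importer itself, the start and end patterns can be checked against the saved page with plain java.util.regex. The sketch below is not StripBetweenTransformer's implementation; it only approximates inclusive, case-insensitive stripping, and the class name, method name, and comment markers in the sample HTML are all hypothetical.

```java
import java.util.regex.Pattern;

public class StripBetweenSketch {

    // Removes every shortest span that begins with a startRegex match and
    // ends with an endRegex match, markers included. Approximation only.
    static String stripBetween(String text, String startRegex, String endRegex) {
        Pattern span = Pattern.compile(
                "(?:" + startRegex + ").*?(?:" + endRegex + ")",
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        return span.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Stand-in for the saved HTML file; the markers are made up.
        String html = "<body><!-- header -->top<!-- /header -->"
                + "<p>Content</p>"
                + "<!-- footer -->bottom<!-- /footer --></body>";

        String noHeader = stripBetween(html, "<!-- header -->", "<!-- /header -->");
        String cleaned = stripBetween(noHeader, "<!-- footer -->", "<!-- /footer -->");
        System.out.println(cleaned);
    }
}
```

In this sketch, a page whose end marker never matches is returned unchanged, so comparing the output for a working page and a failing page can show whether the patterns themselves are the problem.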
Even if snapshots are generally not considered stable, we only release them if they pass all unit tests. Unless you are taking advantage of new features not yet polished, snapshot releases can contain fixes and enhancements to existing features.
Version differences from the production server logs. Footer ignored.
[non-job]: 2017-10-09 03:00:13 INFO - Version: Norconex HTTP Collector 2.7.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-09 03:00:13 INFO - Version: Norconex Collector Core 1.8.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-09 03:00:13 INFO - Version: Norconex Importer 2.7.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-09 03:00:13 INFO - Version: Norconex JEF 4.1.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-09 03:00:13 INFO - Version: Norconex Committer Core 2.0.6-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-09 03:00:13 INFO - Version: Norconex Committer Elasticsearch 3.0.0-SNAPSHOT (Norconex Inc.)
Footer removed.
[non-job]: 2017-10-17 01:00:04 INFO - Version: Norconex HTTP Collector 2.7.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-17 01:00:04 INFO - Version: Norconex Collector Core 1.8.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-17 01:00:04 INFO - Version: Norconex Importer 2.8.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-17 01:00:04 INFO - Version: Norconex JEF 4.1.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-17 01:00:04 INFO - Version: Norconex Committer Core 2.0.6-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-10-17 01:00:04 INFO - Version: Norconex Committer Elasticsearch 3.0.0-SNAPSHOT (Norconex Inc.)
The only difference is in the Importer. Locally I tried the same collector with version 2.7.2, and the footer was ignored. After updating locally to 2.8.0-SNAPSHOT, the footer was removed. Since that helped, I also updated the production version.
I prefer stability and do not change library versions without an urgent need.
Since I cannot reproduce on 2.7.2 and 2.8.0-SNAPSHOT fixes this for you, I will close.
As an alternative, if you want to stick with 2.7.2, you can also try ReplaceTransformer or ScriptTransformer in case they work better for you.
I am using the stripBetween transformer to delete headers and footers from documents in preParseHandlers. For most documents all is fine, but for some specific pages the footer is not removed. The markup is identical on all pages.
The Importer version is the latest stable.
Config
Markup
Sent you an email with the config.