Filters are meant to exclude matching documents, not parts of their content. For what you want to do, you would have better luck with transformers, such as ReplaceTransformer and StripBetweenTransformer.
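For instance, a pre-parse StripBetweenTransformer entry could look roughly like this (the start/end markers are placeholders for whatever wraps the content you want removed):

```xml
<!-- Pre-parse handler: removes everything between the two markers (inclusive).        -->
<!-- The markers are placeholders; with nested tags you may need more precise markers. -->
<transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer"
    inclusive="true">
  <stripBetween>
    <start><![CDATA[<div class="disclaimer">]]></start>
    <end><![CDATA[</div>]]></end>
  </stripBetween>
</transformer>
```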
There are currently no DOM-based transformers, but if you really want to deal with the DOM, you can look at the DOMTagger to see if it can be of any help. That class can extract specific DOM elements and store them as metadata. You could then strip the original content with a SubstringTransformer if you do not want to keep it.
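As a rough sketch (the selector and target field below are made up), a DOMTagger entry would look like this:

```xml
<!-- Pre-parse handler: copies matching DOM elements into a metadata field. -->
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <dom selector="div.disclaimer" toField="disclaimer" />
</tagger>
```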
FYI, there is a feature request for a DOMTransformer here: https://github.com/Norconex/importer/issues/62
Hi Pascal, thank you very much for the quick response. I will check the docs for ReplaceTransformer and StripBetweenTransformer to see how I can achieve my goal.
Just one question before closing this topic: where can I find some information on the difference between <preParseHandlers> and <postParseHandlers>? I'd like to have a very clear view of the step in the parsing process at which they are applied.
Thanks, Mauro
Pre-parse handlers are invoked before the original files are parsed to extract their plain text. You have to be careful which handlers you use there, since you may end up trying to perform text operations on binary files (e.g., PDFs). The <restrictTo> tag available with each handler can help with that. For example, if you need to operate on the raw XML or HTML markup, you would go with pre-parse handlers.
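For instance, restricting a handler to HTML documents would look something like this:

```xml
<!-- Only apply the enclosing handler to HTML documents. -->
<restrictTo field="document.contentType">text/html</restrictTo>
```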
Post-parse handlers are invoked after the original document has been parsed; at that point its content is guaranteed to be plain text (without any formatting). For example, if you want to reject documents that contain the word "potato", regardless of their content type, you would define that under post-parse handlers.
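As a sketch, that could be a RegexContentFilter placed under post-parse handlers (regex kept simple here):

```xml
<!-- Post-parse handler: rejects any document whose extracted text contains "potato". -->
<filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
    onMatch="exclude">
  <regex>.*potato.*</regex>
</filter>
```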
So to over-simplify, you can think of it as:
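raw document → pre-parse handlers (original format) → parsing (text extraction) → post-parse handlers (plain text)

In configuration terms, that maps to the two handler sections of the <importer> block (a minimal sketch):

```xml
<importer>
  <!-- Applied to documents in their original format (HTML/XML markup, PDFs, etc.). -->
  <preParseHandlers>
    <!-- e.g. StripBetweenTransformer, DOMTagger -->
  </preParseHandlers>

  <!-- Applied to the plain text extracted by the parser. -->
  <postParseHandlers>
    <!-- e.g. RegexContentFilter on the extracted text -->
  </postParseHandlers>
</importer>
```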
For a more detailed view of the HTTP Collector execution flow, have a look here.
Clearer?
Just perfect! Thank you so much, really. I will mark this as closed.
Mauro
Hi, I am trying to filter the HTML source to remove all the DIVs I don't need (for example disclaimers, modals, etc.).
I read the documentation at https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/filter/impl/DOMContentFilter.html, where this seems easy to implement using the DOMContentFilter, but in practice, when I add this filter, the crawler skips all the pages.
As I understood it, this filter goes in the <preParseHandlers> section.
so here's an extract of my config:
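(Illustrative extract; the selector below is a placeholder standing in for my real one.)

```xml
<importer>
  <preParseHandlers>
    <!-- Placeholder selector; the real config targets the DIVs I want removed. -->
    <filter class="com.norconex.importer.handler.filter.impl.DOMContentFilter"
        selector="div.disclaimer" onMatch="exclude" />
  </preParseHandlers>
</importer>
```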
Can you see anything wrong here? Why does this make all pages get skipped with REJECTED_IMPORT (com.norconex.importer.response.ImporterResponse@3af2bcdb)?
Many thanks in advance for your support.
Mauro