Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

DOMContentFilter creates REJECTED_IMPORT (com.norconex.importer.response.ImporterResponse@3af2bcdb) #483

Closed mauromi76 closed 6 years ago

mauromi76 commented 6 years ago

Hi, I am trying to filter the HTML source by removing all the DIVs that I don't need (for example disclaimers, modals, etc.).

I read the doc at https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/filter/impl/DOMContentFilter.html where this seems easy to implement using the DOMContentFilter, but in practice, when I add this filter, the crawler skips all the pages.

As I understood it, this filter goes in the <preParseHandlers> section.

So here's an extract of my config:

<preParseHandlers>
    <filter class="com.norconex.importer.handler.filter.impl.DOMContentFilter"  selector="div#cookie-info" onMatch="exclude" />
    <filter class="com.norconex.importer.handler.filter.impl.DOMContentFilter"  selector="div#external-dialog" onMatch="exclude" />
</preParseHandlers>

Can you see anything wrong here? Why does this make all pages get skipped with REJECTED_IMPORT (com.norconex.importer.response.ImporterResponse@3af2bcdb)?

Many thanks in advance for your support, Mauro

essiembre commented 6 years ago

Filters are meant to exclude matching documents, not parts of their content. For what you want to do, you would have better luck with transformers, such as ReplaceTransformer and StripBetweenTransformer.

There are currently no DOM-based transformers, but if you really want to deal with the DOM, you can look at the DOMTagger in case it can be of any help. That class can extract specific DOM elements and store them as metadata. You could then strip the original content with a SubstringTransformer if you do not want to keep it.
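As an illustration, a pre-parse StripBetweenTransformer configuration targeting the two DIVs from the original question could look roughly like this. This is a sketch based on the Importer 2.x documentation; the start/end patterns are treated as regular expressions against the raw markup, so they must match your actual HTML exactly:

```xml
<preParseHandlers>
  <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer"
      inclusive="true" caseSensitive="false">
    <!-- Strips everything between (and including) the start and end patterns. -->
    <stripBetween>
      <start><![CDATA[<div id="cookie-info">]]></start>
      <end><![CDATA[</div>]]></end>
    </stripBetween>
    <stripBetween>
      <start><![CDATA[<div id="external-dialog">]]></start>
      <end><![CDATA[</div>]]></end>
    </stripBetween>
  </transformer>
</preParseHandlers>
```

One caveat with this text-based approach: the end pattern stops at the first matching `</div>`, so if the target DIV contains nested DIVs, the strip may end earlier than intended.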

FYI, there is a feature request for a DOMTransformer here: https://github.com/Norconex/importer/issues/62

mauromi76 commented 6 years ago

Hi Pascal, thank you very much for the quick response. I will check the docs of ReplaceTransformer and StripBetweenTransformer to see how I can achieve my goal.

Just one question before closing this topic: where can I find some info on the difference between <preParseHandlers> and <postParseHandlers>?

I'd like to have a very clear view of the step of the parsing process at which they are invoked.

Thanks, Mauro

essiembre commented 6 years ago

Pre-parse handlers are invoked before original files are parsed to get the plain text out of them. You have to be careful which handlers you use there, since you may be trying to do text operations on binary files (e.g. PDFs). The <restrictTo> tag available with each handler can help. For example, if you need to operate on the raw XML or HTML markup, you would go with pre-parse handlers.
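For instance, here is a sketch of a pre-parse handler restricted to HTML documents only, so it never touches binary files. The restrictTo and replace syntax follows the 2.x documentation, and document.contentType is the usual metadata field holding the detected content type, but verify both against your Importer version:

```xml
<preParseHandlers>
  <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
    <!-- Only applied when the detected content type is HTML;
         all other documents pass through untouched. -->
    <restrictTo caseSensitive="false" field="document.contentType">text/html</restrictTo>
    <replace>
      <fromValue><![CDATA[&nbsp;]]></fromValue>
      <toValue> </toValue>
    </replace>
  </transformer>
</preParseHandlers>
```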

The post-parse handlers are invoked after the original document was parsed and its content should be guaranteed to be plain text at that point (without any formatting). For example, if you want to reject documents that have the word "potato" in them, regardless of their content-type, you would define that under post-parse handlers.
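To illustrate the "potato" example, a post-parse filter rejecting such documents might look like this. This is a sketch assuming the 2.x RegexContentFilter class; check the class name and options against your Importer version:

```xml
<postParseHandlers>
  <!-- Runs on the extracted plain text, so it works regardless of the
       original content type (HTML, PDF, Word, etc.). -->
  <filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
      onMatch="exclude" caseSensitive="false">
    <regex>.*potato.*</regex>
  </filter>
</postParseHandlers>
```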

So to over-simplify, you can think of it as:

  1. HTTP GET to download a raw document.
  2. Pre-parse handlers on raw document.
  3. Parse document to extract text + metadata/fields.
  4. Post-parse handlers on extracted text + metadata/fields.
  5. Commit the document (extracted text + metadata/fields), unless rejected for whatever reason.

For a more detailed view of the HTTP Collector execution flow, have a look here.

Clearer?

mauromi76 commented 6 years ago

Just perfect! Thank you so much. I will mark this as closed.

Mauro