Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Crawled page advanced logic #48

Closed AntonioAmore closed 9 years ago

AntonioAmore commented 9 years ago

I have the following task:

  1. Strip the collected page of HTML markup, including menus, etc., keeping only the content. The default importer seems to have such logic, but I need something more advanced, including space/line-end collapsing, removing CSS classes whose names match a set of patterns, etc.
  2. Check the text for keywords/key phrases and, if they're present, allow the document to be committed.

For such purposes I plan to use a custom Importer. Am I right to choose that interface? What do you recommend I use for this?

essiembre commented 9 years ago

Creating your own implementation is the ultimate way to get precisely what you want, but I suspect the existing ones can get you a long way. Hopefully these can help (they all go within your <importer> tags):

Collapsing spaces and line feeds:

<transformer class="com.norconex.importer.handler.transformer.impl.ReduceConsecutivesTransformer">
      <reduce>\s</reduce>
      <reduce>\n</reduce>
      <reduce>\s\n</reduce>
</transformer>

First it will merge consecutive spaces into a single space, then multiple line feeds into one. Finally, if you end up with consecutive "space-line feed" sequences, those are reduced too. You may have to account for the carriage return (\r) as well.
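
Purely to illustrate that reduction order, here is a tiny stand-alone Java sketch (this is not the transformer's code; it uses literal space and newline characters instead of the \s class, and the input string is made up):

public class ReduceDemo {

    // Collapse two or more consecutive occurrences of a pattern into one.
    static String reduce(String text, String regex) {
        return text.replaceAll("(" + regex + ")+", "$1");
    }

    public static void main(String[] args) {
        String raw = "Some   text\n\n\nwith \n \n noisy whitespace";
        String out = reduce(raw, " ");   // merge consecutive spaces
        out = reduce(out, "\n");         // merge consecutive line feeds
        out = reduce(out, " \n");        // merge consecutive space-line-feed pairs
        System.out.println(out);
    }
}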

Removing CSS (two regex-based options):

<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer" >
      <replace>
          <fromValue>class=".*?"</fromValue>
          <toValue></toValue>
      </replace>
</transformer>

<transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true" >
      <stripBetween>
          <start>&lt;style.*?&gt;</start>
          <end>&lt;/style&gt;</end>
      </stripBetween>
</transformer>

You can use the <![CDATA[ ]]> technique if you do not want to escape XML characters in your regex like I did above.

Accept only documents having a given pattern:

<filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter" onMatch="include">
      <!--the regex of what must match goes here-->
</filter>

Try these and let me know if I missed something.

AntonioAmore commented 9 years ago

I didn't get the <![CDATA[ ]]> usage. Could you provide a very short example?

And what if I want to walk through the DOM structure of the document and analyze ids, class names, etc., to delete footers, headers, menus, and blocks not related to the text of the page? I know about Jericho and other Java libraries. Is it possible to implement such logic out of the box, or do I have to write a custom Importer? I'd like to keep everything the Importer already does, and remove DOM elements whose content, id, or class matches a pattern.

essiembre commented 9 years ago

Have a quick look here for what CDATA is about. In short, it allows one to put in raw text that won't be interpreted by the XML parser and is thus loaded as is. CDATA must be supported by all XML parsers, so it's not something specific to the Collector. An example:

<start>
<![CDATA[
hey! type freely: no need to XML-encode these: "&<>"
]]>
</start>

There are currently no out-of-the-box handlers that deal with the DOM structure. Some documents can be huge, and Collectors prefer to deal with streams in general (rather than load everything into a DOM tree) to stay safe on the memory/performance side. Another reason is the very poor quality of many HTML pages on the web: HTML is very often not valid XML and can crash many XML parsers unless all sorts of tweaks happen first (like cleaning up the HTML code before parsing -- see below). Plus, the regex-based handlers can normally handle your HTML manipulation scenarios.

This being said, if the extra overhead is not an important factor, there are libraries out there that can help you get started. Look at TagSoup, which luckily is already shipped with the HTTP Collector (in the lib folder). It is already used by Apache Tika, which is in turn used by the HTTP Collector. TagSoup will clean the HTML so you can have a valid DOM.
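
As an illustration, here is a minimal stand-alone sketch of using TagSoup to turn messy HTML into well-formed markup (the input string is made up; TagSoup exposes itself as a standard SAX XMLReader, so an identity transform is enough to serialize the cleaned result):

import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;

public class TagSoupDemo {
    public static void main(String[] args) throws Exception {
        String badHtml = "<p>unclosed <b>tags<br>everywhere";

        // TagSoup is a SAX XMLReader that never chokes on invalid HTML;
        // piping it through an identity Transformer yields well-formed markup.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        identity.transform(
                new SAXSource(new Parser(),
                        new InputSource(new StringReader(badHtml))),
                new StreamResult(out));
        System.out.println(out);
    }
}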

Offering a DOM-based handler in a future release is not a bad idea. If you have ideas what the features should be for a configurable DOM-based handler, let's make that a feature request.

AntonioAmore commented 9 years ago

Actually, I have an idea for an importer handler, something like this (trying to keep your config style):

<domRule class="com.norconex.regexp.rule" type="regexp" target="div.class div.id div.name" action="exclude">
   .*menu.*
</domRule>

It means: delete from the page's HTML all div containers whose id, class, or name matches the pattern.

Or

<domRule class="com.norconex.regexp.rule" type="regexp"  action="include">
   <h1
</domRule>

which means: select the DOM leaf that contains the "<h1" substring.

I guess TagSoup is able to do it; the standard importer would just need a slight modification.
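
As a rough prototype of the first rule above, here is how it might look with Jsoup (a hypothetical sketch, not Collector code; Jsoup's [attr~=regex] attribute selector does the pattern matching):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DomRuleSketch {
    public static void main(String[] args) {
        // Made-up input page: one navigation div, one content div
        String html = "<div id='topmenu'>nav links</div>"
                    + "<div class='text'>actual article text</div>";
        Document doc = Jsoup.parse(html);

        // Approximate the proposed "exclude" rule: drop any div whose
        // id or class attribute matches the regex .*menu.*
        doc.select("div[id~=.*menu.*], div[class~=.*menu.*]").remove();

        System.out.println(doc.body().html()); // only the content div remains
    }
}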

madsbrydegaard commented 9 years ago

I tried implementing the filter option:

<postParseHandlers>
    <filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
            onMatch="include" caseSensitive="false">
        .*\bkeyword\b.*
    </filter>
</postParseHandlers>

However, pages without the "keyword" still get committed. What is the correct way to use this feature?

Thnx

essiembre commented 9 years ago

Moved the last comment to a new issue: https://github.com/Norconex/collector-http/issues/108

essiembre commented 9 years ago

The latest Importer snapshot release allows for filtering or splitting XML/HTML documents using a CSS/jQuery-like syntax. These new Importer handlers are DOMSplitter and DOMContentFilter.

essiembre commented 9 years ago

A new DOMTagger was just added as well. Grab the latest Importer snapshot; it has also been included with the latest HTTP Collector snapshot.

Provide feedback if you can before I close it.

AntonioAmore commented 9 years ago

Thank you for the implementation! Is it possible to have several selectors per tagger instead of several taggers at a time? It may be more efficient, I think. A Jsoup instance is created, I guess, for every tagger declaration, and a costly HTML document parse is performed each time, so several selectors per tagger (combined as a union of sets, for example; more set operations could be added later) may decrease total document processing time.
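
A sketch of the parse-once idea being suggested (hypothetical; the selectors and input are made up):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseOnceSketch {
    public static void main(String[] args) {
        String html = "<div class='header'>h</div><div class='menu'>m</div>"
                    + "<p>body text</p><div class='footer'>f</div>";

        // Parse the document once...
        Document doc = Jsoup.parse(html);

        // ...then apply all selectors against the same DOM tree, instead
        // of re-parsing the HTML once per tagger declaration.
        for (String sel : new String[] {"div.header", "div.menu", "div.footer"}) {
            doc.select(sel).remove();
        }
        System.out.println(doc.body().html()); // only <p>body text</p> remains
    }
}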

essiembre commented 9 years ago

That makes too much sense. :-) I will do so when I get a chance. Let's leave the issue open until it's done.

For the same reasons, I will try to support the same idea for DOMContentFilter, in case more than one DOM condition needs to be met to include or exclude a document.

AntonioAmore commented 9 years ago

Sure, of course. I've seen how Jsoup selectors can catch several elements:

// Grab the page body, then remove several element types with one combined selector:
Element body = doc.getElementsByTag("body").first();
body.select("script,style,meta,noscript,CDATA,link").remove();

Might it work the same way in the current implementation? It seems to be a valid selector, so the issue may only be in the XML config reading (or maybe not).

essiembre commented 9 years ago

Done. I dropped the idea of doing it for DOMContentFilter as well, since it would not be as intuitive, and it is less likely that multiple selectors would be needed for filtering. With the latest Importer snapshot, the new way to use the DOMTagger is like this:

  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="(selector syntax)" toField="(target field)" overwrite="[false|true]" />
      <!-- add more as needed -->
  </tagger>

I also made a new HTTP Collector snapshot release with the latest Importer in it.

Please test and confirm if that works for you.

essiembre commented 9 years ago

Did you have a chance to confirm if the new way of using DOMTagger works for you?

essiembre commented 9 years ago

Having received no feedback and since it works in my own testing, I am closing this one.

AntonioAmore commented 9 years ago

Sorry for the delay with feedback.

I'm using the following importer configuration (nothing else except postParseHandlers):

<postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="div.text" toField="document" overwrite="true" />
    </tagger>
</postParseHandlers>

with FileCommitter.

I believe the selector is right, because I checked it against the crawled page with a separate tool. But it seems there are no changes in the crawled files: they look like the source page, not the expected fragment.

Using a recent 2.3.0 dev snapshot.
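
A quick way to double-check such a selector outside the collector is a small Jsoup program like this hypothetical sketch (the URL is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: point this at the actual crawled page
        Document doc = Jsoup.connect("http://example.com/page.html").get();

        Elements hits = doc.select("div.text");
        System.out.println("matches: " + hits.size());
        for (Element e : hits) {
            System.out.println(e.text());
        }
    }
}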