Closed AntonioAmore closed 9 years ago
Creating your own implementation is the ultimate way to get precisely what you want, but I suspect existing ones can get you a long way. Hopefully these can help (they all go within your <importer> tags):
Collapsing spaces and line feeds:
<transformer class="com.norconex.importer.handler.transformer.impl.ReduceConsecutivesTransformer">
<reduce>\s</reduce>
<reduce>\n</reduce>
<reduce>\s\n</reduce>
</transformer>
First it will merge consecutive spaces into a single space. Then multiple line feeds into one. Finally, if you end up with consecutive "space-line feed" sequences, those are reduced too. You may also have to account for the carriage return (\r).
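As a rough sketch of what those three reductions amount to, here is the same idea in plain `java.util.regex` (this is an illustration only, not the transformer's actual implementation; the exact semantics of `ReduceConsecutivesTransformer` may differ):

```java
public class ReduceSketch {

    public static String reduce(String text) {
        // 1. Merge runs of spaces/tabs into a single space
        text = text.replaceAll("[ \t]+", " ");
        // 2. Merge runs of line feeds into a single line feed
        text = text.replaceAll("\n+", "\n");
        // 3. Merge runs of "space + line feed" pairs into one pair
        text = text.replaceAll("( \n)+", " \n");
        return text;
    }

    public static void main(String[] args) {
        System.out.println(reduce("a  b\n\n\nc \n \nd"));
    }
}
```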
Removing CSS (two regex-based options):
<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer" >
<replace>
<fromValue>class=".*?"</fromValue>
<toValue></toValue>
</replace>
</transformer>
<transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true" >
<stripBetween>
<start>&lt;style.*?&gt;</start>
<end>&lt;/style&gt;</end>
</stripBetween>
</transformer>
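In plain Java, those two regex strategies look roughly like this (a sketch using only `java.util.regex`; the transformers above apply the same idea to the document content):

```java
public class StripCssSketch {

    // The ReplaceTransformer approach: remove class="..." attributes
    public static String stripClassAttrs(String html) {
        return html.replaceAll("class=\".*?\"", "");
    }

    // The StripBetweenTransformer approach: remove whole <style>...</style> blocks
    // (?s) lets "." match line breaks inside the style block
    public static String stripStyleBlocks(String html) {
        return html.replaceAll("(?s)<style.*?>.*?</style>", "");
    }
}
```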
You can use the <![CDATA[ ]]> technique if you do not want to escape XML characters in your regex like I did above.
Accept only documents having a given pattern:
<filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter" onMatch="include">
<!--the regex of what must match goes here-->
</filter>
Try these and let me know if I missed something.
I didn't understand the <![CDATA[ ]]> usage. Could you provide me a short example?
And what if I want to walk through the DOM structure of the document and analyze ids, class names, etc., to delete footers, headers, menus, and blocks not related to the text of the page? I know about Jericho and other Java libraries. Is it possible to implement such logic out of the box, or do I have to write a custom Importer? I'd like to keep everything the Importer already does, and also remove DOM elements whose content, id, or class matches a pattern.
Have a quick look here for what CDATA is about. In short, it allows one to put raw text that won't be interpreted by the XML parser and is thus loaded as is. CDATA must be supported by all XML parsers, so it is not something specific to the Collector. An example:
<start>
<![CDATA[
hey! type freely: no need to XML-encode these: "&<>"
]]>
</start>
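Applied to the StripBetweenTransformer example above, the same configuration could be written with CDATA instead of entity escaping (a sketch; note CDATA can only appear in element text, not in attribute values):

```xml
<transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
  <stripBetween>
    <start><![CDATA[<style.*?>]]></start>
    <end><![CDATA[</style>]]></end>
  </stripBetween>
</transformer>
```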
There are currently no out-of-the-box handlers that deal with the DOM structure. Some documents can be huge, and Collectors generally prefer to deal with streams (rather than loading everything into a DOM tree) to stay safe on the memory/performance side. Another reason is the often poor quality of HTML pages on the web: HTML is very often not valid XML and can crash many XML parsers unless all sorts of tweaks happen first (like cleaning up the HTML code before parsing -- see below). Plus, the regex-based handlers can normally handle your HTML manipulation scenarios.
This being said, if the extra overhead is not an important factor, there are libraries out there that can help you get started. Look at TagSoup, which luckily is already shipped with the HTTP Collector (in the lib folder). It is already used by Apache Tika, which is in turn used by the HTTP Collector. TagSoup will clean the HTML so you can have a valid DOM.
Offering a DOM-based handler in a future release is not a bad idea. If you have ideas what the features should be for a configurable DOM-based handler, let's make that a feature request.
Actually, I have an idea for an importer handler, something like this (trying to keep your config style):
<domRule class="com.norconex.regexp.rule" type="regexp" target="div.class div.id div.name" action="exclude">
.*menu.*
</domRule>
It means: delete from the page's HTML all div containers whose id, class, or name matches the pattern.
Or
<domRule class="com.norconex.regexp.rule" type="regexp" action="include">
<h1
</domRule>
which means: select the DOM leaf which contains the "<h1" substring.
I guess TagSoup is able to do it; the standard importer would just need a slight modification.
I tried implementing the filter option:
<postParseHandlers>
  <filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter" onMatch="include" caseSensitive="false">
    .*\bkeyword\b.*
  </filter>
</postParseHandlers>
However, pages without the "keyword" still get committed. What is the correct way to use this feature?
Thanks
Moved the last comment to a new issue: https://github.com/Norconex/collector-http/issues/108
The latest Importer snapshot release allows filtering or splitting XML/HTML documents using a CSS/jQuery-like selector syntax. These new Importer handlers are DOMSplitter and DOMContentFilter.
A new DOMTagger was just added also. Grab the latest importer snapshot. It also has been included with the latest HTTP Collector snapshot.
Provide feedback if you can before I close it.
Thank you for the implementation! Is it possible to have several selectors per tagger instead of several taggers at a time? It may be more efficient, I think: a Jsoup instance is created, I guess, for every tagger declaration, and costly HTML document parsing is performed each time. So several selectors per tagger (combined as a union of sets, for example; more set operations could be added) may decrease total document processing time.
That makes too much sense. :-) I will do so when I get a chance. Let's leave the issue open until it's done.
For the same reasons, I will try to support the same idea for DOMContentFilter, in case more than one DOM condition needs to be met to include or exclude a document.
Sure, of course. I've seen in Jsoup how they use selectors to catch several elements:
Element body = doc.getElementsByTag("body").first();
body.select("script,style,meta,noscript,CDATA,link").remove();
Might it work the same way in the current implementation? It seems to be a valid selector, so the issue may only be in the XML config reading (or maybe not).
Done. I dropped the idea of doing it for DOMContentFilter as well, since it would not be as intuitive and it is less likely that one needs to specify multiple selectors for filtering. With the latest Importer snapshot, the new way to use the DOMTagger is like this:
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
<dom selector="(selector syntax)" toField="(target field)" overwrite="[false|true]" />
<!-- add more as needed -->
</tagger>
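For instance, building on the Jsoup group-selector idea discussed earlier, multiple extractions can now live in one tagger declaration so the document is parsed only once (the selectors and field names below are made up for illustration):

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <dom selector="h1" toField="title" overwrite="true" />
  <dom selector="div.content p" toField="body" overwrite="false" />
</tagger>
```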
I also made a new HTTP Collector snapshot release with the latest Importer in it.
Please test and confirm if that works for you.
Did you have a chance to confirm whether the new way of using DOMTagger works for you?
Having received no feedback and since it works in my own testing, I am closing this one.
Sorry for the delay with feedback.
I'm using the following importer configuration (nothing else except postParseHandlers):
<postParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="div.text" toField="document" overwrite="true" />
  </tagger>
</postParseHandlers>
with FileCommitter.
I believe the selector is right, because I checked it against the crawled page with a tool. But it seems there are no changes in the crawled files: they look like the source page, not a fragment as expected.
Using recent 2.3.0 dev snapshot.
I have the following task:
For such purposes I plan to use a custom Importer. Am I right to select that interface? What do you recommend I use for that?