Closed by bruce-genhot 8 years ago
The DOMTagger will indeed return the HTML, but the underlying API it uses lets you control whether it returns the HTML or just the text. I am marking this as a feature request to add a flag that tells it exactly what to return.
In the meantime, you can use ReplaceTagger to remove the HTML tags with a regular expression, like this (not tested):
<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
  <replace fromField="detail" regex="true">
    <fromValue><![CDATA[<[^>]*>]]></fromValue>
    <toValue></toValue>
  </replace>
</tagger>
OK, thanks.
The latest snapshot now extracts the text by default instead of the HTML. It has a new extract attribute you can set to change what it returns. Possible values are "text" (default), "html", and "outerHtml".
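For illustration, a configuration using the new attribute might look like the sketch below (not tested). The selector "div.detail" and the target field "detail" are placeholders rather than values from this thread, and placing extract on the dom element is an assumption based on how DOMTagger is usually configured:

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <!-- selector and toField are hypothetical; adjust them to your pages. -->
  <!-- extract="text" is the new default; use "html" or "outerHtml" to keep markup. -->
  <dom selector="div.detail" toField="detail" extract="text" />
</tagger>
```

With extract="text" the ReplaceTagger workaround should no longer be needed for this field.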
Yes, it works for me now, thanks.
I am using DOMTagger to extract something, like below.
The result is a piece of HTML code. Can I use a tagger to remove the HTML from it?