Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Is there a tagger for removing HTML code from a field? #217

Closed: bruce-genhot closed this issue 8 years ago

bruce-genhot commented 8 years ago

I am using DOMTagger to extract part of a page, as shown below.

  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="#showacticle" toField="detail" overwrite="false"></dom>
      <restrictTo caseSensitive="false" field="document.reference"><![CDATA[http://www\.nxzfcg\.gov\.cn/morelink\.aspx\?type=12&index=2]]></restrictTo>
  </tagger>

The result is a piece of HTML code. Can I use a tagger to remove the HTML from it?

essiembre commented 8 years ago

The DOMTagger does indeed return the HTML, but the underlying API it uses lets you control whether to return the HTML or just the text. I am marking this as a feature request to add a flag that tells it exactly what to return.

In the meantime, you can use ReplaceTagger to remove HTML tags with a regular expression, like this (not tested):

  <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
      <replace fromField="detail" regex="true">
          <fromValue><![CDATA[<[^>]*>]]></fromValue>
          <toValue></toValue>
      </replace>
  </tagger>

bruce-genhot commented 8 years ago

OK, thanks.

essiembre commented 8 years ago

The latest snapshot now extracts the text by default instead of the HTML. It has a new "extract" attribute you can add to change what it returns. Possible values are "text" (default), "html", and "outerHtml".
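
For illustration, here is a sketch of the configuration from the original question with the new attribute set explicitly (not tested). It assumes the attribute is placed on the dom element; the comment above does not say where it goes, so check that against the snapshot's documentation.

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <!-- Assumption: "extract" goes on the dom element.
         "text" is the stated default, so it is spelled out here only for clarity. -->
    <dom selector="#showacticle" toField="detail" overwrite="false" extract="text"></dom>
    <restrictTo caseSensitive="false" field="document.reference"><![CDATA[http://www\.nxzfcg\.gov\.cn/morelink\.aspx\?type=12&index=2]]></restrictTo>
</tagger>
```

Under the same assumption, switching the value to "html" or "outerHtml" should give the markup back, so the ReplaceTagger workaround would only be needed on versions that lack the attribute.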

bruce-genhot commented 8 years ago

Yes, it works for me now, thanks.