Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0
183 stars, 67 forks

How to get full page content? #623

Closed Tsyklop closed 4 years ago

Tsyklop commented 5 years ago

I need to get the full page content (with HTML tags) in my committer. How can I do this?

For now I am getting just the text, without HTML tags and other information.

Maybe there is some class that provides what I need?

essiembre commented 5 years ago

Under the Importer section of your config, you can define content types you do not want to have parsed:

  <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
      <ignoredContentTypes>
         .*text/html.*
      </ignoredContentTypes>
  </documentParserFactory>
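
For context, here is a rough sketch of where that snippet might sit in a complete HTTP Collector 2.x configuration file. The collector and crawler ids, the start URL, and the committer placeholder are illustrative assumptions and not part of the answer above; only the `documentParserFactory` block comes from it. With HTML excluded from parsing, the committer should receive the page content unparsed, tags included.

```xml
<httpcollector id="example-collector"> <!-- id is a made-up example -->
  <crawlers>
    <crawler id="example-crawler"> <!-- id is a made-up example -->
      <startURLs>
        <url>https://example.com/</url> <!-- placeholder URL -->
      </startURLs>

      <importer>
        <!-- From the answer above: skip parsing for HTML documents so
             their original markup is not stripped down to plain text. -->
        <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
          <ignoredContentTypes>.*text/html.*</ignoredContentTypes>
        </documentParserFactory>
      </importer>

      <!-- Your own committer configuration goes here (not shown). -->
    </crawler>
  </crawlers>
</httpcollector>
```

Note that `ignoredContentTypes` is a regular expression matched against the detected content type, so other document types (PDF, Office files, and so on) would still be parsed as usual.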