Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Preprocess question #50

Closed angelo337 closed 7 years ago

angelo337 commented 7 years ago

Hi there, I am trying to crawl a website with several file types, and I have to apply strip-before and strip-after transformers. When I hit a file that is not application/HTML, I get an error. Is it possible to apply the strips to just a single type of file? I already tried splitting the crawl into one crawler for HTML and another for the other types, but the other crawler just gets stuck and does not crawl at all. Thanks a lot, Angelo

essiembre commented 7 years ago

All Importer handlers support the restrictTo tag, which lets you make sure a handler is applied only to the documents you want. For instance, if you want StripBeforeTransformer to be applied only to text/html documents, you can do it like this:

  <transformer class="com.norconex.importer.handler.transformer.impl.StripBeforeTransformer">
      <restrictTo field="document.contentType">text/html</restrictTo>
      <stripBeforeRegex>.*your regex.*</stripBeforeRegex>
  </transformer>
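
Since the question mentions stripping both before and after, the same restriction could presumably be repeated on the "after" handler. This is only a sketch: the StripAfterTransformer class name and the stripAfterRegex tag are assumed here by analogy with the example above, so please check them against the documentation for your Importer version. The regex is a placeholder.

  <!-- Assumption: class and tag names mirror the StripBeforeTransformer example -->
  <transformer class="com.norconex.importer.handler.transformer.impl.StripAfterTransformer">
      <!-- Only apply this handler to HTML documents -->
      <restrictTo field="document.contentType">text/html</restrictTo>
      <!-- Placeholder regex: replace with your own -->
      <stripAfterRegex>.*your regex.*</stripAfterRegex>
  </transformer>

With both handlers restricted this way, non-HTML files such as PDFs should pass through without either strip being attempted.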

angelo337 commented 7 years ago

Thanks a lot for your answer, I am going to try it.