Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Boilerpipe usage on importer #48

Closed (angelo337 closed this 7 years ago)

angelo337 commented 7 years ago

Hi there, I am trying to figure out how to use the Boilerpipe jar file, but I have not been able to. Could you please post some basic instructions or point me to a link? Thanks a lot.

danizen commented 7 years ago

@angelo337, I have a bunch of code that does this, which is proprietary at this time. I can share with you basically how I did it:

  1. Download the boilerpipe wrapper for Python, https://github.com/sorpaas/python-boilerpipe/, or boilerpipe-py3 from PyPI. Make sure it is working (this takes some time).
  2. Look at the source code of boilerpipe/extractor/__init__.py.
  3. Port that logic to Java in a custom Norconex Tagger and/or Transformer.
  4. You may also want to replace boilerpipe 1.1.0 with a boilerpipe 1.2.x release. I did this with the following changes to my pom.xml:
    <dependency>
      <groupId>com.norconex.collectors</groupId>
      <artifactId>norconex-importer</artifactId>
      <version>${norconex.importer.version}</version>
      <exclusions>
        <exclusion>
          <!-- We will get these classes somewhere else -->
          <groupId>de.l3s.boilerpipe</groupId>
          <artifactId>boilerpipe</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <!-- Get them from here instead: a later version -->
    <dependency>
      <groupId>com.syncthemall</groupId>
      <artifactId>boilerpipe</artifactId>
      <version>1.2.2</version>
    </dependency>
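
For steps 2 and 3, the core of what gets ported is small. The following is a minimal Java sketch, not the proprietary code mentioned above. It assumes, as the Python wrapper appears to, that an extractor is looked up by name under `de.l3s.boilerpipe.extractors` and that its singleton `INSTANCE` exposes `getText(String)`. The class name `BoilerpipeSketch` and its helper methods are hypothetical. Reflection is used only so the sketch compiles without the boilerpipe jar; with the dependency above on the classpath, calling `ArticleExtractor.INSTANCE.getText(html)` directly is simpler. Wiring the result into a Tagger or Transformer is left out because it depends on the Importer version.

```java
import java.lang.reflect.Method;
import java.util.Optional;

// Hypothetical helper class; not part of Norconex Importer or boilerpipe.
class BoilerpipeSketch {

    // Package assumed to hold boilerpipe's stock extractors
    // (ArticleExtractor, DefaultExtractor, ...).
    static final String EXTRACTOR_PACKAGE = "de.l3s.boilerpipe.extractors.";

    /** Builds the fully qualified class name for a stock extractor. */
    static String extractorClassName(String simpleName) {
        return EXTRACTOR_PACKAGE + simpleName;
    }

    /**
     * Runs the named extractor over an HTML string and returns the text
     * that survives boilerplate removal. Returns Optional.empty() when the
     * extractor class cannot be found, for example because boilerpipe is
     * not on the classpath.
     */
    static Optional<String> extractText(String extractorName, String html) {
        try {
            Class<?> type = Class.forName(extractorClassName(extractorName));
            // Assumption: the extractor exposes a public static INSTANCE field.
            Object extractor = type.getField("INSTANCE").get(null);
            Method getText = type.getMethod("getText", String.class);
            return Optional.ofNullable((String) getText.invoke(extractor, html));
        } catch (ClassNotFoundException e) {
            return Optional.empty();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("boilerpipe call failed", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Some article text.</p></body></html>";
        Optional<String> text = extractText("ArticleExtractor", html);
        System.out.println(text.orElse("(boilerpipe is not on the classpath)"));
    }
}
```

Inside a custom Tagger or Transformer, the string returned by `extractText` would be what you store in a metadata field or write back as the document content.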

This all lives in work that is currently proprietary, which evolved from https://github.com/danizen/trynorconex

danizen commented 7 years ago

@essiembre, my blog, http://danizen.net/, has an updated look and feel, and I do plan to blog about some of this, both as a way to pay you back for all your work and to share what has worked for me.

essiembre commented 7 years ago

@danizen, looking forward to reading it. :-)

angelo337 commented 7 years ago

Thanks a lot for your help, @danizen. I am taking a look and will try to make it work. Best regards.