Norconex / collector-filesystem

Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to network locations into various data repositories such as search engines.
http://www.norconex.com/collectors/collector-filesystem/

TextPatternTagger only on content field? #14

Closed jmrichardson closed 7 years ago

jmrichardson commented 7 years ago

Hi,

I am trying to use the TextPatternTagger (please suggest something different if this is not a good idea) to extract just the filename from the document path (document.reference). Here is my config:

          <tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
            <copy fromField="document.reference"  toField="document.filename" />
          </tagger>

         <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger" >
            <pattern field="document.filename" valueGroup="1">
              .*[\\\/](.+)\..+
            </pattern>
          </tagger>

I make a copy of the document.reference field, then use the TextPatternTagger to extract just the filename. However, the resulting document.filename field contains the full path with the content field appended. What am I doing wrong?

Thanks
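
The regular expression can be checked on its own, independently of how TextPatternTagger applies it. The following is a minimal JavaScript sketch, assuming the pattern behaves the same in JavaScript as in the Java regex engine for these constructs. The sample paths and the helper name `captureGroup1` are hypothetical and not part of the original configuration.

```javascript
// Same pattern as in the <pattern> element above.
const pattern = /.*[\\\/](.+)\..+/;

// Returns capture group 1 (the name without its last extension),
// or null when the text does not match at all.
function captureGroup1(text) {
  const m = pattern.exec(text);
  return m ? m[1] : null;
}

// Hypothetical sample references, for illustration only.
console.log(captureGroup1('c:\\docs\\reports\\summary.pdf')); // summary
console.log(captureGroup1('/mnt/share/archive.tar.gz'));      // archive.tar
console.log(captureGroup1('summary.pdf'));                    // null (no path separator)
```

Applied to a path alone, group 1 is just the filename, which suggests the unexpected output comes from the text the pattern is matched against rather than from the expression itself.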

essiembre commented 7 years ago

Are you using the latest snapshot? TextPatternTagger was enhanced in Importer 2.8.0-SNAPSHOT to support valueGroup and fieldGroup. You can get these new attributes if you use the latest Filesystem Collector snapshot (which includes the latest Importer snapshot version).

jmrichardson commented 7 years ago

Thank you for your help. I just downloaded the snapshot, unzipped it, and put it next to the other instance. I ran the following and got this error:


c:\Elastic>cd norconex-collector-filesystem-2.7.2-SNAPSHOT

c:\Elastic\norconex-collector-filesystem-2.7.2-SNAPSHOT>collector-fs.bat -a start -c c:\Elastic\ingest\crawler\config\config.xml
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/cli/DefaultParser
        at com.norconex.collector.core.AbstractCollectorLauncher.parseCommandLineArguments(AbstractCollectorLauncher.java:180)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:69)
        at com.norconex.collector.fs.FilesystemCollector.main(FilesystemCollector.java:76)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli.DefaultParser
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 3 more
c:\Elastic\norconex-collector-filesystem-2.7.2-SNAPSHOT>

Do I need to do anything special for the snapshot to work? I am using the same XML file that worked on the stable version. Thanks

PS. I can always go back to the stable version and get the 2.8 Importer if that is recommended.

essiembre commented 7 years ago

When you say "next to", do you mean in a separate directory? Because you can't really overwrite without worrying about Jars with duplicate versions. Please confirm.

Also, make sure you reinstall your committer (you can use the install script).
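
One way to check a lib folder for jars with duplicate versions is to group the jar file names by artifact name. The following is a rough sketch, assuming file names follow the usual `name-version.jar` convention. The function names and the sample listing are hypothetical.

```javascript
// Splits "name-version.jar" into its parts; returns null for other file names.
function parseJar(fileName) {
  const m = /^(.+?)-(\d[\w.\-]*)\.jar$/.exec(fileName);
  return m ? { name: m[1], version: m[2] } : null;
}

// Returns an object mapping artifact name -> list of versions,
// containing only artifacts that appear more than once.
function findDuplicateJars(fileNames) {
  const versions = new Map();
  for (const fileName of fileNames) {
    const jar = parseJar(fileName);
    if (!jar) continue; // skip anything that is not name-version.jar
    if (!versions.has(jar.name)) versions.set(jar.name, []);
    versions.get(jar.name).push(jar.version);
  }
  const duplicates = {};
  for (const [name, list] of versions) {
    if (list.length > 1) duplicates[name] = list;
  }
  return duplicates;
}

// Hypothetical lib folder listing, for illustration only.
const listing = [
  'commons-cli-1.2.jar',
  'commons-cli-1.3.1.jar',
  'norconex-importer-2.8.0-SNAPSHOT.jar',
];
console.log(findDuplicateJars(listing));
```

With Node.js, a real listing could be obtained with `require('fs').readdirSync(libDir)`, where `libDir` is the path to the collector's lib folder.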

jmrichardson commented 7 years ago

Yes, in a separate directory. I also installed the Elasticsearch committer.

essiembre commented 7 years ago

It appears the recent snapshot failed to package an updated dependency. commons-cli-1.2.jar needs to be upgraded to commons-cli-1.3.1.jar.

I will update this ticket when a new release is made to fix this.

In the meantime, you can download 1.3.1 here: http://apache.mirror.rafal.ca//commons/cli/binaries/ (the main site being: https://commons.apache.org/proper/commons-cli/). Store it in the lib folder after taking out version 1.2.

I will test all jars before deploying the fix, but if you find others, please advise.

essiembre commented 7 years ago

The issue with packaging the wrong dependencies has been resolved and a new snapshot (https://www.norconex.com/collectors/collector-filesystem/download) has been deployed. Please confirm.

jmrichardson commented 7 years ago

Will do today and confirm. Thank you for all your help

jmrichardson commented 7 years ago

I have installed the latest snapshot and it works great. However, I am still having an issue with the TextPatternTagger. What I want is a field containing just the filename without the extension. In other words, take the path and extract the filename into another field. My thought was to first use the CopyTagger to create a new field holding a copy of document.reference, which contains the path:

<tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
    <copy fromField="document.reference" toField="document.filename" />
</tagger>

Then, use the TextPatternTagger to apply a regex to the newly created document.filename field, which contains the full path, and keep only the filename:

<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger" >
    <pattern field="document.filename" valueGroup="1">
        .*[\\\/](.+)\..+
    </pattern>
</tagger>

However, the result is that I get the full path and the content field combined together in the new document.filename field.

I don't see how the content field is involved here, unless the TextPatternTagger only works on the content field. How can I accomplish this?

Thanks

jmrichardson commented 7 years ago

Hi, I was able to get the above working using a different method with ScriptTagger:

        <tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
          <script><![CDATA[
            metadata.addString('document.filename', 
              metadata.getString('document.reference').replace(/\.[^/.]+$/, "").replace(/^.*[\\\/]/,"")
            );
          ]]></script>
        </tagger>
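
The string expression in the script above can be tried outside the collector. The following is a minimal standalone sketch of the same two `replace()` calls. The wrapper name `baseName` and the sample paths are hypothetical.

```javascript
// Same two replace() calls as in the ScriptTagger script above.
function baseName(reference) {
  return reference
    .replace(/\.[^/.]+$/, "")   // drop a trailing ".ext" at the end of the string
    .replace(/^.*[\\\/]/, "");  // drop everything up to the last "\" or "/"
}

// Hypothetical sample references, for illustration only.
console.log(baseName('c:\\docs\\reports\\summary.pdf')); // summary
console.log(baseName('/mnt/share/archive.tar.gz'));      // archive.tar
console.log(baseName('/home/user.name/README'));         // README
console.log(baseName('c:\\my.dir\\README'));             // my
```

The last case shows an edge case: the extension pattern excludes "/" but not "\", so a backslash-separated path with a dotted directory and an extensionless file is cut at the directory's dot. Adding the backslash to the first character class, as in `/\.[^\\/.]+$/`, would avoid this if such paths can occur.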