Closed jmrichardson closed 7 years ago
Are you using the latest snapshot? TextPatternTagger was enhanced in Importer 2.8.0-SNAPSHOT to support valueGroup and fieldGroup. You can get these new attributes if you use the latest Filesystem Collector snapshot (which includes the latest Importer snapshot version).
Thank you for your help. I just downloaded the snapshot, unzipped it, and put it next to the other instance. I ran the following and got this error:
c:\Elastic>cd norconex-collector-filesystem-2.7.2-SNAPSHOT
c:\Elastic\norconex-collector-filesystem-2.7.2-SNAPSHOT>collector-fs.bat -a start -c c:\Elastic\ingest\crawler\config\config.xml
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/cli/DefaultParser
at com.norconex.collector.core.AbstractCollectorLauncher.parseCommandLineArguments(AbstractCollectorLauncher.java:180)
at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:69)
at com.norconex.collector.fs.FilesystemCollector.main(FilesystemCollector.java:76)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli.DefaultParser
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 3 more
c:\Elastic\norconex-collector-filesystem-2.7.2-SNAPSHOT>
Do I need to do anything special for the snapshot to work? I am using the same XML file that worked on the stable version. Thanks.
PS. I can always go back to the stable version and get the 2.8 Importer if that is recommended.
When you say "next to", do you mean in a separate directory? Because you can't really overwrite without worrying about Jars with duplicate versions. Please confirm.
Also, make sure you reinstall your committer (you can use the install script).
Yes, in a separate directory. I also installed the Elasticsearch committer.
It appears the recent snapshot failed to package an updated dependency: commons-cli-1.2.jar needs to be upgraded to commons-cli-1.3.1.jar.
I will update this ticket when a new release is made to fix this.
In the meantime, you can download 1.3.1 here: http://apache.mirror.rafal.ca//commons/cli/binaries/ (the main site being: https://commons.apache.org/proper/commons-cli/). Store it in the lib folder after taking out version 1.2.
I will test all jars before deploying the fix, but if you find others, please advise.
The issue with packaging the wrong dependencies has been resolved and a new snapshot (https://www.norconex.com/collectors/collector-filesystem/download) has been deployed. Please confirm.
Will do today and confirm. Thank you for all your help
I have installed the latest snapshot and it works great. However, I am still having an issue with the TextPatternTagger. I want to create a field that holds just the filename without the extension. In other words, take the path and extract the filename into another field. My thought was to first use the CopyTagger to create a new field holding a copy of document.reference, which contains the path:
<tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
<copy fromField="document.reference" toField="document.filename" />
</tagger>
Then, use the TextPatternTagger with a regex on the newly created document.filename field, which contains the full path, so that only the filename is left:
<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger" >
<pattern field="document.filename" valueGroup="1">
.*[\\\/](.+)\..+
</pattern>
</tagger>
However, the result is that I get the full pathname and the content field combined together in the new document.filename field.
I don't know how the content field is involved here, unless the TextPatternTagger only works on the content field. How can I accomplish this?
Thanks
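To separate the regex from the tagger's behavior, the pattern can be checked against a plain path string. The following is a minimal JavaScript sketch, not part of the Importer; the helper name extractBaseName and the sample paths are made up for illustration. It suggests the pattern itself does isolate the filename when it is matched against the path, so the unexpected output is more likely related to which text the tagger matches against.

```javascript
// Same pattern as in the <pattern> element above, applied to a plain string.
const FILENAME_PATTERN = /.*[\\\/](.+)\..+/;

// extractBaseName is a hypothetical helper used only for this illustration.
// It returns capture group 1 (the equivalent of valueGroup="1"), or null
// when the pattern does not match.
function extractBaseName(path) {
  const match = FILENAME_PATTERN.exec(path);
  return match ? match[1] : null;
}

console.log(extractBaseName("c:\\Elastic\\ingest\\docs\\report.pdf")); // report
console.log(extractBaseName("/var/data/notes.v2.txt"));                // notes.v2
```

Because both quantifiers before the final dot are greedy, only the last extension is dropped, and a path with no separator at all yields no match.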
Hi, I was able to get the above working using a different method with ScriptTagger:
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
<script><![CDATA[
metadata.addString('document.filename',
metadata.getString('document.reference').replace(/\.[^/.]+$/, "").replace(/^.*[\\\/]/,"")
);
]]></script>
</tagger>
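The two replace calls in the script above can be tried outside the collector. This is a small standalone sketch of the same expression; the function name fileNameWithoutExtension and the sample paths are assumptions for illustration, and it operates on an ordinary JavaScript string rather than on the metadata object.

```javascript
// Standalone version of the expression used in the ScriptTagger above.
// fileNameWithoutExtension is a hypothetical name chosen for this sketch.
function fileNameWithoutExtension(reference) {
  return reference
    .replace(/\.[^/.]+$/, "")   // drop the final ".ext" at the end of the string
    .replace(/^.*[\\\/]/, "");  // drop everything up to the last / or \
}

console.log(fileNameWithoutExtension("c:\\Elastic\\docs\\report.pdf")); // report
console.log(fileNameWithoutExtension("/var/data/archive.tar.gz"));      // archive.tar
```

One caveat: the first pattern excludes "/" but not "\" after the dot, so a Windows path with a dotted directory and an extensionless file (for example c:\dir.v2\README) would be cut at the directory's dot and come back as "dir". Adding the backslash to the character class would likely avoid that.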
Hi,
I am trying to use the TextPatternTagger (suggest something different if this is not a good idea) to get just the filename from the document path (document.reference). Here is my config:
I make a copy of the document.reference field, then use the TextPatternTagger to get just the filename. However, the result is that document.filename has the full path with the content field appended. What am I doing wrong?
Thanks