Closed: jetnet closed this issue 7 years ago.
First, thank you for your great feedback!
About SMB support, it may already be possible. The Filesystem Collector uses Apache Commons VFS, and I noticed they have SMB support in their "sandbox". I have no idea how well it is supported, but it is probably worth a try.
Are you working from the code or from the binaries? Assuming binaries, here is what I would do to try it out (not tested): download the Commons VFS jar with the sandbox providers and use it to replace commons-vfs2-2.0.jar, found under the lib folder of where you installed the collector. You can then try it. The path format is the following (according to the Apache documentation):
smb://[username[:password]@]hostname[:port][absolute-path]
Examples:
smb://somehost/home
I am not sure if the "smb" protocol will be automatically registered.
Let me know if you are successful with this approach.
Hi Pascal, thank you for the quick reply!
I believe the correct URL for the latest 2.1-SNAPSHOT VFS jar is: http://repository.jboss.org/org/apache/commons/commons-vfs2/2.1-SNAPSHOT/
Just a quick test:
INFO [CrawlerEventManager] REJECTED_ERROR: smb://localhost/share/docs
ERROR [AbstractCrawler] Sample Crawler: Could not process document: smb://localhost/share/docs (Cannot resolve: smb://localhost/share/docs)
com.norconex.collector.fs.FilesystemCollectorException: Cannot resolve: smb://localhost/share/docs
at com.norconex.collector.fs.crawler.FilesystemCrawler.executeImporterPipeline(FilesystemCrawler.java:161)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:487)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:377)
at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:735)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.commons.vfs2.FileSystemException: Badly formed URI "smb://localhost/share/docs".
at org.apache.commons.vfs2.provider.url.UrlFileProvider.findFile(UrlFileProvider.java:90)
at org.apache.commons.vfs2.impl.DefaultFileSystemManager.resolveFile(DefaultFileSystemManager.java:823)
at org.apache.commons.vfs2.impl.DefaultFileSystemManager.resolveFile(DefaultFileSystemManager.java:760)
at org.apache.commons.vfs2.impl.DefaultFileSystemManager.resolveFile(DefaultFileSystemManager.java:709)
at com.norconex.collector.fs.crawler.FilesystemCrawler.executeImporterPipeline(FilesystemCrawler.java:158)
... 6 more
Caused by: java.net.MalformedURLException: unknown protocol: smb
at java.net.URL.<init>(URL.java:600)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)
at org.apache.commons.vfs2.provider.url.UrlFileProvider.findFile(UrlFileProvider.java:71)
... 10 more
So, it looks like the smb protocol does not get registered automatically. But anyway, thank you for pointing me in the right direction. I'll look into this!
You are right about the URL.
Apparently you can add a /META-INF/vfs-providers.xml file to your classpath. That file can be used to define custom mappings of protocols (schemes). More info here: https://commons.apache.org/proper/commons-vfs/api.html
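For reference, here is a minimal sketch of what that descriptor could look like to map the "smb" scheme. The provider class name is an assumption based on the Commons VFS sandbox package layout; verify it against the jar you actually downloaded:

```xml
<providers>
    <!-- Map the "smb" scheme to the sandbox SMB provider.
         Class names below are assumptions; check your jar contents. -->
    <provider class-name="org.apache.commons.vfs2.provider.smb.SmbFileProvider">
        <scheme name="smb"/>
        <!-- Only register the provider if JCIFS is on the classpath. -->
        <if-available class-name="jcifs.smb.SmbFile"/>
    </provider>
</providers>
```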
Let me know if you can make it work. We may consider adding the sandbox jar to the collector if you find it stable enough.
WOW! It works! You are a genius! :)
The full URI looks like this:
smb://domain\user:password@host/share/path
So, a journey of a thousand miles begins with a single step :) The metadata does not contain the most important part: the ACLs. They would look something like this:
"allow_token_document": [
"S-1-15-2-1",
"S-1-5-18",
"S-1-5-32-544",
"S-1-5-32-545",
"S-1-5-80-9564540085-343452649-181234038044-18132492631-2234354464"
],
I guess the data comes from JCIFS. Could you please look into this and, if possible, add the data to the response object? Thank you for the great support!
You're the genius who just made it work, but thanks! Grabbing the ACL may be tricky if JCIFS or Apache VFS does not extract it, but hopefully they do. I am turning this into a feature request.
I'm only a dummy who just followed the brilliant instructions :) So, just a small note regarding the file share credentials: the password can be encrypted in the config file, but the logfile and the output documents contain the whole URL with clear-text credentials. Maybe it makes sense to supply them as an additional config parameter and hide them in the logging and metadata. Thanks!
Hi there, in order to make it work, do you need a Windows share or a Samba (Linux) share? And what are your requirements to access it: remote desktop or SSH? Regards
I actually found an environment I think I can use. I will give it a try and let you know if that environment is not adequate. So far things look quite promising though. Thanks a lot for asking. Stay tuned...
The latest snapshot now has better CIFS/SMB integration. All you need to do once you have extracted the zip is download the JCIFS jar file located here and add it to the "lib" folder. Unfortunately it cannot be "distributed" with the zip due to license incompatibilities (LGPL 2.1). I will document that requirement better once you confirm it works fine.
ACLs are now extracted in the form of "collector.acl.smb[x].yyyyy" (e.g. "collector.acl.smb[0].sid"). You will find 8 different ACL properties stored that way (ace, sid, sidAsText, type, typeAsText, domainSid, domainName, and acountName).
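For illustration, a single ACL entry could then appear in the document metadata roughly as follows. The field names come from the naming scheme above; the values are purely hypothetical:

```
collector.acl.smb[0].sid = S-1-5-32-545          (hypothetical value)
collector.acl.smb[0].sidAsText = S-1-5-32-545    (hypothetical value)
collector.acl.smb[0].typeAsText = ALLOWED        (hypothetical value)
collector.acl.smb[0].domainName = YOURDOMAIN     (hypothetical value)
```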
In addition, it is possible to supply the password as a file system option instead of having it in the URL. To do so, you can add this under your <crawler ...> tag:
<optionsProvider class="com.norconex.collector.fs.option.impl.GenericFilesystemOptionsProvider">
<authDomain>YourDomain</authDomain>
<authUsername>username</authUsername>
<authPassword>password</authPassword>
</optionsProvider>
You can also encrypt the password. Look at the GenericFilesystemOptionsProvider class for more options. Alternatively, Apache Commons VFS lets you encrypt the password in its own way when provided in the URL. See how under the "Naming" section here: http://commons.apache.org/proper/commons-vfs/filesystems.html
Please let me know how that goes for you.
I am configuring every step of the way and will let you know. Now I am running into the same problem that I have with ManifoldCF (another crawler you might know): I have a username, while on Solr we have group SIDs and user SIDs. With Manifold we have an "MCF authority service"; do you have plans to create something similar? Also, could you please provide me with the list of JARs in this new version of your product? I have a list for the 2.6 version, which is now useless with this new version. Thanks a lot for your time, angelo
We currently have no plan to build a group server/authority service, as it is out of scope for crawling.
Nothing prevents you from keeping the Manifold authority service with the crawler of your choice. What's important is to have the ACL properly stored with the documents you index. The latest snapshot addresses that part (you may have to rename the fields to match what your Solr search component expects).
If you can have the user authenticate and/or grab the groups from your security source, you may do it without an authority service if you can pass the groups as an encrypted token to your custom Solr search component that decrypts it and compares it with the document ACL.
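As a rough sketch of that last idea in plain Java: once your custom search component has decrypted the token into the user's SIDs, the per-document check is essentially a membership test against the indexed allow tokens. The class and method names here are hypothetical; only the "allow_token_document" values mirror the sample above, and a real implementation would also honor deny entries:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AclCheck {

    // Returns true if any of the user's SIDs (user SID + group SIDs)
    // matches one of the document's allow tokens. Simplified allow-only
    // model; deny ACEs are intentionally ignored in this sketch.
    public static boolean canRead(Set<String> userSids, List<String> allowTokens) {
        for (String token : allowTokens) {
            if (userSids.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical SIDs decrypted from the user's token.
        Set<String> userSids = new HashSet<>(Arrays.asList(
                "S-1-5-32-545",  // BUILTIN\Users
                "S-1-5-21-1111111111-2222222222-3333333333-1001"));

        // Hypothetical values indexed from "allow_token_document".
        List<String> allowTokens = Arrays.asList("S-1-5-32-544", "S-1-5-32-545");

        System.out.println(canRead(userSids, allowTokens)); // prints "true"
    }
}
```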
As for the Jars you need, they are all in the lib folder of the downloadable zip, except for that jcifs jar mentioned earlier.
Thanks a lot for your help. I will try to include all the information first and later integrate the services from Manifold. Also, could you please provide me with the list of JAR versions that I should keep from the Importer and the Solr crawler? Best regards, angelo
Rather than trying to find out which jars to keep, updates are easier when you install in a new location and reinstall your committer (and the JCIFS jar in this case). It is also easier to roll back that way if you run into issues with a new release. Your config can reside anywhere.
If you prefer going the hard way, expand the Filesystem Collector snapshot zip somewhere and check the lib folder for the list of jars you need (it includes all the jars it needs except your committer jars and the JCIFS jar).
Since you can get the ACL now, I am closing this. Re-open or create a new ticket if needed.
hi Norconex team,
Not an issue at all, but are there any plans to add JCIFS support to the filesystem crawler? It would be great if the best crawler framework (it's not a joke! I've used many of them...) could index SMB shares. Thanks!