Norconex / collector-filesystem

Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data from local hard drives and network locations, and committing it to various data repositories such as search engines.
http://www.norconex.com/collectors/collector-filesystem/

Troubles with SMB/CIFS on Windows mapped network drive #54

Closed: hardreddata closed this issue 4 years ago

hardreddata commented 4 years ago

Hi,

Thanks for making this tool available. It is very useful. I have it working when scanning local drives.

I am running the latest 2.9 snapshot and have patched the CIFS .jar per https://github.com/Norconex/collector-filesystem/issues/49

Note that I had to get it from https://mvnrepository.com/artifact/jcifs/jcifs/1.3.17 as the link on the Norconex website ( http://central.maven.org/maven2/jcifs/jcifs/1.3.17/jcifs-1.3.17.jar ) no longer works.
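
For what it's worth, the same jar can also be pulled in through Maven using the coordinates from that URL. A minimal sketch, assuming a Maven-based build (normally the jar just needs to be dropped into the collector's lib folder):

    <!-- same coordinates as the mvnrepository link above -->
    <dependency>
        <groupId>jcifs</groupId>
        <artifactId>jcifs</artifactId>
        <version>1.3.17</version>
    </dependency>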

I did explore patching commons-vfs per https://github.com/Norconex/collector-filesystem/issues/3, but I think this is no longer required?

The config (where domain and password1 are replaced)

    <logsDir>${workdir}/logs</logsDir>
    <progressDir>${workdir}/progress</progressDir>
    <crawlers>
        <crawler id="Sample Crawler">

            <workDir>${workdir}</workDir>
            <startPaths>
                <path>${path}</path>
            </startPaths>
            <numThreads>2</numThreads>
            <keepDownloads>false</keepDownloads>
            <optionsProvider class="com.norconex.collector.fs.option.impl.GenericFilesystemOptionsProvider">
                <!-- Authentication (any file system) -->
                <authDomain>domain</authDomain>
                <authUsername>russell.grew</authUsername>
                <authPassword>password1</authPassword>
            </optionsProvider>  

            <committer class="com.norconex.committer.core.impl.XMLFileCommitter">
                <directory>${workdir}/xml</directory>
                <pretty>true</pretty>
                <docsPerFile>100</docsPerFile>
                <compress>false</compress>
                <splitAddDelete>false</splitAddDelete>
            </committer>

        </crawler>
    </crawlers>
</fscollector>

With variables:

    path = s:/RussellG/crawlme
    workdir = ./examples-output

Noting that s:/ is a mapped network drive.

Throws

INFO  [JobSuite] Running Sample Crawler: BEGIN (Fri Aug 14 08:09:21 AEST 2020)
INFO  [FilesystemCrawler] 1 start paths identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Sample Crawler: Crawling references...
ERROR [SpecificLocalFileFetcher] Could not retreive SMB ACL data.
java.nio.file.NoSuchFileException: \RussellG\crawlme\My file.docx
        at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:85)
        at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
        at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:108)
        at java.base/sun.nio.fs.WindowsLinkSupport.getFinalPath(WindowsLinkSupport.java:107)
        at java.base/sun.nio.fs.WindowsAclFileAttributeView.getOwner(WindowsAclFileAttributeView.java:120)
        at com.norconex.collector.fs.fetch.impl.SpecificLocalFileFetcher.fetchAcl(SpecificLocalFileFetcher.java:77)
        at com.norconex.collector.fs.fetch.impl.SpecificLocalFileFetcher.fetchFileSpecificMeta(SpecificLocalFileFetcher.java:55)
        at com.norconex.collector.fs.fetch.impl.GenericFileMetadataFetcher.fetchMetadada(GenericFileMetadataFetcher.java:75)
        at com.norconex.collector.fs.pipeline.importer.FileImporterPipeline$FileMetadataFetcherStage.executeStage(FileImporterPipeline.java:153)
        at com.norconex.collector.fs.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
        at com.norconex.collector.fs.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.fs.crawler.FilesystemCrawler.executeImporterPipeline(FilesystemCrawler.java:228)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:829)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
INFO  [CrawlerEventManager] DOCUMENT_METADATA_FETCHED: file:///s:/RussellG/crawlme/My file.docx
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: file:///s:/RussellG/crawlme/My file.docx

Any advice is very welcome.

essiembre commented 4 years ago

The issue seems related to your starting path. If your drive is already mapped for the account running the crawler and you are not concerned with document ACLs, you should not need to add CIFS support.

If you want to extract the ACLs and crawl using the SMB/CIFS protocol, you likely need to specify your start path like this:

    smb://hostname/pathToMappedDir/RussellG/crawlme
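
In config terms, that start path would go where the mapped-drive path was before (hostname and pathToMappedDir are placeholders for your actual server and share):

    <startPaths>
        <!-- replace hostname and pathToMappedDir with your server and share -->
        <path>smb://hostname/pathToMappedDir/RussellG/crawlme</path>
    </startPaths>
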
hardreddata commented 4 years ago

Thanks for your help.

The smb paths result in the error below.

INFO  [AbstractCrawler] Sample Crawler: Crawling references...
INFO  [AbstractCrawler] Sample Crawler: 10% completed (1 processed/10 total)
ERROR [SpecificSmbFetcher] Could not retreive SMB ACL data.
jcifs.smb.SmbException: The handle is invalid.
        at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:563)
        at jcifs.smb.SmbTransport.send(SmbTransport.java:663)
        at jcifs.smb.SmbSession.send(SmbSession.java:238)
        at jcifs.smb.SmbTree.send(SmbTree.java:119)
        at jcifs.smb.SmbFile.send(SmbFile.java:775)
        at jcifs.smb.SmbFile.close(SmbFile.java:1023)
        at jcifs.smb.SmbFile.getSecurity(SmbFile.java:2904)
        at jcifs.smb.SmbFile.getSecurity(SmbFile.java:2975)
        at com.norconex.collector.fs.fetch.impl.SpecificSmbFetcher.fetchFileSpecificMeta(SpecificSmbFetcher.java:69)
        at com.norconex.collector.fs.fetch.impl.GenericFileMetadataFetcher.fetchMetadada(GenericFileMetadataFetcher.java:75)
        at com.norconex.collector.fs.pipeline.importer.FileImporterPipeline$FileMetadataFetcherStage.executeStage(FileImporterPipeline.java:153)
        at com.norconex.collector.fs.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
        at com.norconex.collector.fs.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.fs.crawler.FilesystemCrawler.executeImporterPipeline(FilesystemCrawler.java:228)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:829)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

The only ACL attribute I am really interested in is the owner, and I can live without it. I think another way forward for me would be if this ACL fetching could be disabled via config.
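
Purely to illustrate what I mean, something along these lines in the options provider would do; note the aclDisabled tag is invented here and does not exist in the current configuration as far as I can tell:

    <optionsProvider class="com.norconex.collector.fs.option.impl.GenericFilesystemOptionsProvider">
        <!-- hypothetical switch, not part of the actual 2.9.x schema -->
        <aclDisabled>true</aclDisabled>
        <authDomain>domain</authDomain>
        <authUsername>russell.grew</authUsername>
        <authPassword>password1</authPassword>
    </optionsProvider>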

Any advice is very welcome.

essiembre commented 4 years ago

It turns out the first problem you got is due to not being able to extract the ACL when the Windows drive is different from the one the crawler is running on (e.g., C: vs S:). I just made a new 2.9.1-SNAPSHOT Filesystem Collector release with a fix for this.

Please give it a try and confirm.

hardreddata commented 4 years ago

Thanks for the prompt responses.

The fix worked great for path = s:/RussellG/crawlme.

As you suggested, I did not need to add CIFS support.