Norconex / collector-filesystem

Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to network locations into various data repositories such as search engines.
http://www.norconex.com/collectors/collector-filesystem/

Crawl only for certain file extensions? #56

Closed · hardreddata closed this 4 years ago

hardreddata commented 4 years ago

Hi,

Thanks again for sharing this crawler.

I feel like this should work:

<fscollector id="Documents">

    <logsDir>${workdir}\logs</logsDir>
    <progressDir>${workdir}\progress</progressDir>
    <crawlers>
        <crawler id="Sample Crawler">

            <workDir>${workdir}</workDir>
            <startPaths>
                <path>${path}</path>
            </startPaths>
            <numThreads>2</numThreads>
            <keepDownloads>false</keepDownloads>
            <referenceFilters>
                <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="include" >docx</filter> 
            </referenceFilters>
            <committer class="com.norconex.committer.core.impl.XMLFileCommitter">
                <directory>${workdir}\xml</directory>
                <pretty>true</pretty>
                <docsPerFile>100</docsPerFile>
                <compress>false</compress>
                <splitAddDelete>false</splitAddDelete>
            </committer>

        </crawler>
    </crawlers>
</fscollector>

with these variables:

path = ./examples/files
workdir = ./examples-output

But I get nothing back:

INFO  [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=INCLUDE,extensions=docx,caseSensitive=false]
INFO  [AbstractCollectorConfig] Configuration loaded: id=Documents; logsDir=./examples-output\logs; progressDir=./examples-output\progress
INFO  [JobSuite] JEF work directory is: .\examples-output\progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] No previous execution detected.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex Filesystem Collector 2.9.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.10.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.10.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.2 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.3 (Norconex Inc.)
INFO  [JobSuite] Running Sample Crawler: BEGIN (Mon Sep 07 09:44:25 AEST 2020)
INFO  [CrawlerEventManager]           REJECTED_FILTER: D:\scratch\norconex\norconex-collector-filesystem-2.9.0\.\examples\files (No "include" reference filters matched.)
INFO  [FilesystemCrawler] 1 start paths identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Sample Crawler: Crawling references...
INFO  [AbstractCrawler] Sample Crawler: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] Sample Crawler: Crawler finishing: committing documents.
INFO  [AbstractCrawler] Sample Crawler: 0 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] Sample Crawler: Crawler completed.
INFO  [AbstractCrawler] Sample Crawler: Crawler executed in 0 second.
INFO  [JobSuite] Running Sample Crawler: END (Mon Sep 07 09:44:25 AEST 2020)

It works if I scan everything first and then review the files it has already found.

I think I can do the below inside <importer>, but I was hoping to narrow the search earlier on to improve speed.

<preParseHandlers>
    <!-- to filter on file extension: -->
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" onMatch="include" field="document.reference">
        <regex>.*(docx)$</regex>
    </filter>
</preParseHandlers>
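
In context, that handler would nest under the crawler's <importer> section, roughly like this (a sketch only, not tested):

<crawler id="Sample Crawler">
    ...
    <importer>
        <preParseHandlers>
            <!-- keep only documents whose reference ends in docx -->
            <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" onMatch="include" field="document.reference">
                <regex>.*(docx)$</regex>
            </filter>
        </preParseHandlers>
    </importer>
</crawler>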

Advice invited.

Thanks!

essiembre commented 4 years ago

I suspect this happens because directories are also considered references for the purpose of filtering, and they do not match your extension, so your start path gets rejected before it is ever descended into. You likely have to tell it to include your directories as well, which is not the simplest given that it only takes one matching "include" rule for a document to go through.

You can try moving your extension filter under <metadataFilters> instead to see if it improves speed a bit. For doing it all in the reference filters, I suggest you try something like this instead (not tested):

<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
       .*/examples/files([^\.]+|.*\.docx)$
  </filter> 
</referenceFilters>
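
For the <metadataFilters> option, assuming the same extension filter class is accepted there, it could look something like this (also not tested):

<metadataFilters>
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="include">docx</filter>
</metadataFilters>
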
hardreddata commented 4 years ago

Thanks for the prompt response.

The metadataFilters section was new to me.

My use case is a network drive and I will take some time to understand the regular expression.

The example you provided above isn't working here. I can use (.*([^\/]+\/?)*) to match anything, but I cannot work out how to filter on the file extension. I am confident it will come to me in time.
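
My best guess so far, adapted for a path with backslashes, is something along these lines (untested; the intent is to let anything without an extension through so directories keep getting traversed, and otherwise only accept .docx):

<referenceFilters>
  <!-- first alternative: last path segment has no dot (likely a directory);
       second alternative: reference ends in .docx -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
    .*[/\\]([^.]+|.*\.docx)$
  </filter>
</referenceFilters>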