hardreddata closed 4 years ago
I suspect this happens because directories are also treated as paths for the purpose of filtering, and they do not match your extension. You likely have to tell it to include your directories as well. That is not the simplest thing to set up, because it only takes one matching inclusion rule for a doc to go through. You can try moving your extension filter under <metadataFilters> instead to see if it improves speed a bit. For doing it all in the reference filters, I suggest you try something like this instead (not tested):
<referenceFilters>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
.*/examples/files([^\.]+|.*\.docx)$
</filter>
</referenceFilters>
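To see what the pattern in the filter above accepts, it can be exercised outside the crawler. This is only a minimal sketch: the sample paths are made up, the class and method names are hypothetical, and it assumes the filter requires the whole reference to match the regular expression, which may not be how your collector version applies it.

```java
import java.util.regex.Pattern;

public class ReferenceFilterCheck {
    // Pattern copied verbatim from the <filter> body above.
    static final Pattern INCLUDE =
            Pattern.compile(".*/examples/files([^\\.]+|.*\\.docx)$");

    // Assumption: the whole reference must match, so matches() is used, not find().
    static boolean included(String reference) {
        return INCLUDE.matcher(reference).matches();
    }

    public static void main(String[] args) {
        // Hypothetical references, for illustration only.
        String[] samples = {
            "/data/examples/files",
            "/data/examples/files/reports",
            "/data/examples/files/reports/summary.docx",
            "/data/examples/files/reports/summary.pdf",
            "/data/examples/files/v1.2/notes"
        };
        for (String s : samples) {
            System.out.println(included(s) + "  " + s);
        }
    }
}
```

Under that assumption, sub-directories without a dot and `.docx` files are accepted and the `.pdf` file is rejected. Two things stand out: the bare `/data/examples/files` path is rejected because the group needs at least one more character after `files`, and a directory with a dot in its name (`v1.2`) is rejected as well.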
Thanks for the prompt response.
The metadataFilters section was new to me.
My use case is a network drive and I will take some time to understand the regular expression.
The example you provided above isn't working here. I can use (.*([^\/]+\/?)*) to match anything, but I cannot work out how to filter on file extension. I am confident it will come to me in time.
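One possible shape for the extension part, written as a standalone check rather than a tested crawler configuration: accept a reference when its last path segment either contains no dot (treated here as a directory) or ends in `.docx`. The pattern, the class name and the `//nas01/...` paths are all assumptions made for illustration, and it again assumes the whole reference has to match.

```java
import java.util.regex.Pattern;

public class ExtensionPatternSketch {
    // Last segment has no dot (likely a directory), or ends in .docx (any case).
    // Hypothetical pattern, not taken from the crawler documentation.
    static final Pattern INCLUDE =
            Pattern.compile("(?i)^.*/(?:[^/.]+/?|[^/]*\\.docx)$");

    static boolean included(String reference) {
        return INCLUDE.matcher(reference).matches();
    }

    public static void main(String[] args) {
        // Made-up network-drive style references.
        String[] samples = {
            "//nas01/share/projects",
            "//nas01/share/projects/plan.docx",
            "//nas01/share/projects/PLAN.DOCX",
            "//nas01/share/projects/plan.xlsx",
            "//nas01/share/release.2021"
        };
        for (String s : samples) {
            System.out.println(included(s) + "  " + s);
        }
    }
}
```

The trade-off is that the "no dot means directory" rule is a heuristic: files without an extension are let through, and a directory such as `release.2021` is rejected, so anything beneath it would presumably not be reached.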
Hi,
Thanks again for sharing this crawler.
I feel like this should work
with variables
But I get nothing back
It works if I scan everything and then want to review files it has already found.
I think I can do the below inside <importer> but was hoping to target the search early on to increase speed. Advice invited.
Thanks!