javpdiaz closed this issue 6 years ago
Reference filters are applied before a document is downloaded. In your case, if you want to keep only PDF and PPT files, the crawler still needs to crawl a bunch of HTML pages before it gets to those. By rejecting the HTML pages at the reference stage, you are not giving it a chance to get there.
There are a few ways around this, but one is to use document filters instead, which are applied after documents are downloaded and their links extracted:
<documentFilters>
<filter class="$filterExtension">pdf,ppt</filter>
</documentFilters>
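To illustrate why the stage at which the filter runs matters, here is a rough Python sketch. It is not crawler code: the in-memory `SITE` map, the `crawl` function, and the example URLs are all made up for illustration, and it only assumes that a reference filter rejects a URL before download while a document filter rejects it after its links have been extracted.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
SITE = {
    "http://example.com/index.html": ["http://example.com/docs.html"],
    "http://example.com/docs.html": [
        "http://example.com/a.pdf",
        "http://example.com/b.ppt",
    ],
    "http://example.com/a.pdf": [],
    "http://example.com/b.ppt": [],
}
WANTED = ("pdf", "ppt")


def has_wanted_extension(url):
    """True if the URL ends with one of the wanted extensions."""
    return url.rsplit(".", 1)[-1].lower() in WANTED


def crawl(start, filter_stage):
    """Crawl SITE from `start`; filter_stage is 'reference' or 'document'."""
    queue, seen, kept = deque([start]), {start}, []
    while queue:
        url = queue.popleft()
        # Reference stage: rejected URLs are never downloaded,
        # so the links they contain are never discovered.
        if filter_stage == "reference" and not has_wanted_extension(url):
            continue
        # "Download" the page and queue the links extracted from it.
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # Document stage: links were already followed above;
        # only now decide whether the document itself is kept.
        if has_wanted_extension(url):
            kept.append(url)
    return kept


start = "http://example.com/index.html"
print(crawl(start, "reference"))
print(crawl(start, "document"))
```

With the filter at the reference stage, the start page is rejected immediately and nothing is kept. With the filter at the document stage, the HTML pages are still followed but only the PDF and PPT files are kept.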
You can get the execution flow here.
Let me know if that resolves your issue.
It works perfectly, many thanks.
Great. Thanks for confirming.
I need to extract only certain types of files from a repository, for example .pdf, .ppt, ... I am using this configuration, but it does not work.
I know it's a problem with the filters, but I do not know how to fix it.
If I leave it this way, the crawler does work, but it sends many items to Solr that I do not need.