Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

How To Reject URLs indexed before a $date #712

Closed sudeshna-majumder closed 3 years ago

sudeshna-majumder commented 4 years ago

Hi Pascal,

I have a requirement to exclude all URLs that were indexed before a certain date; I need to exclude them from the next crawl onward. I also have an additional situation: I added CurrentDateTagger to my config some time back (you suggested it while helping me with another issue), and I also have older indexed documents that have an empty date field. How can I exclude all of them together?

essiembre commented 4 years ago

Do you mean how to automate this? Normally the crawler performs incremental crawls, so you do not have to worry about this (it keeps track of additions, modifications, and deletions).

If your use case mandates it, there are a few ways to go about it, and it would help to know more about your setup. One that should be simple enough is to modify the collector launch script.

If I am not understanding your question properly, please clarify.

sudeshna-majumder commented 4 years ago

Hi Pascal,

Thanks for your quick reply.

My crawler crawls over 500 domains. Among them I have identified 5 domains for which I no longer want to recrawl pages that were indexed before July 2020. I don't want to delete those pages; I just never want to recrawl them, and even if they are updated in the meantime, I don't need to index the updated versions and would rather keep their older ones. I have written a dedicated crawler for these domains so that I can configure it differently according to this requirement. How can I configure it so that it ignores all the pages indexed before July 2020?

sudeshna-majumder commented 4 years ago

I have one more question.

I have indexed 2 domains under the same collection. One of the domains has since been disabled and is no longer live. All the pages of that site now canonically redirect to a common page like "404.aspx" showing a "document not available" message, but they still appear as top search results. Since they no longer have any content, I don't want them to appear in search results. How can I remove them? I tried adding the URL pattern to a reference filter, with orphanStrategy set to DELETE, but it does not seem to have worked; they still appear in search results. I suspect the crawler is not able to reach them in subsequent crawls. Is there any way to remove those pages?

essiembre commented 4 years ago

Please open new tickets for new questions.

About ignoring docs before a certain date:

If you do not attempt to crawl pages that were already crawled, you will miss those that have changed. You may not care to index them, but if they changed, they may contain links to new pages that you will never discover or crawl. An exception is if you rely only on sitemap.xml files (for sites that have them), in which case new URLs should be made available through the sitemap.

Still, if you want to avoid recrawling pages that were already crawled, regardless of whether they changed, you can use a metadata checksummer and make the crawler believe nothing has changed by building the checksum from the URL (since the URL remains the same for each run). Example:

  <metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" />
  <!-- Build the checksum from the URL only, so already-crawled documents
       always appear unchanged and are not reprocessed. -->
  <metadataChecksummer class="com.norconex.collector.core.checksum.impl.GenericMetadataChecksummer">
    <sourceFields>document.reference</sourceFields>
  </metadataChecksummer>

I am not sure if the above can be useful though.

Another approach is to crawl them, but filter them out based on date in the Importer module. Example:

<!-- Excludes documents whose MY_DATE_FIELD value is earlier ("lt") than the
     given date. -->
<filter class="com.norconex.importer.handler.filter.impl.DateMetadataFilter"
        onMatch="exclude" field="MY_DATE_FIELD">
    <condition operator="lt" date="2020-08-12" />
</filter>

For it to work, the document has to contain a reliable date you can use, which is not always the case. For sites with sitemaps, maybe collector.sitemap-lastmod could do it.
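
As an untested sketch, here is what that could look like, pointing the same filter at collector.sitemap-lastmod and using 2020-07-01 as the cutoff (whether a "format" attribute is needed depends on how the lastmod value is stored, so treat this as an assumption to verify):

<!-- Sketch: exclude documents whose sitemap lastmod date is before the cutoff.
     Depending on how collector.sitemap-lastmod is stored, a "format"
     attribute may be required on the filter. -->
<filter class="com.norconex.importer.handler.filter.impl.DateMetadataFilter"
        onMatch="exclude" field="collector.sitemap-lastmod">
    <condition operator="lt" date="2020-07-01" />
</filter>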

Another approach, without cracking the Collector code open, could be to store the URL in a non-unique field and use a UUID field for the ID instead. If you also store the crawl date with each document, you can then check for URL duplicates as a post-process and delete those with older dates.
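
A rough sketch of that idea using Importer taggers (the crawl_date field name, the default UUID target field, and the committer wiring mentioned below are assumptions to verify against your versions):

<importer>
  <postParseHandlers>
    <!-- Assigns each document a random UUID in the tagger's default target
         field (assumed here to be document.uuid). -->
    <tagger class="com.norconex.importer.handler.tagger.impl.UUIDTagger" />
    <!-- Stamps the crawl date so duplicate URLs can be compared later. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.CurrentDateTagger"
            field="crawl_date" format="yyyy-MM-dd'T'HH:mm:ss" />
  </postParseHandlers>
</importer>

The committer would then use the UUID field as the document ID (for example, via its sourceReferenceField setting) so the URL remains a regular, non-unique field, and the post-process simply keeps the newest crawl_date per URL and deletes the rest.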

We can also make it a feature request to store in a new field the date a document was last crawled so you can use that for filtering in the Importer module.

About disabled domains:

If you change the orphan strategy only after you notice the problem, it is too late, and you will have to delete those pages from your index manually. Run a deletion query on your index for records matching that domain.
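
For example, assuming a Solr index and a hypothetical "host" field identifying the site (adjust the field name and value to your own schema), the deletion could be posted to Solr's update handler as something like:

<delete>
  <!-- "host" and the domain are placeholders: query whatever field in your
       schema identifies the disabled site or its URLs. -->
  <query>host:"www.disabled-domain.com"</query>
</delete>

For other back ends (e.g., Elasticsearch), the equivalent is a delete-by-query against whatever field holds the URL or site.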

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.