Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Sitemap path in robots.txt not recognized #726

Closed: pipaltree closed this issue 3 years ago

pipaltree commented 3 years ago

On a site whose robots.txt specifies a sitemap path, Norconex does not pick it up. The SitemapResolverFactory is configured to use only the locations declared in robots.txt by leaving the path tag empty:

<sitemapResolverFactory ignore="false" lenient="true"
    class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory">
    <path/>
</sitemapResolverFactory>

Content of the robots.txt:

User-agent: *
Disallow: /fileadmin/_temp_/
Disallow: /t3lib/
Disallow: /typo3/
Disallow: /typo3_src/
Disallow: /typo3conf/
Disallow: /clear.gif
Allow: /typo3/sysext/frontend/Resources/Public/*
Sitemap: https://www.skd.museum/index.php?id=1&type=841132
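For reference, the Sitemap directive is defined by the sitemaps.org protocol as a case-insensitive, group-independent line in robots.txt, so extracting it is a simple line scan. A minimal Python sketch of that lookup (illustrative only, not Norconex code):

```python
def sitemap_urls(robots_txt: str) -> list:
    """Collect the URLs of all Sitemap: directives in a robots.txt body.

    Per the sitemaps.org protocol, the directive is case-insensitive and
    applies regardless of any surrounding User-agent group.
    """
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so URLs like https://... stay intact.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```

Running this against the robots.txt above yields the single URL from its Sitemap line, so the directive itself is well-formed; the problem must lie in how the file is fetched.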

I have enabled DEBUG-level logging for the collector in log4j.properties: log4j.logger.com.norconex.collector.http=DEBUG

But when I start crawling, the log says "No sitemap paths specified.":

[non-job]: 2020-11-16 15:49:02 INFO - Starting execution.
[non-job]: 2020-11-16 15:49:02 INFO - Version: Norconex HTTP Collector 2.9.0 (Norconex Inc.)
[non-job]: 2020-11-16 15:49:02 INFO - Version: Norconex Collector Core 1.10.0 (Norconex Inc.)
[non-job]: 2020-11-16 15:49:02 INFO - Version: Norconex Importer 2.10.0 (Norconex Inc.)
[non-job]: 2020-11-16 15:49:02 INFO - Version: Norconex JEF 4.1.2 (Norconex Inc.)
[non-job]: 2020-11-16 15:49:02 INFO - Version: Norconex Committer Core 2.1.3 (Norconex Inc.)
[non-job]: 2020-11-16 15:49:02 INFO - Version: Norconex Committer Solr 2.4.0 (Norconex Inc.)
skd-de: 2020-11-16 15:49:02 INFO - Running skd-de: BEGIN (Mon Nov 16 15:49:02 CET 2020)
skd-de: 2020-11-16 15:49:02 INFO - skd-de: RobotsTxt support: true
skd-de: 2020-11-16 15:49:02 INFO - skd-de: RobotsMeta support: true
skd-de: 2020-11-16 15:49:02 INFO - skd-de: Sitemap support: true
skd-de: 2020-11-16 15:49:02 INFO - skd-de: Canonical links support: true
skd-de: 2020-11-16 15:49:02 INFO - skd-de: User-Agent: XIMA-Crawler
skd-de: 2020-11-16 15:49:02 INFO - skd-de: Initializing sitemap store...
skd-de: 2020-11-16 15:49:02 DEBUG - skd-de: Cleaning sitemap store...
skd-de: 2020-11-16 15:49:02 INFO - skd-de: Done initializing sitemap store.
skd-de: 2020-11-16 15:49:03 DEBUG - URL redirect: https://www.skd.museum/robots.txt -> https://www.skd.museum/robots.txt/
skd-de: 2020-11-16 15:49:03 DEBUG - Fetched and parsed robots.txt: https://www.skd.museum/robots.txt
skd-de: 2020-11-16 15:49:03 DEBUG - No sitemap paths specified.
skd-de: 2020-11-16 15:49:03 DEBUG - Sitemap locations: []

[...]

What could be the issue here?

essiembre commented 3 years ago

This occurred because robots.txt on that site redirects to robots.txt/ (with a trailing slash). Since robots.txt is expected at a standard location, the crawler tried to read it from robots.txt (no slash) and failed because the response was not the expected content. I just deployed a new 2.9.1-SNAPSHOT version that follows the redirect. Please try it and confirm.
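The fix described above can be pictured as a fetch loop that follows a bounded number of 3xx redirects before reading the body. A minimal Python sketch, where `fetch` is a hypothetical callable returning `(status, location, body)` (this is illustrative logic under those assumptions, not the actual Norconex implementation):

```python
def fetch_following_redirects(fetch, url, max_redirects=5):
    """Resolve url, following 3xx redirects up to max_redirects times.

    `fetch` is a hypothetical callable: url -> (status, location, body).
    Returns the final (url, body); raises if redirects never settle.
    """
    for _ in range(max_redirects + 1):
        status, location, body = fetch(url)
        if 300 <= status < 400 and location:
            # Follow the redirect, e.g. /robots.txt -> /robots.txt/
            url = location
            continue
        return url, body
    raise RuntimeError("too many redirects for " + url)
```

With a pre-2.9.1 behavior, the first 301 response would instead be treated as a failed robots.txt fetch, which is why the log reported "No sitemap paths specified."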

pipaltree commented 3 years ago

Thanks for your reply and for providing the snapshot! I have a lot of work on my plate at the moment, but I will try it as soon as possible and report back.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.