Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystems and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Unable to crawl sitemap entries with images #761

Closed: punkch closed this issue 3 years ago

punkch commented 3 years ago

Hello,

I am crawling a website where some entries in the sitemap have images, like so:

<url>
    <loc>https://example.com/about</loc>
    <lastmod>2021-01-28T16:11:08+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/about" />
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/about" />
    <xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/about" />
    <xhtml:link rel="alternate" hreflang="es" href="https://example.com/es/about" />
    <image:image>
      <image:loc>https://example.com/about/photos/some_image.jpg</image:loc>
      <image:title>About Image</image:title>
    </image:image>
  </url>

In my config, I also have a referenceFilter to exclude by extension:

<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">jpg,jpeg,gif,png,ico,css,js,svg,pdf</filter>
</referenceFilters>

During sitemap processing, the crawler correctly rejects the filtered image URLs by extension, but it does not crawl the URL in the <loc> tag (i.e., it rejects example.com/about/photos/some_image.jpg but also won't crawl example.com/about).

Removing the image tags from the sitemap resolves the issue.

I am using the latest 2.9.1-SNAPSHOT.

essiembre commented 3 years ago

Are you sure you are using the latest snapshot? This issue has been fixed on July 28th: #758.

punkch commented 3 years ago

It's the one I downloaded from the website. The date says July 28th, but the file timestamp says July 12th. I will build from source and try again. Thank you for the quick reply.

punkch commented 3 years ago

I've built it from source and the issue is still present. I am confident that I am using the latest code in the 2.x branch. This is what the generated apidocs say for the version:

cat apidocs/overview-summary.html | head -n 10
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<!-- NewPage -->
<html lang="en">
<head>
<!-- Generated by javadoc (1.8.0_232) on Mon Jul 12 02:16:20 EDT 2021 -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Overview (Norconex HTTP Collector 2.9.1-SNAPSHOT API)</title>
<meta name="date" content="2021-07-12">
<link rel="stylesheet" type="text/css" href="stylesheet.css" title="Style">
<script type="text/javascript" src="script.js"></script>

And this is my Dockerfile

FROM openjdk:8-jdk-stretch as base
ARG CRAWLER_HOME=/opt/collector-http
ENV CRAWLER_HOME=${CRAWLER_HOME}
ARG COLLECTOR_VERSION=2.9.1-SNAPSHOT
ENV COLLECTOR_VERSION=$COLLECTOR_VERSION
ARG COMMITER_VERSION=0.0.5
ENV COMMITER_VERSION=$COMMITER_VERSION

FROM base
RUN apt-get update && \
    apt-get install -y \
      git \
      maven \
      unzip

# build collector
#   git checkout tags/norconex-collector-http-${COLLECTOR_VERSION} && \
RUN  cd /tmp && \
  git clone https://github.com/Norconex/collector-http && \
  cd collector-http/ && \
  git checkout 2.x-branch && \
  cd norconex-collector-http/ && \
  mvn package -DskipTests && \
  mkdir -p /tmp/dist && \
  unzip target/norconex-collector-http-${COLLECTOR_VERSION}.zip -d /tmp/dist

COPY ./norconex-collector-http-2.9.1-SNAPSHOT /tmp/dist/norconex-collector-http-2.9.1-SNAPSHOT

RUN ls -al /tmp/dist
# build cloudsearch committer
RUN cd /tmp && \
  git clone https://github.com/google-cloudsearch/norconex-committer-plugin.git  && \
  cd norconex-committer-plugin && git checkout tags/v1-${COMMITER_VERSION}  && \
  mvn package -DskipTests && \
  mkdir -p /tmp/dist  && \
  unzip target/google-cloudsearch-norconex-committer-plugin-v1-${COMMITER_VERSION}.zip -d /tmp/dist

FROM base
RUN groupadd norconex && \
    useradd --create-home --shell /bin/bash -g norconex norconex && \
    apt-get update && \
    apt-get install -y \
      jq

RUN mkdir -p ${CRAWLER_HOME} && \
    mkdir -p ${CRAWLER_HOME}/config && \
    mkdir -p ${CRAWLER_HOME}/output && \
    mkdir -p ${CRAWLER_HOME}/cloudsearch

COPY --from=1 /tmp/dist/norconex-collector-http-${COLLECTOR_VERSION} ${CRAWLER_HOME}
COPY --from=1 /tmp/dist/google-cloudsearch-norconex-committer-plugin-v1-${COMMITER_VERSION} ${CRAWLER_HOME}/cloudsearch
# install cloudsearch committer. send 1 to stdin when prompted
RUN cd ${CRAWLER_HOME}/cloudsearch && \
  /bin/bash -c "java -Dfile.encoding=UTF8 -cp \"./lib/*:../lib/*\" com.norconex.commons.lang.jar.JarCopier \"./lib\" \"${CRAWLER_HOME}/lib\" <<< \"1\""

RUN chown -R norconex:norconex ${CRAWLER_HOME} && chmod -R 755 ${CRAWLER_HOME}

COPY ./entrypoint.sh /entrypoint.sh

USER norconex
WORKDIR ${CRAWLER_HOME}

I am of course willing to provide more details and help troubleshoot the issue; just let me know what you need and how.

essiembre commented 3 years ago

I was able to reproduce. It has to do with namespace/tag name conflicts. I'll work on a fix.

essiembre commented 3 years ago

I just made a new 2.x snapshot release with a fix. Please give it a try and confirm.

punkch commented 3 years ago

Hi Pascal,

Thank you so much for looking into this.

However, it is still not working for me. I think it correctly identifies the URLs in the sitemap (800 in my case), but after that it only crawls 146 of them (the entries without images).

This is what the logs look like:

INFO  [CrawlerEventManager] Website Crawler:           REJECTED_FILTER: https://example.com/some-image.jpg(ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,jpeg,gif,png,ico,css,js,svg,pdf,caseSensitive=false])
INFO  [StandardSitemapResolver]          Resolved: https://example.com/sitemap.xml
INFO  [HttpCrawler] 800 start URLs identified.
INFO  [CrawlerEventManager] Website Crawler:           CRAWLER_STARTED
INFO  [AbstractCrawler] Website Crawler: Crawling references...
......
INFO  [AbstractCrawler] Website Crawler: 8% completed (13 processed/146 total)

essiembre commented 3 years ago

From what you are reporting, the issue with the sitemap appears to be fixed. In all likelihood, the missing URLs were rejected. Do you have an example of a URL in your sitemap that is not crawled? If so, check the logs for it to see if it was rejected. You may have to raise the log4j log level to see more information.
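
For example, something along these lines bumps the relevant loggers to DEBUG. This is only a rough sketch: it assumes the default log4j.properties sitting at the root of the 2.x install, so adjust the path and logger keys to match your setup.

# Raise logging for the collector packages to DEBUG so rejected/skipped
# references are reported in more detail.
echo "log4j.logger.com.norconex.collector.http=DEBUG" >> /opt/collector-http/log4j.properties
echo "log4j.logger.com.norconex.collector.core=DEBUG" >> /opt/collector-http/log4j.properties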

essiembre commented 3 years ago

Hello Pencho, I tried to reproduce using the material you sent me directly for this ticket.

I was able to confirm all 800 URLs in your sitemap were extracted properly. That confirms the sitemap fix works and the issue is with something else.

Unfortunately, I could not establish why some documents are rejected because I could not test further: the URLs are being redirected to Google Auth.

I can share a few observations/ideas in case it helps you troubleshoot:

I see you pass a Bearer token in the fetcher HTTP headers. Does that token expire? Also, is it enough on its own for the crawler to get through? If the crawler has to redirect to Google to validate the bearer token and come back, that could be an issue, because the Google redirect URLs will get rejected by your <startURLs stayOn... directives (as they leave the initial domain).

I also noticed the crawler tries to locate additional sitemaps at the root of your domain. To prevent this, you can disable the sitemap resolver (<sitemapResolverFactory ignore="true" ...). I know it is not intuitive to disable it when you are relying on a sitemap in your start URLs, but the sitemap resolver's primary task is to try to "resolve" sitemaps by itself (i.e., to locate them on its own). Since you are specifying the sitemap explicitly as a start URL, you do not need the "resolver". I do not think that is linked to your issue, though.

Your config otherwise looks fine. Could it be that the rejected pages somehow have all their content stripped because of the StripBetweenTransformer? You may want to try without it to see if that makes a difference.

Can you try replacing your sitemap start URL with a <url> entry pointing to one of the "faulty" documents? That could tell us whether the problem is specific to those pages.

There is also version 3, which you may want to try if you do not mind that it is not an official release yet.

Hopefully some of the above helps. If not, please describe what happens when you try to index a single URL directly. Attach the logs and maybe the HTML file for that page (which you can send to me directly if it is too sensitive).
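
A single-URL test can be as simple as pointing a copy of your config at just that one page and launching it with the standard 2.x launch script. This is only a sketch: the config file name is hypothetical and the install path is taken from the Dockerfile above.

# Run the collector against a config whose <startURLs> contains only the one <url> to test.
cd /opt/collector-http
./collector-http.sh -a start -c ./config/single-url-test.xml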

punkch commented 3 years ago

Hi Pascal,

Sorry for making it so hard to reproduce. I've enabled the sitemap images for the website I am trying to crawl and could reproduce the issue. It is public and doesn't require any tokens in the headers, there are no redirects to Google Auth, etc. Also, thanks to your pointers, I could simplify my config a lot (removing the StripBetweenTransformer, setting ignore="true" on the sitemapResolverFactory).

All the configs, logs, and prettified versions of the sitemap XMLs are available in this archive: logs-and-configs.zip

So with this config:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="UNOPS Custom HTTP Collector">
  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($committer = "com.norconex.committer")
  #set($httpClientFactory   = "${http}.client.impl.GenericHttpClientFactory")
  #set($urlNormalizer       = "${http}.url.impl.GenericURLNormalizer")
  #set($recrawlableResolver = "${http}.recrawl.impl.GenericRecrawlableResolver")
  #set($filterExtension     = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef      = "${core}.filter.impl.RegexReferenceFilter")
  #set($googleCommitter     = "com.norconex.committer.googlecloudsearch")
  #set($stripper            = "com.norconex.importer.handler.transformer.impl.StripBetweenTransformer")
  #set($renameTagger        = "com.norconex.importer.handler.tagger.impl.RenameTagger")
  #set($fsCommitter         = "${committer}.core.impl.FileSystemCommitter")

  <progressDir>./output/progress</progressDir>
  <logsDir>./output/logs</logsDir>
  <crawlers>
    <crawler id="Website Crawler">
      <httpClientFactory class="$httpClientFactory">
        <!-- HTTP request headers passed on every HTTP request -->
        <cookiesDisabled>false</cookiesDisabled>
        <!-- Connection timeouts  -->
        <connectionTimeout>60000</connectionTimeout>
        <socketTimeout>60000</socketTimeout>
        <connectionRequestTimeout>60000</connectionRequestTimeout>
      </httpClientFactory>

      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <sitemap>https://www.unops.org/sitemaps-2-section-projectAndLocationPages-1-sitemap.xml</sitemap>
        <!-- <url>https://www.unops.org/africa</url> -->
        <!-- <sitemap>https://www.unops.org/sitemaps-2-section-newsArticles-1-sitemap.xml</sitemap>  -->
        <!-- <url>https://www.unops.org/news-and-stories/insights/diseases-without-borders</url> -->
      </startURLs>

      <robotsTxt ignore="true"/>

      <sitemapResolverFactory ignore="true" lenient="true"></sitemapResolverFactory>

      <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
        <normalizations>
          removeQueryString, removeFragment,lowerCaseSchemeHost, 
          upperCaseEscapeSequence, decodeUnreservedCharacters, 
          removeDefaultPort, removeDotSegments, removeSessionIds, 
          upperCaseEscapeSequence
        </normalizations>
      </urlNormalizer>

      <keepDownloads>true</keepDownloads>

      <maxDepth>0</maxDepth>
      <maxDocuments>-1</maxDocuments>
      <orphansStrategy>PROCESS</orphansStrategy>
      <delay default="500" />
      <workDir>./output</workDir>
      <numThreads>20</numThreads>

      <referenceFilters>
        <filter class="${filterExtension}" onMatch="exclude">jpg,jpeg,gif,png,ico,css,js,svg,pdf</filter>
      </referenceFilters>

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>/opt/collector-http/output/files</directory>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>

I first tried crawling this sitemap: https://www.unops.org/sitemaps-2-section-projectAndLocationPages-1-sitemap.xml (00-projects-and-locations-sitemap.xml in the zip archive)

startURLs config

      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <sitemap>https://www.unops.org/sitemaps-2-section-projectAndLocationPages-1-sitemap.xml</sitemap>
      </startURLs>

The first url (/project-locations) is the only one without an image and the only one processed.

<url>
  <loc>https://www.unops.org/project-locations</loc>
  <lastmod>2021-03-19T16:02:23+01:00</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
  <xhtml:link rel="alternate" hreflang="x-default" href="https://www.unops.org/project-locations" />
  <xhtml:link rel="alternate" hreflang="en" href="https://www.unops.org/project-locations" />
  <xhtml:link rel="alternate" hreflang="fr" href="https://www.unops.org/fr/project-locations" />
  <xhtml:link rel="alternate" hreflang="es" href="https://www.unops.org/es/project-locations" />
</url>

Logs (from file 1-sitemap-projectAndLocations.log in the zip)

Website Crawler: 2021-09-21 13:13:04 DEBUG - ACCEPTED document reference. Reference=https://www.unops.org/project-locations Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,jpeg,gif,png,ico,css,js,svg,pdf,caseSensitive=false]
Website Crawler: 2021-09-21 13:13:05 DEBUG - Queued for processing: https://www.unops.org/project-locations

The second one (/africa) is not crawled, and there is only a message that the image in the image:loc element is rejected because of the ExtensionReferenceFilter.

<url>
    <loc>https://www.unops.org/africa</loc>
    <lastmod>2021-06-16T08:59:49+02:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.6</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://www.unops.org/africa" />
    <xhtml:link rel="alternate" hreflang="en" href="https://www.unops.org/africa" />
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.unops.org/fr/africa" />
    <xhtml:link rel="alternate" hreflang="es" href="https://www.unops.org/es/africa" />
    <image:image>
      <image:loc>https://content.unops.org/photos/MG_3690_210318_143141.jpg?mtime=20210318143141&amp;focal=none</image:loc>
      <image:title>Mg 3690</image:title>
    </image:image>
</url>

Logs:

Website Crawler: 2021-09-21 13:13:05 DEBUG - REJECTED document reference . Reference=https://content.unops.org/photos/MG_3690_210318_143141.jpg?mtime=20210318143141&focal=none Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,jpeg,gif,png,ico,css,js,svg,pdf,caseSensitive=false]
Website Crawler: 2021-09-21 13:13:05 INFO - Website Crawler:           REJECTED_FILTER: https://content.unops.org/photos/MG_3690_210318_143141.jpg?mtime=20210318143141&focal=none (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,jpeg,gif,png,ico,css,js,svg,pdf,caseSensitive=false])

If I add the URL in addition to the sitemap, it is processed successfully. There is still a REJECTED message about the image.

startURLs

<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
  <sitemap>https://www.unops.org/sitemaps-2-section-projectAndLocationPages-1-sitemap.xml</sitemap>
  <url>https://www.unops.org/africa</url>
</startURLs>

Logs (2-sitemap-projectsAndLocation-africa-url.log)

Website Crawler: 2021-09-21 13:43:53 DEBUG - Queued for processing: https://www.unops.org/project-locations
Website Crawler: 2021-09-21 13:43:53 DEBUG - REJECTED document reference . Reference=https://content.unops.org/photos/MG_3690_210318_143141.jpg?mtime=20210318143141&focal=none Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,jpeg,gif,png,ico,css,js,svg,pdf,caseSensitive=false]
Website Crawler: 2021-09-21 13:43:53 INFO - Website Crawler:           REJECTED_FILTER: https://content.unops.org/photos/MG_3690_210318_143141.jpg?mtime=20210318143141&focal=none (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,jpeg,gif,png,ico,css,js,svg,pdf,caseSensitive=false])
...
Website Crawler: 2021-09-21 13:43:54 INFO -          Resolved: https://www.unops.org/sitemaps-2-section-projectAndLocationPages-1-sitemap.xml
Website Crawler: 2021-09-21 13:43:54 DEBUG - ACCEPTED document reference. Reference=https://www.unops.org/africa Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,jpeg,gif,png,ico,css,js,svg,pdf,caseSensitive=false]
Website Crawler: 2021-09-21 13:43:54 DEBUG - Queued for processing: https://www.unops.org/africa
Website Crawler: 2021-09-21 13:43:54 INFO - 65 start URLs identified.
Website Crawler: 2021-09-21 13:43:54 INFO - Website Crawler:           CRAWLER_STARTED

In addition, I've tried with a more complicated sitemap, where there is more than one image for some of the URLs (01-news-articles-sitemap.xml in the archive).

This is an example of a URL from this sitemap:

<url>
  <loc>https://www.unops.org/news-and-stories/insights/diseases-without-borders</loc>
  <lastmod>2019-04-24T14:47:16+02:00</lastmod>
  <changefreq>daily</changefreq>
  <priority>0.9</priority>
  <xhtml:link rel="alternate" hreflang="x-default" href="https://www.unops.org/news-and-stories/insights/diseases-without-borders" />
  <xhtml:link rel="alternate" hreflang="en" href="https://www.unops.org/news-and-stories/insights/diseases-without-borders" />
  <xhtml:link rel="alternate" hreflang="fr" href="https://www.unops.org/fr/news-and-stories/insights/diseases-without-borders" />
  <xhtml:link rel="alternate" hreflang="es" href="https://www.unops.org/es/news-and-stories/insights/diseases-without-borders" />
  <image:image>
    <image:loc>https://content.unops.org/photos/Cambodia-Elise-Laker-L1030104.jpg?mtime=20190415171225&amp;focal=none</image:loc>
    <image:title>Cambodia Elise Laker L1030104</image:title>
  </image:image>
  <image:image>
    <image:loc>https://content.unops.org/photos/News-and-Stories/Features/Cambodia-Elise-Laker-L1030104.jpg?mtime=20190409171744&amp;focal=none</image:loc>
    <image:title>Cambodia Elise Laker L1030104</image:title>
  </image:image>
</url>

In the log files (3-sitemap-newsAndStories.log and 4-sitemap-newsAndStories-and-an-url.log) there is only a mention of the second image. The URL (/news-and-stories/insights/diseases-without-borders) is not mentioned at all unless explicitly specified as a url element in the startURLs.

If I remove the ExtensionReferenceFilter, it just downloads the images from the image:loc tags (see 5-sitemap-projectAndLocations-with-images.log).

essiembre commented 3 years ago

I'd like to ask: how do you update the version? Do you simply extract the snapshot zip on top of your existing installation directory? I am starting to suspect you have old Jars lying around in your <install_dir>/lib folder. Can you check whether you have more than one Jar starting with norconex-collector-http in that lib folder? If so, make sure you get rid of the older ones. Please confirm.
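
Something like the following (using the CRAWLER_HOME path from your Dockerfile; adjust if your install lives elsewhere) will show whether duplicates are present:

# List the collector jars on the classpath; more than one
# norconex-collector-http-*.jar listed means an older version is still being picked up
# and should be deleted before restarting the crawler.
ls -1 /opt/collector-http/lib/norconex-collector-http-*.jar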

punkch commented 3 years ago

Hi Pascal,

Confirmed. The issue is resolved and URLs are properly extracted from sitemaps with images.

Indeed, my problem was outdated jars. I build the binaries from source with the Dockerfile I posted further above.

However, while initially testing with the latest snapshot download, I had added the COPY command below and forgot to remove it:

RUN  cd /tmp && \
  git clone https://github.com/Norconex/collector-http && \
  cd collector-http/ && \
  git checkout 2.x-branch && \
  cd norconex-collector-http/ && \
  mvn package -DskipTests && \
  mkdir -p /tmp/dist && \
  unzip target/norconex-collector-http-${COLLECTOR_VERSION}.zip -d /tmp/dist

COPY ./norconex-collector-http-2.9.1-SNAPSHOT /tmp/dist/norconex-collector-http-2.9.1-SNAPSHOT

This effectively overwrote the jars built from source with those from the outdated download.

Please accept my sincere apologies for wasting so much of your time and thank you so much for bearing with me.