Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

How to crawl sites with no file extensions on pages? #914

Closed svanschalkwyk closed 4 months ago

svanschalkwyk commented 6 months ago

Hi Pascal. I'm doing a POC for a client where every page is in a subdirectory and there is no filename per se. They also have a sitemap, but it's in a cutesy format from their SEO provider, and I cannot get the web crawler to follow the links. I am using this setup on V3:

<httpcollector id="Norconex Complex Collector"> 
  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core") 
  #set($urlNormalizer = "${http}.url.impl.GenericURLNormalizer") 
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter") 
  #set($filterRegexRef = "${core}.filter.impl.RegexReferenceFilter") 
  #set($committerClass = "com.norconex.committer.core3.fs.impl.XMLFileCommitter") 
  <crawlerDefaults>
    <referenceFilters>
      <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css</filter>
    </referenceFilters>
    <documentFilters>
      <filter class="$filterExtension" onMatch="include">xml,html,htm</filter>
    </documentFilters>  
    <urlNormalizer class="$urlNormalizer">
      <normalizations>
        removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
        decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters,
        removeDotSegments
      </normalizations>
    </urlNormalizer>
    <maxDepth>-1</maxDepth>
    <numThreads>4</numThreads>
    <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
  </crawlerDefaults>

  <crawlers>
    <crawler id="Remcam XXXXX">
      <startURLs
        stayOnDomain="true"
        includeSubdomains="true"
        stayOnPort="false"
        stayOnProtocol="false"
        async="true">
        <url>https://xxxxxlead.com/</url>
      </startURLs>
      <maxDepth>-1</maxDepth>
      <maxDocuments>-1</maxDocuments>
      <fetchHttpGet>OPTIONAL</fetchHttpGet>
      <robotsTxt ignore="false" class="StandardRobotsTxtProvider"/>
      <robotsMeta ignore="false" class="StandardRobotsMetaProvider" />
      <sitemapResolver ignore="false" lenient="true" class="GenericSitemapResolver">
        <path>/sitemap_index.xml</path>
      </sitemapResolver>
      <linkExtractors>
        <extractor class="HtmlLinkExtractor"  maxURLLength="2048" 
            ignoreNofollow="false" commentsEnabled="false" ignoreLinkData="false">
          <tags>
            <tag name="a" attribute="href" />
            <tag name="frame" attribute="src" />
            <tag name="iframe" attribute="src" />
            <tag name="img" attribute="src" />
            <tag name="meta" attribute="http-equiv" />
          </tags>
        </extractor>
      </linkExtractors>
      <committers>
        <committer class="$committerClass">
          <directory>./crawler_output</directory>
          <docsPerFile>1</docsPerFile>
          <compress>false</compress>
          <indent>4</indent>
        </committer>
      </committers>
    </crawler>
  </crawlers>
</httpcollector>

Anything you could recommend? Thanks S

essiembre commented 6 months ago

I can't see anything wrong at first glance. Do you have a sample sitemap file you can share (with the domain name obfuscated if you prefer)?

Something else you can try is to define the sitemap as a start URL:

  <startURLs ...>
    <sitemap>... here ...</sitemap>
  </startURLs>

The above with a <maxDepth> of zero should stick to the sitemap.
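For illustration, a fleshed-out sketch of that suggestion (the domain and sitemap path below are placeholders, and the element names follow the v3 XML configuration format as posted above, so adjust to your actual site):

```xml
<startURLs stayOnDomain="true" async="true">
  <!-- Placeholder URL: point this at the site's real sitemap location. -->
  <sitemap>https://example.com/sitemap_index.xml</sitemap>
</startURLs>
<!-- With zero depth, only documents listed in the sitemap get fetched;
     links discovered inside those pages are not followed. -->
<maxDepth>0</maxDepth>
```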

svanschalkwyk commented 6 months ago

Hi Pascal, thank you for the quick reply. I crawled it successfully with a Selenium proxy in Python, then noticed you already have that functionality. I'll add the web driver today and report back. Steph
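For readers following along: the built-in functionality referred to here is the WebDriver-based page fetcher. A minimal sketch of enabling it in a v3 crawler (class and element names as I recall them from the v3 documentation, so verify against the official reference before use):

```xml
<!-- Renders pages in a real browser so JavaScript-generated links
     are visible to the link extractors. Requires the matching
     browser driver to be available on the machine. -->
<httpFetchers>
  <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
    <browser>chrome</browser>
  </fetcher>
</httpFetchers>
```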


stale[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.