Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

How to crawl blog reader comments #482

Closed · carrud closed this issue 3 years ago

carrud commented 6 years ago

I successfully crawled blog content using the Norconex HTTP Collector, but after looking at the crawled files it seems not all the content got crawled, especially the reader comment section that is dynamically generated by JavaScript in an iframe. Could you give me a configuration example for this?

essiembre commented 6 years ago

The HTTP Collector does not interpret JavaScript. For this, you'll have to either write your own ILinkExtractor to extract dynamically invoked URLs, or use the PhantomJSDocumentFetcher which relies on PhantomJS to crawl JavaScript-driven websites.
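Roughly speaking, both options plug into the crawler section of the XML configuration. This is only a sketch: the CommentLinkExtractor class is hypothetical (something you would write yourself), and the PhantomJS path is a placeholder for your own install.

  <crawler id="...">

    <!-- Option 1: a custom ILinkExtractor implementation that knows how
         to find the dynamically invoked comment URLs. -->
    <linkExtractors>
      <extractor class="com.example.CommentLinkExtractor" />
    </linkExtractors>

    <!-- Option 2: render pages with PhantomJS instead of plain HTTP. -->
    <documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
      <exePath>/path/to/phantomjs</exePath>
    </documentFetcher>

  </crawler>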

carrud commented 6 years ago

I tried PhantomJSDocumentFetcher within the Norconex HTTP Collector, but the crawler only got the HTML content and a screenshot of the page, not the actual user comments in text format like I want. Any help? Or do I need additional configuration for PhantomJS?

essiembre commented 6 years ago

Are you giving it enough time to render? Maybe try increasing the renderWaitTime and resourceTimeout.
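Both settings go on the fetcher in your crawler config. For example, something along these lines (the values are arbitrary examples in milliseconds; adjust the executable path for your install):

  <documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
    <exePath>/path/to/phantomjs</exePath>
    <!-- Give the page more time to finish its JavaScript rendering. -->
    <renderWaitTime>15000</renderWaitTime>
    <!-- Fail resource requests that take too long instead of hanging. -->
    <resourceTimeout>30000</resourceTimeout>
  </documentFetcher>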

Are the comments always showing, or do you need to scroll down to have them show up? If you normally need to scroll down, you will likely have to modify the PhantomJS script that comes with the HTTP Collector to simulate scrolling (example here).

Maybe increasing your screenshot size could help as well.

If you can't resolve the issue, please consider sharing your URL.

carrud commented 6 years ago

Thanks for your response, sir.

I gave renderWaitTime plenty of time but saw no impact. The comments load by themselves after the main page has loaded, with no need to scroll down. Actually, I only need the article content and the user comments in text format, so I can omit the screenshot feature.

Here is my crawled content URL: https://www.washingtonpost.com/lifestyle/style/the-white-house-correspondents-dinner-doesnt-draw-stars-anymore-so-kathy-griffin-had-it-to-herself/2018/04/29/200356fa-4bba-11e8-af46-b1d6dc0d9bfe_story.html?utm_term=.53e6392b815a

What I want is to crawl the user comment section at the bottom of the page.

And this is my Norconex config:

<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <!-- Requires at least one start URL (or urlsFile).
           Optionally limit crawling to same protocol/domain/port as
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://www.washingtonpost.com/lifestyle/style/the-white-house-correspondents-dinner-doesnt-draw-stars-anymore-so-kathy-griffin-had-it-to-herself/2018/04/29/200356fa-4bba-11e8-af46-b1d6dc0d9bfe_story.html?utm_term=.53e6392b815a</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./examples-output/minimum</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>0</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="8000" />

      <documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
        <exePath>C:\phantomjs-2.1.1-windows\bin\phantomjs.exe</exePath>
        <renderWaitTime>12000</renderWaitTime>
        <referencePattern>^.*$</referencePattern>
      </documentFetcher>

      <!-- Document importing -->
      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer> 

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./examples-output/minimum/crawledFiles</directory>
      </committer>

    </crawler>
  </crawlers>
</httpcollector>
essiembre commented 6 years ago

After looking into it, it turns out the comments are loaded only when displayed (i.e., on scrolling to them). This is why you do not get them with PhantomJS unless you simulate scrolling, as previously mentioned.

The site also makes it difficult to crawl by forcing redirects to happen. PhantomJS can sometimes have difficulties with redirects.

If you can live without the comments, I was able to get the content with the default document fetcher, which is also much faster.

You may also want to disable canonical link detection with <canonicalLinkDetector ignore="true" /> if that causes you trouble.
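If you go that route, the relevant changes are small. A sketch only, keeping the rest of your config as-is (leaving out the documentFetcher element entirely also falls back to the default fetcher):

  <crawler id="...">

    <!-- Default fetcher: plain HTTP, no JavaScript rendering, much faster. -->
    <documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher" />

    <!-- Skip canonical link handling if it causes unwanted redirects. -->
    <canonicalLinkDetector ignore="true" />

  </crawler>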

You could also extend the GenericDocumentFetcher to fetch the comments for a page as well, by invoking the REST API the site uses via AJAX to pull them in, if you can figure out how to do so. That's likely no trivial task though.

wolverline commented 6 years ago

I tried to crawl pages with infinite scroll and/or pagination (thanks to mobile, this is a growing trend) with PhantomJS, but it doesn't seem to work very well. Writing the PhantomJS script itself is a big pain (no wonder the project has been suspended). If you look at the following URL, the comments come from a separate content server through a REST API, and I am sure a JS framework renders them. If you're able to access that API, I'd recommend it.

https://comments-api.ext.nile.works/v1/search?q=((childrenof%3A+https%3A%2F%2Fwww.washingtonpost.com%2Flifestyle%2Fstyle%2Fthe-white-house-correspondents-dinner-doesnt-draw-stars-anymore-so-kathy-griffin-had-it-to-herself%2F2018%2F04%2F29%2F200356fa-4bba-11e8-af46-b1d6dc0d9bfe_story.html+source%3Awashpost.com+(((state%3AUntouched++AND+user.state%3AModeratorApproved)+OR+(state%3AModeratorApproved++AND+user.state%3AModeratorApproved%2CUntouched)+OR+(state%3ACommunityFlagged%2CModeratorDeleted+AND+user.state%3AModeratorApproved)+)+)+++))+itemsPerPage%3A+15+sortOrder%3AreverseChronological+safeHTML%3Aaggressive+children%3A+2+childrenSortOrder%3Achronological+childrenItemsPerPage%3A3++(((state%3AUntouched++AND+user.state%3AModeratorApproved)+OR+(state%3AModeratorApproved++AND+user.state%3AModeratorApproved%2CUntouched)+OR+(state%3ACommunityFlagged%2CModeratorDeleted+AND+user.state%3AModeratorApproved)+)+)++pageAfter%3A%221525044545.417%22&appkey=prod.washpost.com

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.