Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Using extract="outerHtml]" in a domTagger's fields leads to no error!! #262

Closed liar666 closed 8 years ago

liar666 commented 8 years ago

I just ran the following simple crawler:

<?xml version="1.0" encoding="UTF-8"?>

<!-- Testing crawler for Attribute extraction -->

<httpcollector id="Testattribute">

  <!-- Decide where to store generated files. -->
  <progressDir>./tests-output/testattribute/progress</progressDir>
  <logsDir>./tests-output/testattribute/logs</logsDir>

  <crawlers>
    <crawler id="Testattribute">

<!--
      <robotsTxt ignore="true" />
      <userAgent>Please identify your Crawler</userAgent>
      <numThreads>1</numThreads>
      <keepDownloads>false</keepDownloads>
        <preParseHandlers>
          <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
                  onMatch="include" field="document.reference" >
            http://avax\.news/fact/The_Day_in_Photos_June_16_2016\.html
          </filter>
        </preParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger"
                  caseSensitive="true"
                  sourceCharset="UTF-8">
            <pattern field="isrc" group="1">
              &lt;img *.*src="([^"]+)".*&gt;
            </pattern>
          </tagger>
-->

      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://avax.news/fact/The_Day_in_Photos_June_16_2016.html</url>
      </startURLs>

      <workDir>./tests-output/testattribute</workDir>

      <maxDepth>0</maxDepth>  <!-- TODO: Set to 2??? -->

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <!-- Since 2.3.0: -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
          http://avax.news/fact/The_Day_in_Photos_June_16_2016.html
        </filter>
      </referenceFilters>

      <!-- Document extraction/manipulation -->
      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>document.reference</fields>
          </tagger>
          <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
            <dom selector="a>img" toField="IMAGE"
                 overwrite="true"
                 extract="outerHtml" />
          </tagger>
        </postParseHandlers>
      </importer>

      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./tests-output/testattribute/crawledFiles</directory>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>

Output was:

INFO  [SitemapStore] Testattribute: Initializing sitemap store...
INFO  [SitemapStore] Testattribute: Done initializing sitemap store.
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Testattribute: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://avax.news/fact/The_Day_in_Photos_June_16_2016.html
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://avax.news/fact/The_Day_in_Photos_June_16_2016.html
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=50+first+dates+720p
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=bones+s08e02
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/fact/The_Day_in_Photos_June_16_2016.html
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=graphics
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=white+tiger
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=child+44+2015+720p
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=phtographers+waves
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=SLAVERY%2BPLANTATION
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/sad
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/educative
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=chillar+party+movie
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/touching
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=anal%2Bcollection
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/admin/news/11630/edit
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/ohlala/disclaimer
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/assets/logo-d9927e7b6d62c2ea2a394f6a13eddbde.png
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/wow
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/about_us.html
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/funny
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=feria+de+abril
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=hard+west
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=need+for+speed+most+wanted
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/tags_cloud
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=parade+
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=victoria+secret
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/disgusting
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/charming
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=jack+nicholson
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=child%2Bfuck
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/fact/The_Day_in_Photos_June_16_2016.html#disqus_thread
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=cargo
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=WW+I
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=beautiful+babes
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/tags/Pictures%20of%20Recent%20Events
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=snue
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=star+wars+revenge
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=lovers+and+friends
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=Baliem+Valley+Festival
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/users/sign_in
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=friends+9
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/search?q=rhythmic+gymnastic
INFO  [CrawlerEventManager]         REJECTED_TOO_DEEP: http://avax.news/fact
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://avax.news/fact/The_Day_in_Photos_June_16_2016.html
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: http://avax.news/fact/The_Day_in_Photos_June_16_2016.html
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: http://avax.news/fact/The_Day_in_Photos_June_16_2016.html
INFO  [AbstractCrawler] Testattribute: Re-processing orphan references (if any)...
INFO  [AbstractCrawler] Testattribute: Reprocessed 0 orphan references...
INFO  [AbstractCrawler] Testattribute: Crawler finishing: committing documents.
INFO  [AbstractCrawler] Testattribute: 1 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] Testattribute: Crawler completed.
INFO  [AbstractCrawler] Testattribute: Crawler executed in 2 seconds.
INFO  [JobSuite] Running Testattribute: END (Fri Jun 17 17:12:34 CEST 2016)

There is no error in the logs, even though the value of the extract attribute is invalid!

By the way, even after correcting this mistake, the crawler does not seem to work: crawledFiles/xxx/xxx.cntnt contains the whole page as text instead of an IMAGE tag (xxx.meta and xxx.ref look fine and contain the document reference). Do you have a hint about where I went wrong?

essiembre commented 8 years ago

About the content, taggers do not modify the content, just the fields/metadata. Looking at your config, the HTML will be parsed normally and it is expected for you to get the content. "Transformers" will modify the content if that's what you want.

About the image tag you do not have... it is also normal since you have your DOM tagger as a POST-parse handler. Once parsing has occurred, the original document is gone and you are left with plain text (it is no longer an HTML document).

To work with a DOMTagger you have to follow the advice in its documentation: "Should be used as a pre-parse handler." Give that a try, but make sure you do not put your KeepOnlyTagger after it, or that will get rid of your IMAGE field.

Also, since the config file is XML, in case this makes a difference I would suggest you escape your angle bracket: selector="a&gt;img".
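As a rough sketch of the two suggestions above (untested, and assuming the rest of the crawler config stays as posted), the importer section could look like this. It reuses only the class name and attributes from the original config; the KeepOnlyTagger is left out here for simplicity, and per the advice above it should not come after the DOMTagger if it is added back.

```xml
<importer>
  <!-- Sketch only: DOMTagger moved from postParseHandlers to preParseHandlers,
       so it runs while the document is still HTML. -->
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <!-- The angle bracket in the selector is escaped for XML. -->
      <dom selector="a&gt;img" toField="IMAGE"
           overwrite="true"
           extract="outerHtml" />
    </tagger>
  </preParseHandlers>
</importer>
```

With the FileSystemCommitter, the IMAGE field would then be expected in the .meta file rather than in the .cntnt file, since taggers only touch fields/metadata.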

The metadata/fields are stored in a .meta file when you use the FileSystemCommitter. Are you using the FileSystemCommitter just to troubleshoot for now? Because once you are satisfied with your crawler config, it is highly recommended you use your own committer instead (to avoid needing a separate process that reads those files).

essiembre commented 8 years ago

Do you still have questions/issues related to this ticket or can we close?

FYI, a new snapshot release of HTTP Collector was made with an updated Importer module in it. It gives DOMTagger more options.

liar666 commented 8 years ago

Hi,

Sorry I had to switch to another task and I'm coming back to this one only now.

Thanks for the explanations. It appears that I still lack a lot of understanding of the inner workings of the tool (particularly the pre/post handlers, when HTML vs. text is handled, and the differences between Taggers and Transformers...). That's what comes with learning while doing... Anyway, I quite like the way Norconex's collector is designed (particularly compared to Heritrix): it seems to rely on the composition of simple elements, which is more in line with the KISS/Unix way of designing and using tools, and allows more flexibility and modularity while keeping config files small and human-readable...

Yes, I'm using FileSystemCommitter for the moment to get to grips with the tool, but I'll of course implement a specific committer when I need to put our crawlers in production.

essiembre commented 8 years ago

Thanks for the good feedback! While sometimes difficult to apply, KISS/flexibility/modularity are indeed very important design drivers to us. I am glad you recognise that.