Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

HTTP collector and Solr. #346

Closed. or-dos closed this issue 7 years ago.

or-dos commented 7 years ago

Hi, can you help me? I am trying to run the minimum example and get no errors, but no data appears in the Solr core.

:/opt/norconex-col$ ./collector-http.sh -a start -c examples/minimum/minimum-config.xml
INFO [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./examples-output/minimum/logs; progressDir=./examples-output/minimum/progress
INFO [JobSuite] JEF work directory is: ./examples-output/minimum/progress
INFO [JobSuite] JEF log manager is : FileLogManager
INFO [JobSuite] JEF job status store is : FileJobStatusStore
INFO [AbstractCollector] Suite of 1 crawler jobs created.
INFO [JobSuite] Initialization...
INFO [JobSuite] Previous execution detected.
INFO [JobSuite] Backing up previous execution status and log files.
INFO [JobSuite] Starting execution.
INFO [AbstractCollector] Version: Norconex HTTP Collector 2.7.0 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Collector Core 1.8.0 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Importer 2.7.0 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Committer Core 2.1.0 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Committer Solr 2.3.0 (Norconex Inc.)
INFO [JobSuite] Running Norconex Minimum Test Page: BEGIN (Thu May 11 10:15:54 EEST 2017)
INFO [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO [HttpCrawler] Norconex Minimum Test Page: Sitemap support: false
INFO [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO [HttpCrawler] Norconex Minimum Test Page: User-Agent:
INFO [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
INFO [HttpCrawler] 1 start URLs identified.
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://www.norconex.com/product/collector-http-test/minimum.php
INFO [CrawlerEventManager] CREATED_ROBOTS_META: https://www.norconex.com/product/collector-http-test/minimum.php
INFO [CrawlerEventManager] URLS_EXTRACTED: https://www.norconex.com/product/collector-http-test/minimum.php
INFO [CrawlerEventManager] DOCUMENT_IMPORTED: https://www.norconex.com/product/collector-http-test/minimum.php
INFO [CrawlerEventManager] REJECTED_UNMODIFIED: https://www.norconex.com/product/collector-http-test/minimum.php
INFO [AbstractCrawler] Norconex Minimum Test Page: Reprocessing any cached/orphan references...
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO [AbstractCrawler] Norconex Minimum Test Page: 1 reference(s) processed.
INFO [CrawlerEventManager] CRAWLER_FINISHED
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler completed.
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 5 seconds.
INFO [SitemapStore] Norconex Minimum Test Page: Closing sitemap store...
INFO [JobSuite] Running Norconex Minimum Test Page: END (Thu May 11 10:15:54 EEST 2017)

and here is part of my minimum-config:

      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer>
      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8983/solr/new_core</solrURL>
        <sourceReferenceField keep="false">document.reference</sourceReferenceField>
        <targetReferenceField>id</targetReferenceField>
        <targetContentField>text</targetContentField>
        <commitBatchSize>10</commitBatchSize>
        <queueDir>/optional/queue/path/</queueDir>
        <queueSize>100</queueSize>
        <maxRetries>2</maxRetries>
        <maxRetryWait>5000</maxRetryWait>
      </committer>
essiembre commented 7 years ago

The REJECTED_UNMODIFIED log entry means you previously ran the collector and there was no change since last time. When no changes are detected, it does not try to resend the data. If you want it to start again from scratch, you have to get rid of its internal cache. You can do so by deleting your working directory (where you find folders such as "crawlstore" and "progress").
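For example, clearing the cache could look like this (the exact folder locations below are an assumption based on the logsDir/progressDir paths shown in your log; adjust them to your actual working directory):

      # Remove the crawler's internal cache so the next run starts from scratch.
      # Paths assume the minimum example's output layout under ./examples-output/minimum.
      rm -rf ./examples-output/minimum/crawlstore ./examples-output/minimum/progress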

You can also disable the checksum creation step; the crawler will then not know whether a document was modified and will process all documents as if they were new. Add the following under your <crawler ..> section:

      <metadataChecksummer disabled="true" />
      <documentChecksummer disabled="true" />    
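For placement, here is a minimal sketch of a crawler section with those two lines added; the crawler id is taken from the log output above, and the comment stands in for whatever your minimum-config already defines there:

      <crawler id="Norconex Minimum Test Page">
        <!-- existing crawler settings (startURLs, etc.) stay as they are -->

        <!-- Disable change detection so every document is treated as new on each run. -->
        <metadataChecksummer disabled="true" />
        <documentChecksummer disabled="true" />
      </crawler>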
or-dos commented 7 years ago

Thank you very much, it works, but the collector only runs normally with sudo.

With sudo:

sudo ./collector-http.sh -a start -c examples/minimum/minimum-config.xml
INFO  [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./examples-output/minimum/logs; progressDir=./examples-output/minimum/progress
INFO  [JobSuite] JEF work directory is: ./examples-output/minimum/progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] Previous execution detected.
INFO  [JobSuite] Backing up previous execution status and log files.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.7.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.8.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.7.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Solr 2.3.0 (Norconex Inc.)
INFO  [JobSuite] Running Norconex Minimum Test Page: BEGIN (Fri May 12 16:31:08 EEST 2017)
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Sitemap support: false
INFO  [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: User-Agent: <None specified>
INFO  [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO  [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://www.norconex.com/product/collector-http-test/minimum.php
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://www.norconex.com/product/collector-http-test/minimum.php
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://www.norconex.com/product/collector-http-test/minimum.php
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: https://www.norconex.com/product/collector-http-test/minimum.php
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: https://www.norconex.com/product/collector-http-test/minimum.php
INFO  [AbstractCrawler] Norconex Minimum Test Page: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO  [AbstractFileQueueCommitter] Committing 4 files
INFO  [SolrCommitter] Sending 4 documents to Solr for update/deletion.
INFO  [SolrCommitter] Done sending documents to Solr for update/deletion.
INFO  [AbstractCrawler] Norconex Minimum Test Page: 1 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler completed.
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 5 seconds.
INFO  [SitemapStore] Norconex Minimum Test Page: Closing sitemap store...
INFO  [JobSuite] Running Norconex Minimum Test Page: END (Fri May 12 16:31:08 EEST 2017)

and without sudo:

./collector-http.sh -a start -c examples/minimum/minimum-config.xml
INFO  [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./examples-output/minimum/logs; progressDir=./examples-output/minimum/progress
INFO  [JobSuite] JEF work directory is: ./examples-output/minimum/progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] Previous execution detected.
INFO  [JobSuite] Backing up previous execution status and log files.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.7.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.8.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.7.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Solr 2.3.0 (Norconex Inc.)
INFO  [JobSuite] Running Norconex Minimum Test Page: BEGIN (Fri May 12 16:31:35 EEST 2017)
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Sitemap support: false
INFO  [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: User-Agent: <None specified>
INFO  [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO  [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://www.norconex.com/product/collector-http-test/minimum.php
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://www.norconex.com/product/collector-http-test/minimum.php
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://www.norconex.com/product/collector-http-test/minimum.php
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: https://www.norconex.com/product/collector-http-test/minimum.php
INFO  [CrawlerEventManager]            REJECTED_ERROR: https://www.norconex.com/product/collector-http-test/minimum.php (com.norconex.committer.core.CommitterException: Cannot create commit directory: /optional/queue/path/2017/05-12/04/31/38)
ERROR [AbstractCrawler] Norconex Minimum Test Page: Could not process document: https://www.norconex.com/product/collector-http-test/minimum.php (Cannot create commit directory: /optional/queue/path/2017/05-12/04/31/38)
com.norconex.committer.core.CommitterException: Cannot create commit directory: /optional/queue/path/2017/05-12/04/31/38
    at com.norconex.committer.core.impl.FileSystemCommitter.createFile(FileSystemCommitter.java:161)
    at com.norconex.committer.core.impl.FileSystemCommitter.add(FileSystemCommitter.java:94)
    at com.norconex.committer.core.AbstractFileQueueCommitter.queueAddition(AbstractFileQueueCommitter.java:143)
    at com.norconex.committer.core.AbstractCommitter.add(AbstractCommitter.java:96)
    at com.norconex.collector.core.pipeline.committer.CommitModuleStage.execute(CommitModuleStage.java:34)
    at com.norconex.collector.core.pipeline.committer.CommitModuleStage.execute(CommitModuleStage.java:27)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeCommitterPipeline(HttpCrawler.java:377)
    at com.norconex.collector.core.crawler.AbstractCrawler.processImportResponse(AbstractCrawler.java:564)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:521)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:404)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:786)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Unable to create directory /optional/queue/path/2017/05-12/04/31/38
    at org.apache.commons.io.FileUtils.forceMkdir(FileUtils.java:2472)
    at com.norconex.committer.core.impl.FileSystemCommitter.createFile(FileSystemCommitter.java:159)
    ... 14 more
INFO  [AbstractCrawler] Norconex Minimum Test Page: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO  [AbstractCrawler] Norconex Minimum Test Page: 1 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler completed.
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 3 seconds.
INFO  [SitemapStore] Norconex Minimum Test Page: Closing sitemap store...
INFO  [JobSuite] Running Norconex Minimum Test Page: END (Fri May 12 16:31:35 EEST 2017)
essiembre commented 7 years ago

"Cannot create commit directory" has all the indications of a permission problem. Make sure the directories specified in your configuration have the right privileges for your user.
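For what it is worth, the failing path in the stack trace (/optional/queue/path/...) matches the queueDir value in the posted configuration. That looks like a sample placeholder path at the filesystem root, which a regular user normally cannot create (hence it working under sudo). One possible fix, assuming you do not need that exact location, is to point queueDir at a directory your user can write to, for example:

      <committer class="com.norconex.committer.solr.SolrCommitter">
        <!-- other committer settings unchanged -->
        <!-- "./examples-output/minimum/committer-queue" is only an example of a
             user-writable location; any directory your user can write to works. -->
        <queueDir>./examples-output/minimum/committer-queue</queueDir>
      </committer>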

or-dos commented 7 years ago

You are great! Now everything is working perfectly. Thank you very much for your help!