Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Text from PDF, DOC, etc files #55

Closed: csaezl closed 9 years ago

csaezl commented 9 years ago

Since it is not unusual for such files to lack a title, author, subject, etc., I'm wondering whether there is a way to capture (say) the first 100 characters or so of the document, since the first few lines could contain a title or a subject. If so, I'd need to pass that text to a Solr field (not for indexing). It would apply to every kind of text file the crawler can handle. Thanks, Carlos

essiembre commented 9 years ago

Most docs will have a title in one way or another. PDF and DOC files have a title that can be set in their properties, and that title is normally picked up. The problem is that most authors don't bother setting one, so the property is empty. :-)

There is no out-of-the-box handler to grab the first X characters from the document as a title, but there is a more generic one that allows you to do the equivalent. The following will do what you are after:

<tagger class="com.norconex.importer.handler.transformer.impl.TextBetweenTagger" inclusive="true">
  <textBetween name="title">
    <start>^</start>
    <end>.{0,100}</end>
  </textBetween>
</tagger>

That should go in your importer configuration section, under the postParseHandlers tag.

csaezl commented 9 years ago

Since some documents can have their "title" set, I need to put the captured text in a new field, for example "title_calc". So is it valid to use <textBetween name="title_calc">?

essiembre commented 9 years ago

Sure, it can be any name you like, the HTTP Collector does not care. In your case, you just have to make sure your Solr setup will accept it.
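
On the Solr side, "accepting" the field usually means declaring it in the schema. A minimal sketch, assuming a classic schema.xml and assuming a field type named "string" exists in it (as in Solr's example schema; yours may differ). Since the text is only meant to be returned with results and not searched, the field is stored but not indexed:

```xml
<!-- Assumption: placed alongside the other <field> declarations in schema.xml.
     "string" is the field type name from Solr's example schema. -->
<field name="title_calc" type="string" indexed="false" stored="true" />
```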

csaezl commented 9 years ago

Thanks, Carlos

csaezl commented 9 years ago

I'm reopening this issue because I got an error while testing the importer configuration. The HTTP Collector was processing a web page. This is the importer section of the config file:

      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.transformer.impl.TextBetweenTagger" inclusive="true" >
              <textBetween name="title_calc">
                <start>^</start>
                <end>.{0,500}</end>
              </textBetween>
          </tagger>
        </postParseHandlers>
      </importer> 

This is an excerpt from the execution:

INFO  [AbstractCrawler] Norconex Minimum-2 Test Page: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://www.norconex.com/product/collector-http-test/minimum.php (Subject: com.norconex.collector.http.fetch.impl.GenericDocumentFetcher@e6e2f2)
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://www.norconex.com/product/collector-http-test/minimum.php (Subject: none)
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://www.norconex.com/product/collector-http-test/minimum.php (Subject: [http://www.norconex.com/collectors/img/collector-http.png, http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png])
ERROR [Importer] Unsupported Import Handler: null
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: http://www.norconex.com/product/collector-http-test/minimum.php (Subject: com.norconex.importer.response.ImporterResponse@c4f3de)
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: http://www.norconex.com/product/collector-http-test/minimum.php (Subject: SolrCommitter [solrURL=http://localhost:8983/solr/collection2, updateUrlParams={}, solrServerFactory=DefaultSolrServerFactory [server=null], com.norconex.committer.solr.SolrCommitter@2d4185[queueSize=100,docCount=542,queue=com.norconex.committer.core.impl.FileSystemCommitter@18b9645[directory=/optional/queue/path/],commitBatchSize=10,maxRetries=2,maxRetryWait=5000,operations=[],docCount=0,targetReferenceField=id,sourceReferenceField=document.reference,keepSourceReferenceField=false,targetContentField=content,sourceContentField=<null>,keepSourceContentField=false]])

In the same run I also got a different error concerning the commit to Solr, but I'll post that in a new issue. Carlos

essiembre commented 9 years ago

Oops, my mistake this time: replace "transformer" with "tagger" in the class name, giving this:

<tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger" inclusive="true" >
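
For reference, the complete handler with the corrected package would then read as follows. This is only the earlier snippet with the class name changed, using the title_calc field name and the 500-character limit from the config posted above:

```xml
<importer>
  <postParseHandlers>
    <!-- "tagger" package instead of "transformer" -->
    <tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger" inclusive="true">
      <textBetween name="title_calc">
        <start>^</start>
        <end>.{0,500}</end>
      </textBetween>
    </tagger>
  </postParseHandlers>
</importer>
```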

csaezl commented 9 years ago

I still get an error:

INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png (Subject: com.norconex.collector.http.fetch.impl.GenericDocumentFetcher@1110a48)
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png (Subject: none)
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png (Subject: [])
INFO  [CrawlerEventManager]           REJECTED_IMPORT: http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png (Subject: com.norconex.importer.response.ImporterResponse@14fbb14)
ERROR [AbstractCrawler] Norconex Minimum-2 Test Page: Could not flag URL for deletion: http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png (null)
java.lang.NullPointerException
        at com.norconex.collector.core.crawler.AbstractCrawler.finalizeDocumentProcessing(AbstractCrawler.java:540)
        at com.norconex.collector.core.crawler.AbstractCrawler.processImportResponse(AbstractCrawler.java:521)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:483)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:375)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:628)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
INFO  [AbstractCrawler] Norconex Minimum-2 Test Page: 100% completed (2 processed/2 total)

essiembre commented 9 years ago

Looks like that error happened because you did not clear your output files after the previous run, the one with the other error. It likely attempted an incremental run on top of one that was bad to begin with. Clear all created directories and try again to ensure you start fresh.

If you are using the sample config, you should only have to delete ./examples-output/.

I will nonetheless look into fixing or better reporting the error when this specific situation happens.

csaezl commented 9 years ago

The sample run processes 3 documents: one PHP page and two PNG images. REJECTED_IMPORT occurs for all of them and no DOCUMENT_COMMITTED_ADD appears. Is that right? I need your importer configuration to act on documents (PDF, DOC, etc.) to extract some text. Does it reject other types of objects? Here is an excerpt for the PHP document.

INFO  [JobSuite] JEF work directory is: .\examples-output\minimum2\progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] No previous execution detected.
INFO  [JobSuite] Starting execution.
INFO  [JobSuite] Running Norconex Minimum-2 Test Page: BEGIN (Thu Feb 19 11:41:27 CET 2015)
INFO  [MapDBCrawlDataStore] Initializing reference store .\examples-output\minimum2/crawlstore/mapdb/Norconex_32_Minimum-2_32_Test_32_Page/
INFO  [MapDBCrawlDataStore] .\examples-output\minimum2/crawlstore/mapdb/Norconex_32_Minimum-2_32_Test_32_Page/: Done initializing databases.
INFO  [HttpCrawler] Norconex Minimum-2 Test Page: RobotsTxt support: true
INFO  [HttpCrawler] Norconex Minimum-2 Test Page: RobotsMeta support: true
INFO  [HttpCrawler] Norconex Minimum-2 Test Page: Sitemap support: true
INFO  [SitemapStore] Norconex Minimum-2 Test Page: Initializing sitemap store...
INFO  [SitemapStore] Norconex Minimum-2 Test Page: Done initializing sitemap store.
INFO  [CrawlerEventManager]           CRAWLER_STARTED (Subject: com.norconex.collector.http.crawler.HttpCrawler@e3f686)
INFO  [AbstractCrawler] Norconex Minimum-2 Test Page: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://www.norconex.com/product/collector-http-test/minimum.php (Subject: com.norconex.collector.http.fetch.impl.GenericDocumentFetcher@f3ad6a)
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://www.norconex.com/product/collector-http-test/minimum.php (Subject: none)
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://www.norconex.com/product/collector-http-test/minimum.php (Subject: [http://www.norconex.com/collectors/img/collector-http.png, http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png])
INFO  [CrawlerEventManager]           REJECTED_IMPORT: http://www.norconex.com/product/collector-http-test/minimum.php (Subject: com.norconex.importer.response.ImporterResponse@e23e3)

essiembre commented 9 years ago

Please attach your full config. There is always a reason for a page to be rejected; nothing is rejected by default. Maybe you have a conflicting filter or something similar. Also, locate the /classes/log4j.properties file and change the log level to DEBUG. You may get better explanations in the logs by doing so.

This collector supports many file types. You can find the list here.

csaezl commented 9 years ago

This is the config file:

<?xml version="1.0" encoding="UTF-8"?>
<!-- 
   Copyright 2010-2014 Norconex Inc.
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at
       http://www.apache.org/licenses/LICENSE-2.0
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and minimum recommended to 
     run a crawler.  
     -->
<httpcollector id="Minimum-2 Config HTTP Collector">
  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum2/progress</progressDir>
  <logsDir>./examples-output/minimum2/logs</logsDir>
  <crawlers>
    <crawler id="Norconex Minimum-2 Test Page">
      <!-- === Minimum required: =========================================== -->
      <!-- Requires at least one start URL. -->
      <startURLs>
        <url>http://www.norconex.com/product/collector-http-test/minimum.php</url>
      </startURLs>
      <!-- === Minimum recommended: ======================================== -->
      <!-- Where the crawler default directory to generate files is. -->
      <workDir>./examples-output/minimum2</workDir>
      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>10</maxDepth>
      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <!-- At a minimum make sure you stay on your domain. -->
      <referenceFilters>
        <filter 
            class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include" >
          http://www\.norconex\.com/.*
        </filter>
      </referenceFilters>

      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger" inclusive="true" >
              <textBetween name="title_calc">
                <start>^</start>
                <end>.{0,500}</end>
              </textBetween>
          </tagger>
        </postParseHandlers>
      </importer> 

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8983/solr/collection1</solrURL>
        <sourceReferenceField keep="false">document.reference</sourceReferenceField>
        <targetReferenceField>id</targetReferenceField>
        <targetContentField>content</targetContentField>
        <commitBatchSize>10</commitBatchSize>
        <queueDir>/optional/queue/path/</queueDir>
        <queueSize>100</queueSize>
        <maxRetries>2</maxRetries>
        <maxRetryWait>5000</maxRetryWait>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>

This is the log for one of the files:

INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png (Subject: com.norconex.collector.http.fetch.impl.GenericDocumentFetcher@142f3be)
DEBUG [CachedInputStream] Creating memory cache from cached stream.
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png (Subject: none)
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png (Subject: [])
DEBUG [CachedInputStream] Creating new input stream from memory cache.
DEBUG [CachedInputStream] Creating memory cache from cached stream.
INFO  [CrawlerEventManager]           REJECTED_IMPORT: http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png (Subject: com.norconex.importer.response.ImporterResponse@15bf12)
DEBUG [FileJobStatusStore] Writing status file: C:\Norconex-collector-http-2.0.2\.\examples-output\minimum2\progress\latest\status\Norconex_32_Minimum-2_32_Test_32_Page__Norconex_32_Minimum-2_32_Test_32_Page.job
DEBUG [FileJobStatusStore] Writing status file: C:\Norconex-collector-http-2.0.2\.\examples-output\minimum2\progress\latest\status\Norconex_32_Minimum-2_32_Test_32_Page__Norconex_32_Minimum-2_32_Test_32_Page.job
DEBUG [FileJobStatusStore] Writing status file: C:\Norconex-collector-http-2.0.2\.\examples-output\minimum2\progress\latest\status\Norconex_32_Minimum-2_32_Test_32_Page__Norconex_32_Minimum-2_32_Test_32_Page.job
INFO  [AbstractCrawler] Norconex Minimum-2 Test Page: 100% completed (2 processed/2 total)

csaezl commented 9 years ago

Is there any solution?

essiembre commented 9 years ago

Turns out it is a bug where any subclass of AbstractCharStreamTagger with the "charset" specified in its metadata will fail. This is an easy fix that will be in the next release.

essiembre commented 9 years ago

The fix is available now in a new snapshot release.

Please give it a try.

To install it, download 2.1.0-SNAPSHOT and copy its lib directory over the lib directory in your collector installation. Then review the jars in the target directory and remove any duplicates you find (delete or archive the older jar versions).

essiembre commented 9 years ago

Norconex HTTP Collector 2.1.0 was released. Closing.