Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

collector.referrer-link-text field not filled #56

Closed MirtoBusico closed 9 years ago

MirtoBusico commented 9 years ago

Hi, I'm trying to gather information about links: the text near the anchor. I'm using norconex-collector-http-2.0.2.zip with openjdk-7.

I have this definition:

<linkExtractors>
    <extractor class="${linkExtractor}"  maxURLLength="2048" 
            ignoreNofollow="false" keepReferrerData="true">
        <contentTypes>
            text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
        </contentTypes>
        <tags>
            <tag name="a" attribute="href" />
            <tag name="frame" attribute="src" />
            <tag name="iframe" attribute="src" />
            <tag name="img" attribute="src" />
            <tag name="meta" attribute="http-equiv" />
        </tags>
    </extractor>
</linkExtractors>

But in the Solr repository I find only these fields filled:

collector.referrer-link-tag collector.referrer-reference

What am I doing wrong?

essiembre commented 9 years ago

By design, collector.referrer-link-text is only extracted from <a href="..."></a> tags. The rationale is that the other tags are unlikely to have valuable body text: they would have either none, or only non-word HTML. Do you see other tags that could surround good text? If so, we can turn this issue into a feature request.

As for the title, you should get it already when present. If you confirm you have some href tags with a title attribute and it does not get picked up, I would consider it a bug and look into it.

MirtoBusico commented 9 years ago

Well, I'm trying to learn how to use the collector. To start, I'm analyzing the URL http://www.cirf.org/italian/home.html

If I understood correctly, I should be able to get "L'associazione" in collector.referrer-link-text for this source:

<ul id="hmenu" class="hmenu">
    <li>
      <a href="/italian/menu1/cirf/">
        L'associazione
    </a>
    <ul style="display: block;"></ul>
</li>

But it seems I'm not able to extract the text of this link.

If it is useful, I can copy the full .xml here.

Thanks for your time.
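
For illustration, the expected result for the markup quoted above can be sketched in a few lines of Python using only the standard library. This is not the collector's implementation: the class and function names are hypothetical, and the sketch simply assumes that "link text" means the whitespace-normalized text between `<a href="...">` and `</a>`, with other tags such as `img` ignored.

```python
from html.parser import HTMLParser


class AnchorTextExtractor(HTMLParser):
    """Collects (href, text) pairs for <a href="..."> tags only."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._chunks = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        # Only keep text while we are inside an <a href> element.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace and trim both ends.
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._href = None


def extract_anchor_texts(html_source):
    parser = AnchorTextExtractor()
    parser.feed(html_source)
    parser.close()
    return parser.links


SAMPLE = """
<ul id="hmenu" class="hmenu">
  <li>
    <a href="/italian/menu1/cirf/">
      L'associazione
    </a>
  </li>
</ul>
<img src="/img/logo.png">
"""

print(extract_anchor_texts(SAMPLE))
# [('/italian/menu1/cirf/', "L'associazione")]
```

The `img` tag contributes nothing here, which matches the idea that only anchors carry useful body text.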

essiembre commented 9 years ago

I can reproduce the problem. You should be getting the text. I am marking this one as a bug.

essiembre commented 9 years ago

Fixed in "develop" branch. Let me know if you need a snapshot release.

essiembre commented 9 years ago

The fix is available now in a new snapshot release.

Please give it a try.

To install it, download 2.1.0-SNAPSHOT (http://www.norconex.com/collectors/importer/download) and copy its lib directory over the lib directory of your collector installation. Then review the JARs in the target directory and remove or archive any older duplicate versions.

MirtoBusico commented 9 years ago

I'll try asap

MirtoBusico commented 9 years ago

Maybe I'm doing something wrong. I installed the snapshot and directed the output to the filesystem. In the meta file for "L'associazione" I find only:

   collector.referrer-reference=http\://www.cirf.org/italian/home.html
   collector.referrer-link-tag=a.href

Here is the complete meta file:

#
#Tue Feb 24 19:28:27 CET 2015
document.contentType=text/html
collector.referenced-urls=http\://www.restorerivers.eu/^|~http\://www.youtube.com/user/CIRFcomunicazione^|~http\://www.facebook.com/cirf.org^|~http\://www.cirf.org/italian/menu1/attivita/pubblicazioni.html^|~http\://www.cirf.org/italian/menu1/comeiscriversi/perche.html^|~http\://www.cirf.org/italian/menu2/documentazione/Circolari.html^|~http\://www.cirf.org/italian/menu2/Comevalutastatoecologico/^|~http\://www.cirf.org/italian/menu1/attivita/news/ecrr2013feedback.html^|~http\://www.cirf.org/italian/menu2/coseRF/^|~http\://www.cirf.org/images/logo-Convegno.jpg^|~http\://www.cirf.org/img/logo_ecrr_generico.png^|~http\://www.cirf.org/italian/menu2/documentazione/bibliotecacirf.html^|~http\://www.cirf.org/italian/menu1/cirf/chisiamo.html^|~http\://www.cirf.org/images/facebook_32.png^|~http\://www.cirf.org/italian/menu1/larivista/ultimonumero.html^|~http\://www.cirf.org/italian/menu1/cirf/gliassociati.html^|~http\://www.cirf.org/italian/menu1/comeiscriversi/^|~http\://www.cirf.org/italian/menu2/causedegrado/^|~http\://www.algoritma.it^|~http\://www.cirf.org/italian/menu1/larivista/^|~http\://www.cirf.org/italian/menu1/cirf/lepersoneeiruoli.html^|~http\://www.cirf.org/img/ecrr-logo.gif^|~http\://www.cirf.org/italian/menu1/cirf/cosavogliamo.html^|~http\://www.cirf.org/italian/menu1/larivista/cose.html^|~http\://www.cirf.org/italian/Appuntamenti/CIRF/ecrr2014.html^|~http\://www.cirf.org/italian/menu1/comeiscriversi/come.html^|~http\://www.cirf.org/italian/menu2/Come 
stanno/^|~http\://www.cirf.org/italian/menu2/documentazione/tesidilaurea.html^|~http\://www.cirf.org/img/logo_linkedin.png^|~http\://www.cirf.org/italian/menu1/cirf/comecifinanziamo.html^|~http\://www.cirf.org/img/algoritma.gif^|~http\://www.cirf.org/italian/menu2/Lineeazione/^|~http\://www.cirf.org/italian/menu1/attivita/^|~http\://www.cirf.org/italian/menu1/attivita/viaggistudio.html^|~http\://www.cirf.org/italian/menu1/attivita/corsi.html^|~http\://www.cirf.org/img/serelarefa.png^|~http\://www.cirf.org/italian/menu1/attivita/appuntamenti-CIRF.html^|~http\://www.cirf.org/italian/other/privacy/privacy.html^|~http\://www.cirf.org/italian/menu1/larivista/scaricalarivista.html^|~http\://www.cirf.org/img/logo_youtube.png^|~http\://www.cirf.org/img/logo_life.gif^|~http\://www.cirf.org/rf2012/index.html^|~http\://www.cirf.org/italian/menu2/documentazione/Articoli e scritti.html^|~http\://www.cirf.org/rf2012/atti_convegno.html^|~http\://www.cirf.org/italian/menu1/attivita/progetti.html^|~http\://www.linkedin.com/company/cirf---centro-italiano-per-la-riqualificazione-fluviale?trk\=company_name^|~http\://www.cirf.org/images/separator.gif^|~http\://www.cirf.org/images/waterdiss.jpg^|~http\://www.cirf.org/italian/menu1/cirf/statuto.html^|~http\://www.serelarefa.com/^|~http\://www.cirf.org/italian/menu2/documentazione/^|~http\://www.cirf.org/italian/menu2/Appuntamenti/^|~http\://www.cirf.org/italian/menu1/attivita/news-archivio.html^|~http\://www.waterdiss.eu/^|~http\://www.cirf.org/italian/menu1/cirf/curriculumcirf.html
Date=Tue, 24 Feb 2015 18\:28\:28 GMT
X-Parsed-By=org.apache.tika.parser.DefaultParser^|~org.apache.tika.parser.html.HtmlParser
author=ALGORITMA
Content-Location=http\://www.cirf.org/italian/menu1/cirf/
Cache-Control=no-store, no-cache, must-revalidate
Content-Encoding=UTF-8
collector.depth=1
keywords=Centro Italiano per la Riqualificazione Fluviale, associazione culturale, criteri della riqualificazione fluviale dei corsi d'acqua, gestione dei corsi d'acqua in Italia, progetti e interventi a carattere innovativo, malessere dei nostri fiumi e del territorio, evitare politiche e comportamenti miopi, promuovere la riqualificazione fluviale, corsi di formazione, viaggi studio, convegni e seminari
Pragma=no-cache
Expires=Wed, 31 Dec 1969 23\:59\:59 GMT
Content-Type=text/html;charset\=ISO-8859-1^|~text/html; charset\=UTF-8
Server=Apache
ETag=W/"81-1231418349000"
Last-Modified=Thu, 08 Jan 2009 12\:39\:09 GMT
document.reference=http\://www.cirf.org/italian/menu1/cirf/
collector.referrer-reference=http\://www.cirf.org/italian/home.html
collector.content-encoding=text/html;ISO-8859-1
Vary=Accept-Encoding
collector.content-type=text/html
description=Il CIRF (Centro Italiano per la Riqualificazione Fluviale) � un'associazione culturale tecnico-scientifica senza fini di lucro fondata nel luglio 1999.
document.contentFamily=html
X-UA-Compatible=IE\=EmulateIE7
collector.referrer-link-tag=a.href
Connection=close
Content-Language=it
dc\:title=L'associazione - CIRF
title=L'associazione - CIRF
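
A quick way to check which referrer fields are present in a meta file like the one above is a small Python script. This is only a sketch under stated assumptions: it treats the file as simple `key=value` lines with backslash escapes (as in `http\://`) and `^|~` as the multi-value separator, both inferred from the dump above, and it does not handle the full Java properties format (line continuations, `\uXXXX`, and so on). The function names and the shortened sample are hypothetical.

```python
MULTI_VALUE_SEP = "^|~"  # separator observed between values in the dump

REFERRER_FIELDS = [
    "collector.referrer-reference",
    "collector.referrer-link-tag",
    "collector.referrer-link-text",
]


def unescape(s):
    """Drop the backslash from simple escapes such as \\: and \\=."""
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(s[i + 1])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


def split_key_value(line):
    """Split a line at the first '=' that is not backslash-escaped."""
    i = 0
    while i < len(line):
        if line[i] == "\\":
            i += 2          # skip the escaped character
            continue
        if line[i] == "=":
            return line[:i], line[i + 1:]
        i += 1
    return line, ""


def parse_meta(text):
    """Return {field: [values]} for a simplified properties-style dump."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue        # skip blanks and comment lines
        key, value = split_key_value(line)
        meta[unescape(key)] = [unescape(v) for v in value.split(MULTI_VALUE_SEP)]
    return meta


# Shortened excerpt of the meta file above.
SAMPLE_META = r"""#
#Tue Feb 24 19:28:27 CET 2015
collector.referrer-reference=http\://www.cirf.org/italian/home.html
collector.referrer-link-tag=a.href
dc\:title=L'associazione - CIRF
X-Parsed-By=org.apache.tika.parser.DefaultParser^|~org.apache.tika.parser.html.HtmlParser
"""

meta = parse_meta(SAMPLE_META)
missing = [f for f in REFERRER_FIELDS if f not in meta]
print(missing)  # ['collector.referrer-link-text']
```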

Maybe the configuration file is wrong; here it is:

<?xml version="1.0" encoding="UTF-8"?>
<!-- 
   Copyright 2010-2014 Norconex Inc.

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<httpcollector id="Crawl">

  #set($core      = "com.norconex.collector.core")
  #set($http      = "com.norconex.collector.http")
  #set($committer = "com.norconex.committer")

  #set($httpClientFactory = "${http}.client.impl.GenericHttpClientFactory")
  #set($filterExtension   = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef    = "${core}.filter.impl.RegexReferenceFilter")
  #set($filterRegexMeta   = "${core}.filter.impl.RegexMetadataFilter")
  #set($robotsTxt         = "${http}.robot.impl.StandardRobotsTxtProvider")
  #set($robotsMeta        = "${http}.robot.impl.StandardRobotsMetaProvider")
  #set($metaFetcher       = "${http}.fetch.impl.GenericMetadataFetcher")
  #set($docFetcher        = "${http}.fetch.impl.GenericDocumentFetcher")
  #set($linkExtractor     = "${http}.url.impl.HtmlLinkExtractor")
  #set($urlNormalizer     = "${http}.url.impl.GenericURLNormalizer")
  #set($sitemapFactory    = "${http}.sitemap.impl.StandardSitemapResolverFactory")
  #set($metaChecksummer   = "${http}.checksum.impl.HttpMetadataChecksummer")
  #set($docChecksummer    = "${core}.checksum.impl.MD5DocumentChecksummer")
  #set($dataStoreFactory  = "${core}.data.store.impl.mapdb.MapDBCrawlDataStoreFactory")

  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>

  <crawlerDefaults>

    <urlNormalizer class="$urlNormalizer" />
    <numThreads>8</numThreads>
    <maxDepth>1</maxDepth>
    <workDir>$workdir</workDir>
    <orphansStrategy>DELETE</orphansStrategy>

    <!-- To ignore robots.txt files -->
    <robotsTxt ignore="true" />

    <!-- To ignore in-page robot rules -->
    <robotsMeta ignore="true" />

    <urlNormalizer class="$urlNormalizer">
      <normalizations>
        lowerCaseSchemeHost, upperCaseEscapeSequence 
        decodeUnreservedCharacters, removeDefaultPort 
      </normalizations>
      <replacements>
        <replace>
          <match>&amp;view=print</match>
          <replacement>&amp;view=html</replacement>
        </replace>
      </replacements>
    </urlNormalizer>

    <referenceFilters>
      <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js</filter>
      <filter class="$filterRegexRef" onMatch="exclude">http://www\.adobe\.com/.*</filter>
    </referenceFilters>

    <delay default="1000" />

    <startURLs>
      <url>http://www.cirf.org/italian/home.html</url>
    </startURLs>

  </crawlerDefaults>

  <crawlers>

<crawler id="SOLR">

  <linkExtractors>
      <extractor class="${linkExtractor}"  maxURLLength="2048" 
          ignoreNofollow="false" keepReferrerData="true">
      <contentTypes>
          text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
      </contentTypes>
      <tags>
          <tag name="a" attribute="href" />
          <tag name="frame" attribute="src" />
          <tag name="iframe" attribute="src" />
          <tag name="img" attribute="src" />
          <tag name="meta" attribute="http-equiv" />
      </tags>
      </extractor>
  </linkExtractors>

  <importer>

    <preParseHandlers>

      <transformer class="com.norconex.importer.handler.transformer.impl.ReduceConsecutivesTransformer">
        <reduce>\s</reduce>
        <reduce>\n</reduce>
        <reduce>\s\n</reduce>
      </transformer>    

    </preParseHandlers>

    <postParseHandlers>

      <tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
      <copy fromField="Generator" toField="generator" overwrite="true" />
      <!-- multiple copy tags allowed -->

      <restrictTo caseSensitive="false"
          field="*enerator">
      </restrictTo>
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->

      </tagger>

      <tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
      <copy fromField="Author" toField="author" overwrite="true" />
      <!-- multiple copy tags allowed -->

      <restrictTo caseSensitive="false"
          field="*uthor">
      </restrictTo>
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->

      </tagger>

    </postParseHandlers>

  </importer>

  <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
    <directory>${workdir}/crawledFilesX</directory>
  </committer>

</crawler>

  </crawlers>

</httpcollector>

Thanks again for your time

essiembre commented 9 years ago

I ran with exactly your configuration and I got it with the latest snapshot:

collector.referrer-link-text=L&\#39;associazione

How did you install the latest snapshot? Did you install it over the existing installation? If so, you may have different versions of the same JARs. Check the norconex-* jars in particular, and make sure you have only one version of every jar in the lib folder.
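
As a side note on the value shown above: it appears to be the link text in escaped form. Assuming the backslash before `#` comes from the properties-style escaping of the meta file and `&#39;` is the HTML character reference for an apostrophe, a short Python sketch (the function name is hypothetical) recovers the readable text:

```python
import html


def decode_meta_value(raw):
    """Undo the assumed properties escape (\\#) and then the HTML entity."""
    without_escape = raw.replace("\\#", "#")
    return html.unescape(without_escape)


print(decode_meta_value(r"L&\#39;associazione"))  # L'associazione
```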

MirtoBusico commented 9 years ago

Well, I installed over an existing installation. I copied the content of the importer snapshot lib over the collector lib and chose to overwrite files during the copy, so I'm fairly sure there is only one copy of each library.

Thanks for the fix

essiembre commented 9 years ago

My bad, I should have given you the latest snapshot of HTTP Collector, not just its importer module. I just made a new snapshot release for it. You can download it here.

Can you try with that release?

Note: If you install over an existing installation, your current method only overwrites files that have exactly the same name. Some files in the lib folder will have a different name (for example, a different version number). If you list the files in the lib directory sorted by name, you will quickly spot where there are several versions of the same file.
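
The check described in the note can also be scripted. The following Python sketch groups jar files by artifact name and reports those present in more than one version. It assumes jars are named `<artifact>-<version>.jar` with a version that starts with a digit; the file names in the example are hypothetical, and jars that do not follow this pattern are simply ignored.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming pattern: <artifact>-<version>.jar, version starting with a digit.
JAR_PATTERN = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicates(filenames):
    """Return {artifact: [versions]} for artifacts found in several versions."""
    groups = defaultdict(list)
    for name in filenames:
        match = JAR_PATTERN.match(name)
        if match:
            groups[match.group("artifact")].append(match.group("version"))
    return {a: sorted(v) for a, v in groups.items() if len(v) > 1}


def find_duplicates_in_dir(lib_dir):
    """Same check, applied to the *.jar files of a directory."""
    return find_duplicates(p.name for p in Path(lib_dir).glob("*.jar"))


# Hypothetical lib folder content after copying a snapshot over an install.
example = [
    "norconex-importer-2.0.2.jar",
    "norconex-importer-2.1.0-SNAPSHOT.jar",
    "norconex-collector-http-2.0.2.jar",
]
print(find_duplicates(example))
# {'norconex-importer': ['2.0.2', '2.1.0-SNAPSHOT']}
```

Calling `find_duplicates_in_dir("lib")` from the collector installation directory would run the same check on the real folder; deciding which version to remove or archive is left to the reader.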

essiembre commented 9 years ago

Norconex HTTP Collector 2.1.0 was released. Closing.