Closed MirtoBusico closed 9 years ago
By design, collector.referrer-link-text is only extracted from <a href="..."></a> tags. The rationale is that the other tags are unlikely to have valuable body text: they would either have none, or contain other non-word HTML. Do you see other tags that could surround good text? If so, we can make this issue a feature request.
As for the title, you should get it already when present. If you confirm you have some href tags with a title attribute and it does not get picked up, I would consider it a bug and look into it.
Well, I'm trying to learn how to use the collector. To start, I'm analyzing the URL http://www.cirf.org/italian/home.html
If I understood correctly, I should be able to get "L'associazione" in collector.referrer-link-text for this source:
<ul id="hmenu" class="hmenu">
<li>
<a href="/italian/menu1/cirf/">
L'associazione
</a>
<ul style="display: block;"></ul>
</li>
But it seems I'm not able to extract the text of this link.
If it is useful, I can post the full .xml here.
Thanks for your time.
I can reproduce the problem. You should be getting the text. I am marking this one as a bug.
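For reference, the value one would expect for the snippet above can be sketched with Python's standard html.parser. This is only an illustration of "anchor text with surrounding whitespace trimmed", not the collector's own extraction code; the class and function names are made up for the example, and a closing </ul> was added so the fragment is complete.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects (href, text) pairs for every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the anchor currently open, if any
        self._chunks = []    # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse newlines and repeated spaces around the link text.
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._href = None


SNIPPET = """
<ul id="hmenu" class="hmenu">
<li>
<a href="/italian/menu1/cirf/">
L'associazione
</a>
<ul style="display: block;"></ul>
</li>
</ul>
"""


def extract_links(html_text):
    """Return the (href, text) pairs found in html_text."""
    parser = AnchorTextParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(extract_links(SNIPPET))  # [('/italian/menu1/cirf/', "L'associazione")]
```

The line breaks around the link text in the page source are the kind of detail that can trip up a stricter extractor, which is why the sketch normalizes whitespace before storing the text.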
Fixed in "develop" branch. Let me know if you need a snapshot release.
The fix is available now in a new snapshot release.
Please give it a try.
To install it, download 2.1.0-SNAPSHOT (http://www.norconex.com/collectors/importer/download) and copy its lib directory over the lib directory found in your collector installation. Review the jars in the target directory and take out any duplicates you find (removing or archiving the older jar versions).
I'll try asap
Maybe I'm doing something wrong. I installed the snapshot and directed the output to the filesystem. In the meta file for "L'associazione" I find only:
collector.referrer-reference=http\://www.cirf.org/italian/home.html
collector.referrer-link-tag=a.href
Here the complete meta:
#
#Tue Feb 24 19:28:27 CET 2015
document.contentType=text/html
collector.referenced-urls=http\://www.restorerivers.eu/^|~http\://www.youtube.com/user/CIRFcomunicazione^|~http\://www.facebook.com/cirf.org^|~http\://www.cirf.org/italian/menu1/attivita/pubblicazioni.html^|~http\://www.cirf.org/italian/menu1/comeiscriversi/perche.html^|~http\://www.cirf.org/italian/menu2/documentazione/Circolari.html^|~http\://www.cirf.org/italian/menu2/Comevalutastatoecologico/^|~http\://www.cirf.org/italian/menu1/attivita/news/ecrr2013feedback.html^|~http\://www.cirf.org/italian/menu2/coseRF/^|~http\://www.cirf.org/images/logo-Convegno.jpg^|~http\://www.cirf.org/img/logo_ecrr_generico.png^|~http\://www.cirf.org/italian/menu2/documentazione/bibliotecacirf.html^|~http\://www.cirf.org/italian/menu1/cirf/chisiamo.html^|~http\://www.cirf.org/images/facebook_32.png^|~http\://www.cirf.org/italian/menu1/larivista/ultimonumero.html^|~http\://www.cirf.org/italian/menu1/cirf/gliassociati.html^|~http\://www.cirf.org/italian/menu1/comeiscriversi/^|~http\://www.cirf.org/italian/menu2/causedegrado/^|~http\://www.algoritma.it^|~http\://www.cirf.org/italian/menu1/larivista/^|~http\://www.cirf.org/italian/menu1/cirf/lepersoneeiruoli.html^|~http\://www.cirf.org/img/ecrr-logo.gif^|~http\://www.cirf.org/italian/menu1/cirf/cosavogliamo.html^|~http\://www.cirf.org/italian/menu1/larivista/cose.html^|~http\://www.cirf.org/italian/Appuntamenti/CIRF/ecrr2014.html^|~http\://www.cirf.org/italian/menu1/comeiscriversi/come.html^|~http\://www.cirf.org/italian/menu2/Come 
stanno/^|~http\://www.cirf.org/italian/menu2/documentazione/tesidilaurea.html^|~http\://www.cirf.org/img/logo_linkedin.png^|~http\://www.cirf.org/italian/menu1/cirf/comecifinanziamo.html^|~http\://www.cirf.org/img/algoritma.gif^|~http\://www.cirf.org/italian/menu2/Lineeazione/^|~http\://www.cirf.org/italian/menu1/attivita/^|~http\://www.cirf.org/italian/menu1/attivita/viaggistudio.html^|~http\://www.cirf.org/italian/menu1/attivita/corsi.html^|~http\://www.cirf.org/img/serelarefa.png^|~http\://www.cirf.org/italian/menu1/attivita/appuntamenti-CIRF.html^|~http\://www.cirf.org/italian/other/privacy/privacy.html^|~http\://www.cirf.org/italian/menu1/larivista/scaricalarivista.html^|~http\://www.cirf.org/img/logo_youtube.png^|~http\://www.cirf.org/img/logo_life.gif^|~http\://www.cirf.org/rf2012/index.html^|~http\://www.cirf.org/italian/menu2/documentazione/Articoli e scritti.html^|~http\://www.cirf.org/rf2012/atti_convegno.html^|~http\://www.cirf.org/italian/menu1/attivita/progetti.html^|~http\://www.linkedin.com/company/cirf---centro-italiano-per-la-riqualificazione-fluviale?trk\=company_name^|~http\://www.cirf.org/images/separator.gif^|~http\://www.cirf.org/images/waterdiss.jpg^|~http\://www.cirf.org/italian/menu1/cirf/statuto.html^|~http\://www.serelarefa.com/^|~http\://www.cirf.org/italian/menu2/documentazione/^|~http\://www.cirf.org/italian/menu2/Appuntamenti/^|~http\://www.cirf.org/italian/menu1/attivita/news-archivio.html^|~http\://www.waterdiss.eu/^|~http\://www.cirf.org/italian/menu1/cirf/curriculumcirf.html
Date=Tue, 24 Feb 2015 18\:28\:28 GMT
X-Parsed-By=org.apache.tika.parser.DefaultParser^|~org.apache.tika.parser.html.HtmlParser
author=ALGORITMA
Content-Location=http\://www.cirf.org/italian/menu1/cirf/
Cache-Control=no-store, no-cache, must-revalidate
Content-Encoding=UTF-8
collector.depth=1
keywords=Centro Italiano per la Riqualificazione Fluviale, associazione culturale, criteri della riqualificazione fluviale dei corsi d'acqua, gestione dei corsi d'acqua in Italia, progetti e interventi a carattere innovativo, malessere dei nostri fiumi e del territorio, evitare politiche e comportamenti miopi, promuovere la riqualificazione fluviale, corsi di formazione, viaggi studio, convegni e seminari
Pragma=no-cache
Expires=Wed, 31 Dec 1969 23\:59\:59 GMT
Content-Type=text/html;charset\=ISO-8859-1^|~text/html; charset\=UTF-8
Server=Apache
ETag=W/"81-1231418349000"
Last-Modified=Thu, 08 Jan 2009 12\:39\:09 GMT
document.reference=http\://www.cirf.org/italian/menu1/cirf/
collector.referrer-reference=http\://www.cirf.org/italian/home.html
collector.content-encoding=text/html;ISO-8859-1
Vary=Accept-Encoding
collector.content-type=text/html
description=Il CIRF (Centro Italiano per la Riqualificazione Fluviale) è un'associazione culturale tecnico-scientifica senza fini di lucro fondata nel luglio 1999.
document.contentFamily=html
X-UA-Compatible=IE\=EmulateIE7
collector.referrer-link-tag=a.href
Connection=close
Content-Language=it
dc\:title=L'associazione - CIRF
title=L'associazione - CIRF
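A quick way to check a meta file like the one above for the expected fields is sketched below. It assumes the file is a Java-properties-style listing (backslash-escaped ":" and "=", "#" comment lines) and that "^|~" separates multiple values, as the dump suggests. It does not handle line continuations or \uXXXX escapes, and the function names are only for illustration.

```python
import re

MULTI_VALUE_SEP = "^|~"  # separator observed between values in the dump above


def _unescape(text):
    """Drop the backslash from simple escapes such as \\: and \\=."""
    return re.sub(r"\\(.)", r"\1", text)


def parse_meta(text):
    """Parse key=value lines into {key: [values]}, skipping blanks and comments."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' that is not preceded by a backslash.
        parts = re.split(r"(?<!\\)=", line, maxsplit=1)
        if len(parts) != 2:
            continue
        key, raw = parts
        fields[_unescape(key)] = [_unescape(v) for v in raw.split(MULTI_VALUE_SEP)]
    return fields


def missing_fields(fields, expected):
    """Return the expected field names that are absent from the parsed meta."""
    return [name for name in expected if name not in fields]


EXPECTED = [
    "collector.referrer-reference",
    "collector.referrer-link-tag",
    "collector.referrer-link-text",
]

SAMPLE = r"""
#
#Tue Feb 24 19:28:27 CET 2015
X-Parsed-By=org.apache.tika.parser.DefaultParser^|~org.apache.tika.parser.html.HtmlParser
collector.referrer-reference=http\://www.cirf.org/italian/home.html
collector.referrer-link-tag=a.href
dc\:title=L'associazione - CIRF
"""

if __name__ == "__main__":
    print(missing_fields(parse_meta(SAMPLE), EXPECTED))
    # ['collector.referrer-link-text']
```

Run against the full dump, the same check reports collector.referrer-link-text as the only missing referrer field, which matches what I see by eye.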
Maybe the configuration file is wrong; here it is:
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright 2010-2014 Norconex Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<httpcollector id="Crawl">
#set($core = "com.norconex.collector.core")
#set($http = "com.norconex.collector.http")
#set($committer = "com.norconex.committer")
#set($httpClientFactory = "${http}.client.impl.GenericHttpClientFactory")
#set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
#set($filterRegexRef = "${core}.filter.impl.RegexReferenceFilter")
#set($filterRegexMeta = "${core}.filter.impl.RegexMetadataFilter")
#set($robotsTxt = "${http}.robot.impl.StandardRobotsTxtProvider")
#set($robotsMeta = "${http}.robot.impl.StandardRobotsMetaProvider")
#set($metaFetcher = "${http}.fetch.impl.GenericMetadataFetcher")
#set($docFetcher = "${http}.fetch.impl.GenericDocumentFetcher")
#set($linkExtractor = "${http}.url.impl.HtmlLinkExtractor")
#set($urlNormalizer = "${http}.url.impl.GenericURLNormalizer")
#set($sitemapFactory = "${http}.sitemap.impl.StandardSitemapResolverFactory")
#set($metaChecksummer = "${http}.checksum.impl.HttpMetadataChecksummer")
#set($docChecksummer = "${core}.checksum.impl.MD5DocumentChecksummer")
#set($dataStoreFactory = "${core}.data.store.impl.mapdb.MapDBCrawlDataStoreFactory")
<progressDir>${workdir}/progress</progressDir>
<logsDir>${workdir}/logs</logsDir>
<crawlerDefaults>
<urlNormalizer class="$urlNormalizer" />
<numThreads>8</numThreads>
<maxDepth>1</maxDepth>
<workDir>$workdir</workDir>
<orphansStrategy>DELETE</orphansStrategy>
<!-- To ignore robots.txt files -->
<robotsTxt ignore="true" />
<!-- To ignore in-page robot rules -->
<robotsMeta ignore="true" />
<urlNormalizer class="$urlNormalizer">
<normalizations>
lowerCaseSchemeHost, upperCaseEscapeSequence
decodeUnreservedCharacters, removeDefaultPort
</normalizations>
<replacements>
<replace>
<match>&amp;view=print</match>
<replacement>&amp;view=html</replacement>
</replace>
</replacements>
</urlNormalizer>
<referenceFilters>
<filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js</filter>
<filter class="$filterRegexRef" onMatch="exclude">http://www\.adobe\.com/.*</filter>
</referenceFilters>
<delay default="1000" />
<startURLs>
<url>http://www.cirf.org/italian/home.html</url>
</startURLs>
</crawlerDefaults>
<crawlers>
<crawler id="SOLR">
<linkExtractors>
<extractor class="${linkExtractor}" maxURLLength="2048"
ignoreNofollow="false" keepReferrerData="true">
<contentTypes>
text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
</contentTypes>
<tags>
<tag name="a" attribute="href" />
<tag name="frame" attribute="src" />
<tag name="iframe" attribute="src" />
<tag name="img" attribute="src" />
<tag name="meta" attribute="http-equiv" />
</tags>
</extractor>
</linkExtractors>
<importer>
<preParseHandlers>
<transformer class="com.norconex.importer.handler.transformer.impl.ReduceConsecutivesTransformer">
<reduce>\s</reduce>
<reduce>\n</reduce>
<reduce>\s\n</reduce>
</transformer>
</preParseHandlers>
<postParseHandlers>
<tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
<copy fromField="Generator" toField="generator" overwrite="true" />
<!-- multiple copy tags allowed -->
<restrictTo caseSensitive="false"
field="*enerator">
</restrictTo>
<!-- multiple "restrictTo" tags allowed (only one needs to match) -->
</tagger>
<tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
<copy fromField="Author" toField="author" overwrite="true" />
<!-- multiple copy tags allowed -->
<restrictTo caseSensitive="false"
field="*uthor">
</restrictTo>
<!-- multiple "restrictTo" tags allowed (only one needs to match) -->
</tagger>
</postParseHandlers>
</importer>
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
<directory>${workdir}/crawledFilesX</directory>
</committer>
</crawler>
</crawlers>
</httpcollector>
Thanks again for your time
I ran with exactly your configuration and I got it with the latest snapshot:
collector.referrer-link-text=L&\#39;associazione
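The value looks odd because it appears to be escaped twice: the backslash before "#" is a properties-file escape, and "&#39;" is the HTML character reference for an apostrophe. A minimal sketch of undoing both layers, assuming that reading is right (the function name is made up for the example, and only simple one-character backslash escapes are handled):

```python
import html
import re


def decode_link_text(raw):
    """Undo properties-style backslash escapes, then HTML character references."""
    without_backslashes = re.sub(r"\\(.)", r"\1", raw)  # "\#" -> "#"
    return html.unescape(without_backslashes)           # "&#39;" -> "'"


if __name__ == "__main__":
    print(decode_link_text(r"L&\#39;associazione"))  # L'associazione
```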
When you used the latest snapshot, how did you go about it? Did you install it over the existing installation? If so, you may have different versions of the same jars. Check the norconex-* jars in particular. Make sure you have only one version of every jar in the lib folder.
Well, I installed over an existing installation. I copied the importer snapshot lib content over the collector lib. I chose to overwrite the files during the copy, so I'm quite sure that there is only one copy of each lib.
Thanks for the fix
My bad, I should have given you the latest snapshot of HTTP Collector, not just its importer module. I just made a new snapshot release for it. You can download it here.
Can you try with that release?
Note: If you overwrite an existing install, your current method will only overwrite files that have exactly the same name. Some files in the lib folder will have a different name because the version is part of it. If you list the files in the lib directory sorted by name, you will quickly spot where there are several versions of the same file.
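If you would rather not eyeball the listing, the check can be scripted. The sketch below assumes jar files are named <artifact>-<version>.jar with a version that starts with a digit, which may not hold for every jar; the file names in the example and the default "lib" path are hypothetical.

```python
import re
import sys
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme: <artifact>-<version>.jar, version starting with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicate_jars(names):
    """Return {artifact: [versions]} for artifacts present in several versions."""
    by_artifact = defaultdict(list)
    for name in names:
        match = JAR_NAME.match(name)
        if match:
            by_artifact[match.group("artifact")].append(match.group("version"))
    return {a: sorted(v) for a, v in by_artifact.items() if len(v) > 1}


if __name__ == "__main__":
    # Usage: python find_duplicate_jars.py path/to/collector/lib
    lib_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("lib")
    duplicates = find_duplicate_jars(p.name for p in lib_dir.glob("*.jar"))
    for artifact, versions in sorted(duplicates.items()):
        print(f"{artifact}: {', '.join(versions)}")
```

For a lib folder containing, say, norconex-importer-2.0.2.jar and norconex-importer-2.1.0-SNAPSHOT.jar, this reports norconex-importer with both versions, and the older jar is the one to remove or archive.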
Norconex HTTP Collector 2.1.0 was released. Closing.
Hi, I'm trying to gather information about links: the text near the anchor. I'm using norconex-collector-http-2.0.2.zip with openjdk-7.
I have this definition:
But in the Solr repository I find only these fields filled:
collector.referrer-link-tag
collector.referrer-reference
What am I doing wrong?