Do you want to extract the URLs, or really keep the link tags? All the URLs found in a page are already stored with the document metadata in a field called collector.referenced-urls, so it may be easier to just grab that.
If you really want to keep some of the tags, I can think of a few options. One is to tell the parser not to parse HTML at all, which keeps the raw markup as the document content:
<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <ignoredContentTypes>text/html</ignoredContentTypes>
</documentParserFactory>
Another option is to use the ReplaceTransformer to convert the links you want into text markers before parsing:
<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
  <replace>
    <fromValue><![CDATA[<a.*?href="(.*?)".*?>]]></fromValue>
    <toValue>URL_START $1 URL_END</toValue>
  </replace>
</transformer>
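If it helps, here is roughly where that transformer would go so it runs on the raw HTML before parsing. A minimal sketch, assuming the Importer 2.x <preParseHandlers> syntax (adapt to your own crawler config):
<importer>
  <preParseHandlers>
    <!-- Runs before parsing, so the raw HTML tags are still present. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
      <replace>
        <fromValue><![CDATA[<a.*?href="(.*?)".*?>]]></fromValue>
        <toValue>URL_START $1 URL_END</toValue>
      </replace>
    </transformer>
  </preParseHandlers>
</importer>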
Thanks @essiembre.
The ReplaceTransformer option is the one I was seeking. I guess $1 (or $X) represents the group number X in the regex pattern of <fromValue>. The URL_START/URL_END is a good idea.
I do not understand the collector.referenced-urls field in the committed document. The type is a single String, but you describe it as all the URLs on a page. Should it not be a list or array? There are multiple URLs found on a page.
I did look at ReplaceTransformer but was not sure about the following. The documentation for ReplaceTransformer shows this XML configuration:
<restrictTo caseSensitive="[false|true]"
field="(name of header/metadata field name to match)">
(regular expression of value to match)
</restrictTo>
Is the header a Norconex field? Maybe the HttpCollector could offer an option to include HTML elements (feature request?).
Thanks.
The collector.referenced-urls field is a multi-value field. It only looks like a single string because of how it is stored on file. That file is an internal storage format, and you normally do not deal with it directly. You deal with a Committer that will read the internal files and convert them to whatever format is required by your target repository (so that's transparent). If you are creating your own Committer implementation, you will get all metadata fields/values as a special Java Map where the values are Lists (multi-value).
You can see metadata as all the properties that could be gathered during the processing of a document: basically everything but the document content itself. So for an HTTP crawl, it would contain the HTTP header fields. Importing will also add its own fields (like the <meta ...> field/values found in HTML files). You can add your own fields too. Those fields are typically what ends up being stored as fields in your target repository. The content is stored separately as a stream (since it can be huge).
The field in the <restrictTo ...> can be any of the above. To find out what fields are present at any given time in the importer module, you can use the DebugTagger, which will log those fields for each document:
<tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="WARN" />
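For example, to restrict the earlier ReplaceTransformer to HTML documents only, you could match on one of those fields. A minimal sketch on my end, assuming the document.contentType field is present (the DebugTagger will confirm):
<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
  <restrictTo caseSensitive="false" field="document.contentType">
    text/html
  </restrictTo>
  <replace>
    <fromValue><![CDATA[<a.*?href="(.*?)".*?>]]></fromValue>
    <toValue>URL_START $1 URL_END</toValue>
  </replace>
</transformer>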
The DebugTagger was very useful.
I think I found a bug with the Elasticsearch committer.
The collector.referenced-urls field is multi-valued in the crawler but is a single value in the Elasticsearch document.
Here is the log from the crawler when it finds the URL http://redis.io/download:
WARN - DebugTagger - Status=200 OK
WARN - DebugTagger - Keep-Alive=timeout=15, max=100
WARN - DebugTagger - collector.content-type=text/html
WARN - DebugTagger - document.contentFamily=html
WARN - DebugTagger - Connection=Keep-Alive
WARN - DebugTagger - Date=Wed, 17 Jun 2015 18:23:43 GMT
WARN - DebugTagger - Via=1.1 redis.io
WARN - DebugTagger - document.reference=http://redis.io/download
WARN - DebugTagger - document.contentType=text/html
WARN - DebugTagger - Vary=Accept-Encoding
WARN - DebugTagger - collector.depth=1
WARN - DebugTagger - collector.referenced-urls=http://redis.io/commands, http://redis.io/topics/license, http://redis.io/images/redis-white.png, http://redis.io/support, http://download.redis.io/releases/redis-3.0.2.tar.gz, https://raw.githubusercontent.com/antirez/redis/3.0/00-RELEASENOTES, https://github.com/antirez/redis-hashes/blob/master/README, http://redis.io/, https://github.com/antirez/redis/archive/unstable.tar.gz, https://github.com/MSOpenTech/redis, https://raw.githubusercontent.com/antirez/redis/2.6/00-RELEASENOTES, https://code.google.com/p/redis/downloads/list?can=1, https://github.com/antirez/redis-io, http://redis.io/images/pivotal.png, http://redis.io/documentation, http://www.pivotal.io/big-data/redis, http://redis.io/download, http://redis.io/topics/sponsors, https://raw.githubusercontent.com/antirez/redis/2.8/00-RELEASENOTES, http://download.redis.io/releases/redis-2.6.17.tar.gz, http://download.redis.io/redis-stable.tar.gz, http://redis.io/clients, https://github.com/antirez/redis/archive/2.8.21.tar.gz, http://download.redis.io/redis-stable, http://redis.io/community, http://try.redis-db.com
WARN - DebugTagger - Content-Type=text/html
Note the two fields:
document.reference=http://redis.io/download
collector.referenced-urls=http://redis.io/commands, ..., http://try.redis-db.com
When I query Elasticsearch I find the document:
"_type": "webdoc",
"_id": "http://redis.io/download",
"_score": 1.1190807,
"_source": {
"Status": "200 OK",
"collector.content-type": "text/html",
"Keep-Alive": "timeout=15, max=100",
"X-Parsed-By": "org.apache.tika.parser.html.HtmlParser",
"document.contentFamily": "html",
"Content-Location": "http://redis.io/download",
"Connection": "Keep-Alive",
"title": "Redis",
"Date": "Wed, 17 Jun 2015 13:38:58 GMT",
"content": "...",
"Via": "1.1 redis.io",
"document.reference": "http://redis.io/download",
"viewport": "width=device-width, minimum-scale=1.0, maximum-scale=1.0",
"dc:title": "Redis",
"document.contentType": "text/html",
"Content-Encoding": "UTF-8",
"Vary": "Accept-Encoding",
"collector.depth": "1",
"collector.referenced-urls": "http://try.redis-db.com",
"Content-Type": "text/html; charset=UTF-8"
}
Notice "collector.referenced-urls": "http://try.redis-db.com"
. Only the last element of the crawler field is stored. I think the Elasticsearch committer dropped all but the last element of the collector.referenced-urls
from the crawler.
On a side note, Content-Location is the same as document.reference; they mean the same thing. Same with collector.content-type, document.contentType, and Content-Type.
Thanks
@pascaldimassimo, as you are more familiar with Elasticsearch, can you confirm whether a field needs to be defined as multi-value in order to accept multiple values?
@yvesnyc about the fields that have the same value, that's nothing unusual. See, all fields starting with collector.* or document.* are fields created and populated by the Collector (or Importer). Everything else is stored as it is found. In your examples, Content-Location and Content-Type come from the HTTP response headers when fetching a document. Even if those were ever missing, the collector and importer ones should always be there. For instance, you would not have HTTP header fields if you were to use the Filesystem Collector.
The collector/importer adds all the metadata it can find, but that's usually way too much for what you need. Normally, you probably want to restrict the fields to only what you need, using either KeepOnlyTagger or DeleteTagger in your <importer> section (most often as a post-parse handler).
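For example, a minimal sketch on my part (the field list is just an illustration using fields from your output, and I am assuming the 2.x fields attribute syntax for KeepOnlyTagger):
<importer>
  <postParseHandlers>
    <!-- Keep only the fields you actually need in your target repository. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
        fields="document.reference,title,content,collector.referenced-urls"/>
  </postParseHandlers>
</importer>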
elasticsearch-committer was not handling array values properly. I've committed a fix in the develop branch.
I made an Elasticsearch Committer 2.0.2 snapshot release with the fix. If you have additional Elasticsearch issues, please report them here.
@essiembre @pascaldimassimo Thanks. I appreciate it.
I'm using the latest Norconex HTTP Collector. By default the importer removes HTML elements and just keeps the body text. How do I configure it to keep specific HTML elements? For example, I would like the parsing output to include link elements, so that the href URL value would be found in the same relative position in the text.
Thanks.