Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Html elements import #15

Closed: yvesnyc closed this issue 9 years ago

yvesnyc commented 9 years ago

I'm using the latest Norconex HTTP Collector. By default the Importer removes HTML elements and keeps only the body text. How do I configure it to keep specific HTML elements? For example, I would like the parsing output to include elements with a URL, like the following:

<a class='download-link' href='http://download.redis.io/releases/redis-3.0.1.tar.gz'>.

Thus, the href URL value would be found in the same relative position in the text.

Thanks.

essiembre commented 9 years ago

Do you want to extract the URLs, or really keep the link tags? All URLs in a page are stored with the document metadata in a field called collector.referenced-urls, so it may be easier to just grab that.

If you really want to keep some of the tags, I can think of a few options.

One option is to prevent HTML documents from being parsed at all, so the raw markup is kept, by having the GenericDocumentParserFactory ignore the text/html content type:

  <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
      <ignoredContentTypes>text/html</ignoredContentTypes>
  </documentParserFactory>

Another option is to use a ReplaceTransformer before parsing to replace the link tags with text markers that will survive the parsing:

  <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
      <replace>
          <fromValue><![CDATA[<a.*?href="(.*?)".*?>]]></fromValue>
          <toValue>URL_START $1 URL_END</toValue>
      </replace>
  </transformer>
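For the transformer option to work, it has to run before the HTML is parsed (parsing is what strips the markup). A sketch of where it would sit in the <importer> section of your collector configuration, using the 2.x pre-parse handler element:

  <importer>
      <preParseHandlers>
          <!-- Runs before parsing, while the raw HTML markup is still present. -->
          <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
              <replace>
                  <fromValue><![CDATA[<a.*?href="(.*?)".*?>]]></fromValue>
                  <toValue>URL_START $1 URL_END</toValue>
              </replace>
          </transformer>
      </preParseHandlers>
  </importer>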
yvesnyc commented 9 years ago

Thanks @essiembre.

The ReplaceTransformer option is the one I was seeking. I guess the $1 (or $X) represents capture group number X in the regex pattern of <fromValue>. The URL_START/URL_END is a good idea.
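For instance, checking my understanding with plain java.util.regex (assuming ReplaceTransformer uses standard Java regex replacement semantics):

  // Standalone check of the pattern and the $1 group reference.
  public class GroupDemo {
      public static void main(String[] args) {
          String html = "<a class='download-link' href=\"http://download.redis.io/releases/redis-3.0.1.tar.gz\">";
          String out = html.replaceAll("<a.*?href=\"(.*?)\".*?>", "URL_START $1 URL_END");
          // Prints: URL_START http://download.redis.io/releases/redis-3.0.1.tar.gz URL_END
          System.out.println(out);
      }
  }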

I do not understand the collector.referenced-urls field in the committed document. Its type is a single String, but you describe it as holding all URLs on the page. Should it not be a list or array? There are multiple URLs found on a page.

I did look at ReplaceTransformer but was not sure about the following. The documentation for ReplaceTransformer shows this XML configuration:

  <restrictTo caseSensitive="[false|true]"
          field="(name of header/metadata field name to match)">
      (regular expression of value to match)
  </restrictTo>

Is the header a Norconex field? Maybe the HTTP Collector could have an option to include HTML elements (feature request?).

Thanks.

essiembre commented 9 years ago

The collector.referenced-urls field is a multi-value field. It only looks like a single string when you look at how it is stored on file. That file is an internal storage format you normally do not deal with directly. Instead, you deal with a Committer, which reads the internal files and converts them to whatever format is required by your target repository (so that's transparent to you). If you are creating your own Committer implementation, you will get all metadata fields/values as a special Java Map where the values are Lists (multi-value).
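Conceptually, something like this (a simplified sketch, not the actual Committer interface; the real implementation receives a dedicated Properties type, but it behaves like a Map of String to List of String):

  import java.util.List;
  import java.util.Map;

  public class MetadataSketch {
      // Every metadata field maps to a List of values, even single-value ones.
      static void printReferencedUrls(Map<String, List<String>> metadata) {
          List<String> urls = metadata.get("collector.referenced-urls");
          if (urls != null) {
              for (String url : urls) {
                  System.out.println(url); // one entry per URL found on the page
              }
          }
      }
  }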

You can see metadata as all the properties that could be gathered during the processing of a document: basically everything but the document content itself. So for an HTTP crawl, it would contain the HTTP header fields. Importing will also add its own fields (like the <meta ...> field/values found in HTML files). You can add your own fields too. Those fields are typically what ends up being stored as fields in your target repository. The content is stored separately, as a stream (since it can be huge).

The field in <restrictTo ...> can be any of the above. To find out what fields are present at any given time in the importer module, you can use the DebugTagger, which will log those fields for each document:

 <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="WARN" />
yvesnyc commented 9 years ago

The DebugTagger was very useful.

I think I found a bug with the Elasticsearch committer.

The collector.referenced-urls field is multi-valued in the crawler but holds only a single value in the Elasticsearch document.

Here is the log from the crawler when it processes the URL http://redis.io/download:

WARN  - DebugTagger                - Status=200 OK
WARN  - DebugTagger                - Keep-Alive=timeout=15, max=100
WARN  - DebugTagger                - collector.content-type=text/html
WARN  - DebugTagger                - document.contentFamily=html
WARN  - DebugTagger                - Connection=Keep-Alive
WARN  - DebugTagger                - Date=Wed, 17 Jun 2015 18:23:43 GMT
WARN  - DebugTagger                - Via=1.1 redis.io
WARN  - DebugTagger                - document.reference=http://redis.io/download
WARN  - DebugTagger                - document.contentType=text/html
WARN  - DebugTagger                - Vary=Accept-Encoding
WARN  - DebugTagger                - collector.depth=1
WARN  - DebugTagger                - collector.referenced-urls=http://redis.io/commands, http://redis.io/topics/license, http://redis.io/images/redis-white.png, http://redis.io/support, http://download.redis.io/releases/redis-3.0.2.tar.gz, https://raw.githubusercontent.com/antirez/redis/3.0/00-RELEASENOTES, https://github.com/antirez/redis-hashes/blob/master/README, http://redis.io/, https://github.com/antirez/redis/archive/unstable.tar.gz, https://github.com/MSOpenTech/redis, https://raw.githubusercontent.com/antirez/redis/2.6/00-RELEASENOTES, https://code.google.com/p/redis/downloads/list?can=1, https://github.com/antirez/redis-io, http://redis.io/images/pivotal.png, http://redis.io/documentation, http://www.pivotal.io/big-data/redis, http://redis.io/download, http://redis.io/topics/sponsors, https://raw.githubusercontent.com/antirez/redis/2.8/00-RELEASENOTES, http://download.redis.io/releases/redis-2.6.17.tar.gz, http://download.redis.io/redis-stable.tar.gz, http://redis.io/clients, https://github.com/antirez/redis/archive/2.8.21.tar.gz, http://download.redis.io/redis-stable, http://redis.io/community, http://try.redis-db.com
WARN  - DebugTagger                - Content-Type=text/html

Note the two fields:

document.reference=http://redis.io/download
collector.referenced-urls=http://redis.io/commands, ..., http://try.redis-db.com

When I query Elasticsearch I find the document:

            "_type": "webdoc",
            "_id": "http://redis.io/download",
            "_score": 1.1190807,
            "_source": {
               "Status": "200 OK",
               "collector.content-type": "text/html",
               "Keep-Alive": "timeout=15, max=100",
               "X-Parsed-By": "org.apache.tika.parser.html.HtmlParser",
               "document.contentFamily": "html",
               "Content-Location": "http://redis.io/download",
               "Connection": "Keep-Alive",
               "title": "Redis",
               "Date": "Wed, 17 Jun 2015 13:38:58 GMT",
               "content": "...",
               "Via": "1.1 redis.io",
               "document.reference": "http://redis.io/download",
               "viewport": "width=device-width, minimum-scale=1.0, maximum-scale=1.0",
               "dc:title": "Redis",
               "document.contentType": "text/html",
               "Content-Encoding": "UTF-8",
               "Vary": "Accept-Encoding",
               "collector.depth": "1",
               "collector.referenced-urls": "http://try.redis-db.com",
               "Content-Type": "text/html; charset=UTF-8"
            }

Notice "collector.referenced-urls": "http://try.redis-db.com" . Only the last element of the crawler field is stored. I think the Elasticsearch committer dropped all but the last element of the collector.referenced-urls from the crawler.

On a side note, Content-Location is the same as document.reference; they mean the same thing. The same goes for collector.content-type, document.contentType, and Content-Type.

Thanks

essiembre commented 9 years ago

@pascaldimassimo, as you are more familiar with Elasticsearch, can you confirm whether a field needs to be defined as multi-value in order to accept multiple values?

@yvesnyc about the fields that have the same value, that's nothing unusual. All fields starting with collector.* or document.* are created and populated by the Collector (or Importer). Everything else is stored as it is found. In your example, Content-Location and Content-Type come from the HTTP response headers received when fetching the document. Even if those were ever missing, the collector and importer ones should always be there. For instance, you would not have HTTP header fields if you were using the Filesystem Collector.

The collector/importer adds all the metadata it can find, but that's usually way too much for what you need. Normally, you probably want to restrict the fields to only what you need, using either KeepOnlyTagger or DeleteTagger in your <importer> section (most often as a post-parse handler), as sketched below.
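For example, keeping just a few fields could look like this (a sketch based on the 2.x configuration format; adjust the comma-separated fields list to what you actually need):

  <importer>
      <postParseHandlers>
          <!-- Keeps only the listed fields; every other field is dropped. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
              fields="document.reference, title, collector.referenced-urls" />
      </postParseHandlers>
  </importer>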

pascaldimassimo commented 9 years ago

The elasticsearch-committer was not handling array values properly. I've committed a fix to the develop branch.

essiembre commented 9 years ago

I made a 2.0.2 snapshot release of the Elasticsearch Committer with the fix. If you have additional Elasticsearch issues, please report them here.

yvesnyc commented 9 years ago

@essiembre @pascaldimassimo Thanks. I appreciate it.