Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Alter field names and labels, use UUID as document identifier #343

Closed danizen closed 7 years ago

danizen commented 7 years ago

So, the Solr and Elasticsearch committers share some common idioms, and in both cases the crawled URL is used by default as the document ID. I want to use a generated UUID for the document ID while still keeping the URL, ideally indexed as a field named url rather than, I guess, document.reference.

I also want to add some additional fields using the text extracted - e.g. run a summarizer, submit the text to MeSH On Demand using a backend API, and come up with a checksum based on the document's textual content (excluding boiler plate removed by Tika/Boilerpipe).

Can you help me get started? Is this just a pre-import processor? What is confusing me here is that ImporterMetadata has some pre-named fields but is also a Map, so I'm not sure.

essiembre commented 7 years ago

ImporterMetadata is a Map because the fields attached to each document are not constrained (they may vary per document and can be whatever you like).

There are a couple ways to do what you want. The following is what I suggest.

For the Elasticsearch committer, you can override the default to choose which field is used as the ID. The <sourceReferenceField> tag does it, like this:

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    ... 
    <sourceReferenceField keep="false">YourGeneratedUUIDFieldName</sourceReferenceField>
</committer>

Keep in mind that if you are using a UUID field created each time a document is crawled or recrawled, the crawler won't reconcile modifications/deletions. For this, you will need an id that remains the same for each doc (the URL -- document.reference -- is usually best for web pages).
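If you still want UUID-shaped IDs but need recrawls to reconcile, one option is to derive the UUID deterministically from the URL instead of generating a random one each crawl. This is a plain JDK sketch (not Norconex code; the class and method names here are hypothetical) using name-based UUIDs, so the same URL always maps to the same ID:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper, not part of Norconex: derives a stable,
// name-based (version 3, MD5-backed) UUID from a document's URL,
// so recrawling the same page yields the same identifier.
public class DeterministicId {

    public static String uuidFor(String url) {
        return UUID.nameUUIDFromBytes(
                url.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        String first = uuidFor("https://example.com/page");
        String again = uuidFor("https://example.com/page");
        // Same URL -> same UUID, so updates/deletes still match.
        System.out.println(first + " stable=" + first.equals(again));
    }
}
```

A field populated this way could then be referenced from <sourceReferenceField>, since it stays constant per URL across crawls.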

To keep the URL in a url field, you can rename or copy the document.reference field as a post-parse handler in the Importer section using RenameTagger or CopyTagger. For example:

  <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
      <rename fromField="document.reference" toField="url" overwrite="true" />
  </tagger>
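If you would rather keep document.reference in the index as well, CopyTagger takes a similar shape (the attribute names here are assumed to mirror RenameTagger's):

  <tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
      <copy fromField="document.reference" toField="url" overwrite="true" />
  </tagger>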

Adding additional fields is no problem. In your case, you can create your own IDocumentTagger implementation to do whatever you want and add metadata fields. If you already have an external process modifying your documents, you can consider using [ExternalTransformer](https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/transformer/impl/ExternalTransformer.html).

Finally, for the checksum on content, keep in mind there is already one generated for each document. The default implementation is [MD5DocumentChecksummer](https://www.norconex.com/collectors/collector-core/latest/apidocs/com/norconex/collector/core/checksum/impl/MD5DocumentChecksummer.html).
By default the checksum is not sent to committers, but you can have it sent like this:

  <documentChecksummer keep="true" targetField="MyChecksumField"/>

It is done after the Importer stage, so after all other transformations have taken place (i.e., on the "clean" content). You can tell it to use specific fields for checksumming as well.
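To illustrate what an MD5-based content checksum boils down to (a plain JDK sketch, not the actual MD5DocumentChecksummer code), the cleaned text is hashed into a hex digest that only changes when the content changes:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of an MD5 content checksum: hash the post-Importer
// ("clean") text and render the digest as 32 hex characters.
public class ContentChecksum {

    public static String md5Hex(String content) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
        // Zero-padded lowercase hex, 128 bits -> 32 chars.
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Hex("clean extracted text"));
    }
}
```

Because boilerplate is stripped before the Importer stage finishes, two crawls of a page whose visible text is unchanged produce the same digest, which is what lets the crawler skip re-committing unmodified documents.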

danizen commented 7 years ago

Thank you, this works. I am very happy. I have official permission to use this and Elasticsearch rather than IBM Watson Explorer, although I have to document what would be needed to do the same crawl with IBM Watson Explorer. Old habits die hard...

essiembre commented 7 years ago

I hear you! I would actually be interested to know about your conclusions. :-) FYI, I sneaked in a UUIDTagger before making the official release in case you still want a UUID.

danizen commented 7 years ago

Noticed that, using it.