Norconex / committer-core

Norconex Committer is a Java library and command-line application used to route content to local or remote target repositories, such as a search engine index.
http://www.norconex.com/collectors/committer-core
Apache License 2.0

Extracting fields from metadata #5

Closed AntonioAmore closed 10 years ago

AntonioAmore commented 10 years ago

In the committer's commitBatch() method, I am trying to get the page's content so I can write it to a database.

public class CustomCommitter extends AbstractMappedCommitter {
...
    @Override
    protected void commitBatch(List<ICommitOperation> list) {
...
        metadata = ((IAddOperation) iCommitOperation).getMetadata();
...
        String content = metadata.getString(this.getContentSourceField());
...

I get a NullPointerException at the last line of the listing. Have I misunderstood how the method is meant to be used?

java.lang.NullPointerException
    at java.util.TreeMap.getEntry(TreeMap.java:347)
    at java.util.TreeMap.get(TreeMap.java:278)
    at com.norconex.commons.lang.map.Properties.get(Properties.java:1222)
    at com.norconex.commons.lang.map.Properties.getString(Properties.java:498)
    at com.norconex.committer.mysql.CustomCommitter.commitBatch(CustomCommitter.java:xxx)

essiembre commented 10 years ago

Have you configured beforehand which metadata field holds your content? If you haven't, the document content stream is used as the source of the content. In any case, the mapping is done for you, and you have to rely on target fields, not source fields. The source fields are deleted after the mapping is performed (unless you set the flag to keep the source fields).

I recommend you read the Javadoc for AbstractMappedCommitter for more details: http://www.norconex.com/product/committer/apidocs/com/norconex/committer/AbstractMappedCommitter.html

Ignore the references to IDOL. Those have to be corrected.
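
As a side note on the stack trace above: a `java.util.TreeMap` with natural ordering rejects a null key, so the exception is consistent with a null field name reaching the lookup, which is what one would expect if no content field has been configured. The following is a minimal sketch of that behaviour using a plain `TreeMap` as a stand-in; it does not use the Norconex `Properties` class, and `NullKeyDemo` and `lookupThrowsNpe` are hypothetical names chosen for illustration.

```java
import java.util.TreeMap;

// Stand-alone illustration only: a plain TreeMap stands in for the
// TreeMap-backed metadata seen in the stack trace (an assumption).
final class NullKeyDemo {

    /** Returns true if looking up the given key throws a NullPointerException. */
    static boolean lookupThrowsNpe(TreeMap<String, String> map, String key) {
        try {
            map.get(key);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        TreeMap<String, String> metadata = new TreeMap<>();
        metadata.put("title", "Example page");

        // A key that is merely absent yields null, with no exception.
        System.out.println(lookupThrowsNpe(metadata, "missing"));
        // A null key is rejected by TreeMap when no comparator is set.
        System.out.println(lookupThrowsNpe(metadata, null));
    }
}
```

A null check on the field name before the lookup would avoid the exception, although it would not by itself make the content available.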

AntonioAmore commented 10 years ago

I haven't configured any field mappings; I kept the defaults. I did read the docs before asking the question, but unfortunately I can't work out how to use it.

I tried String content = metadata.getString(this.getContentTargetField()); to get the page's content, but received the same error.

AntonioAmore commented 10 years ago

I just want to get the crawled page content into a variable in any (or most) configuration/mapping cases. The idea was to write the committer as generically as possible and contribute it to the community. It seems the task is too difficult for me right now. Could you help me with this line of code?

essiembre commented 10 years ago

If you have not defined a target field for your content, the content won't be mapped to a metadata field, which explains why you do not get any content back with your line of code.

In that case, you can obtain the content this way:

IAddOperation operation = // your operation 
InputStream is = operation.getContentStream();
// read the input stream
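
The "read the input stream" step could look roughly like the sketch below. It is a stand-alone example, not code from the committer library: `StreamContent` and `readAll` are hypothetical names, UTF-8 is an assumed encoding, and a `ByteArrayInputStream` stands in for the stream returned by `getContentStream()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class StreamContent {

    /**
     * Reads the whole stream into a String, decoding it as UTF-8
     * (the encoding is an assumption of this sketch).
     * I/O failures are rethrown unchecked so the helper can be called
     * from a method that does not declare IOException.
     */
    static String readAll(InputStream is) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        try {
            while ((read = is.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the stream obtained from the add operation.
        InputStream is = new ByteArrayInputStream(
                "<html>crawled page</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(is));
    }
}
```

The helper does not close the stream; whether the caller should close it depends on who owns the stream in the real committer.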

The reason you do not see code like this in existing committer implementations, such as Solr, is that they provide default target fields, so a target field is always specified and content is always mapped to a field automatically. In your case, you can also enforce a default, or check in your code whether the target field has been specified to decide whether to read the content from the stream or from the metadata field.
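
The second option, checking whether a target field has been specified, might be sketched as follows. This is a stand-alone illustration with stand-in types: a `Map<String, String>` replaces the metadata object, and `ContentResolver` and `resolve` are hypothetical names rather than part of the committer API. It assumes UTF-8 content and Java 9 or later for `InputStream.readAllBytes()`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class ContentResolver {

    /**
     * Returns the content from the metadata field when a target field
     * name has been specified, otherwise from the content stream.
     */
    static String resolve(Map<String, String> metadata,
                          String targetField,
                          InputStream contentStream) {
        if (targetField != null && !targetField.isEmpty()) {
            // A target field is set, so the content is expected there.
            return metadata.get(targetField);
        }
        try {
            // No target field: fall back to the raw stream (UTF-8 assumed).
            return new String(contentStream.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("content", "text mapped to a field");
        InputStream stream = new ByteArrayInputStream(
                "text from the stream".getBytes(StandardCharsets.UTF_8));

        System.out.println(resolve(metadata, "content", stream));
        System.out.println(resolve(metadata, null, stream));
    }
}
```

When the target field is set but holds no value, this sketch returns null; a real committer might prefer to log that or to fall back to the stream as well.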

AntonioAmore commented 10 years ago

Setting default field names, as done in SolrCommitter, works for me. Thanks a lot!

essiembre commented 10 years ago

Glad to know!