Norconex / committer-elasticsearch

Implementation of Norconex Committer for Elasticsearch.
https://opensource.norconex.com/committers/elasticsearch/
Apache License 2.0

Feature Request: Nested Fields #16

Closed jmrichardson closed 7 years ago

jmrichardson commented 7 years ago

Hello,

I would like to be able to create a nested field in the documents I ingest into Elasticsearch. (Reference)

Rather than doing this programmatically, one potential idea would be to let the user modify the default mapping by editing a JSON configuration file in a text editor (this is how it was done in another crawler). Essentially, the default mapping was provided in a JSON file, which I was able to edit so that the field was included when the index was created. The committer would just need to validate the file's integrity or send it as is to ES.

Thank you for the hard work on this project. John

essiembre commented 7 years ago

Copied from https://github.com/Norconex/collector-filesystem/issues/15:

As a temporary fix (and potential idea to add to committer):

First, delete the index if it already exists. Then create the index with just the nested field:

PUT wmsearch
{
  "mappings": {
    "doc": {
      "properties": {
        "scope" : {
          "type" : "nested",
          "properties" : {
            "level" : { 
              "type" : "integer"
            },
            "ancestors" : { 
              "type" : "keyword",
              "index" : "true"
            },
            "value" : { 
              "type" : "keyword",
              "index" : "true"
            },
            "order" : {
              "type" : "integer"
            }    
          }
        }
      }
    }
  }
}
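
The two steps above (delete the index, then recreate it with only the nested field) could be scripted roughly as follows. This is a hedged sketch, not part of the committer: the helper names `buildScopeMapping` and `recreateIndex` are hypothetical, the base URL is whatever your cluster uses, and the global `fetch` assumes Node.js 18 or later. The mapping mirrors the request above, minus `"index": "true"`, which is the default for `keyword` fields.

```javascript
// Hypothetical helper: builds the mapping body from the PUT request above.
function buildScopeMapping() {
  return {
    mappings: {
      doc: {
        properties: {
          scope: {
            type: "nested",
            properties: {
              level: { type: "integer" },
              ancestors: { type: "keyword" },
              value: { type: "keyword" },
              order: { type: "integer" }
            }
          }
        }
      }
    }
  };
}

// Hypothetical helper: deletes the index if present, then recreates it.
// baseUrl and indexName are assumptions; adjust them to your cluster.
async function recreateIndex(baseUrl, indexName) {
  const url = `${baseUrl}/${indexName}`;
  // A 404 on DELETE only means the index did not exist yet.
  const del = await fetch(url, { method: "DELETE" });
  if (!del.ok && del.status !== 404) {
    throw new Error(`DELETE failed with status ${del.status}`);
  }
  const put = await fetch(url, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildScopeMapping())
  });
  if (!put.ok) {
    throw new Error(`PUT failed with status ${put.status}`);
  }
  return put.json();
}
```

A call such as `recreateIndex("http://localhost:9200", "wmsearch")` would run both steps; the collector can then be started against the fresh index.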

Then start the collector. The end result is the fields created by the collector and the nested field (scope) persists:

{
  "wmsearch": {
    "mappings": {
      "doc": {
        "properties": {
          "Application-Name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Application-Version": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Author": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Character Count": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Character-Count-With-Spaces": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Company": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Content-Encoding": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Content-Length": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Content-Type": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Creation-Date": {
            "type": "date"
          },
          "Edit-Time": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Last-Author": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Last-Modified": {
            "type": "date"
          },
          "Last-Printed": {
            "type": "date"
          },
          "Last-Save-Date": {
            "type": "date"
          },
          "Line-Count": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Page-Count": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Paragraph-Count": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Revision-Number": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Template": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "Word-Count": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "X-Parsed-By": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "collector": {
            "properties": {
              "content-type": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "filesize": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "is-crawl-new": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "lastmodified": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "content": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "cp:revision": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "crawl_date": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "creator": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "custom:_AdHocReviewCycleID": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "custom:_AuthorEmail": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "custom:_AuthorEmailDisplayName": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "custom:_EmailSubject": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "date": {
            "type": "date"
          },
          "dc:creator": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "dc:title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "dcterms:created": {
            "type": "date"
          },
          "dcterms:modified": {
            "type": "date"
          },
          "document": {
            "properties": {
              "contentFamily": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "contentType": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "filename": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "generatedTitle": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "reference": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "extended-properties:AppVersion": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "extended-properties:Application": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "extended-properties:Company": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "extended-properties:Template": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "meta:author": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "meta:character-count": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "meta:character-count-with-spaces": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "meta:creation-date": {
            "type": "date"
          },
          "meta:last-author": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "meta:line-count": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "meta:page-count": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "meta:paragraph-count": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "meta:print-date": {
            "type": "date"
          },
          "meta:save-date": {
            "type": "date"
          },
          "meta:word-count": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "modified": {
            "type": "date"
          },
          "scope": {
            "type": "nested",
            "properties": {
              "ancestors": {
                "type": "keyword"
              },
              "level": {
                "type": "integer"
              },
              "order": {
                "type": "integer"
              },
              "value": {
                "type": "keyword"
              }
            }
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "xmpTPg:NPages": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

essiembre commented 7 years ago

The problem isn't so much to define the original mapping in Elasticsearch (which could already be done separately from the Committer).

The issue is how to efficiently map flat multi-value fields to nested fields/objects. Can you give me an example of a document you are trying to submit?

Assume this original structure:

<scope>
  <ancestors>ancestors1</ancestors>
  <order>11</order>
  <value>value1</value>
</scope>
<scope>
  <ancestors>ancestors2</ancestors>
  <level>2</level>
  <order>22</order>
  <value>value2</value>
</scope>
<scope>
  <ancestors>ancestors3</ancestors>
  <level>3</level>
  <value>value3</value>
</scope>

Because this document needs to be "flattened" before being sent to the committer, it will look something like this (conceptually):

scope.ancestors = [ancestors1, ancestors2, ancestors3]
scope.level = [2, 3]
scope.order = [11, 22]
scope.value = [value1, value2, value3]

Recent Elasticsearch versions support the "dot" in field names, unless the field is already defined as a nested object (which is what you want). So we'll have to specify somewhere what these fields are (using your ES JSON samples or otherwise). But how can we tell what the original structure was? We can't assume the arrays will always hold the same number of values, so we cannot rely on a value's position in the array to reconstruct it.
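
To make the problem concrete, here is a small JavaScript sketch of the flattening step. The `flatten` function is a hypothetical illustration, not the collector's actual code. It shows that the first entry of `scope.level` belongs to the second `scope` element, so array positions alone cannot be used to rebuild the original objects.

```javascript
// Flatten an array of nested objects into multi-valued "dot" fields,
// mirroring the conceptual example above.
function flatten(name, items) {
  const flat = {};
  for (const item of items) {
    for (const [key, value] of Object.entries(item)) {
      const field = `${name}.${key}`;
      if (!flat[field]) flat[field] = [];
      flat[field].push(value);
    }
  }
  return flat;
}

const scopes = [
  { ancestors: "ancestors1", order: 11, value: "value1" },
  { ancestors: "ancestors2", level: 2, order: 22, value: "value2" },
  { ancestors: "ancestors3", level: 3, value: "value3" }
];

const flat = flatten("scope", scopes);
console.log(flat["scope.level"]); // [ 2, 3 ]
console.log(flat["scope.order"]); // [ 11, 22 ]
```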

One option is to impose a document field naming convention for nested fields and leave it to implementors to make sure they tweak their nested fields names accordingly before a document hits the committer (e.g. using taggers). I can think of two approaches:

Suggestion 1: Expect committed documents to have "indexed" field names:

scope[0].ancestors = ancestors1
scope[0].order = 11
scope[0].value = value1
scope[1].ancestors = ancestors2
scope[1].level = 2
scope[1].order = 22
scope[1].value = value2
scope[2].ancestors = ancestors3
scope[2].level = 3
scope[2].value = value3
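
As a rough idea of what Suggestion 1 would require on the committer side, the sketch below rebuilds nested objects from indexed field names. The function name `unflattenIndexed` and the exact `name[index].property` syntax are assumptions made for illustration only.

```javascript
// Rebuild nested objects from "indexed" field names such as scope[1].level.
// Hypothetical helper for illustration; not the committer's code.
function unflattenIndexed(fields, name) {
  const pattern = /^([^\[\].]+)\[(\d+)\]\.(.+)$/;
  const items = [];
  for (const [key, value] of Object.entries(fields)) {
    const match = pattern.exec(key);
    if (!match || match[1] !== name) continue; // not one of our nested fields
    const index = Number(match[2]);
    if (!items[index]) items[index] = {};
    items[index][match[3]] = value;
  }
  // Drop any holes left by indexes that were never used.
  return items.filter((item) => item !== undefined);
}

const fields = {
  "scope[0].ancestors": "ancestors1",
  "scope[0].order": 11,
  "scope[1].ancestors": "ancestors2",
  "scope[1].level": 2,
  "title": "unrelated field"
};

const nested = unflattenIndexed(fields, "scope");
console.log(JSON.stringify(nested));
// [{"ancestors":"ancestors1","order":11},{"ancestors":"ancestors2","level":2}]
```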

Suggestion 2: Expect committed documents to use JSON for a nested field value:

scope = [
    {
        "ancestors":"ancestors1",
        "order":11,
        "value":"value1"
    },
    {
        "ancestors":"ancestors2",
        "level": 2,
        "order":22,
        "value":"value2"
    },
    {
        "ancestors":"ancestors3",
        "level": 3,
        "value":"value3"
    }
]

The second approach could work with additional scenarios as well, but it would not always be easy to create that structure via your XML configuration (even though ScriptTagger or other taggers can help).

What do you think? Am I overthinking it? Can you think of a better/easier approach?

jmrichardson commented 7 years ago

IMHO, and in my use case, the nested fields are derived from the document content/metadata. IOW, I don't have to modify the original documents to acquire the field values. In my case, I am using Searchkit in front of ES, which allows me to create a hierarchical menu from values stored and indexed in ES. See here for details.

I want to be able to create a nested field mapping and dynamically create the values based on the file path (document.reference). I agree that if you flatten the fields you will lose the relationships. Now that I have a temporary solution for creating the mappings, I am hoping I can use the importer's ScriptTagger to define the field values. I have not tested this yet, but I am hoping it doesn't first try to verify that the nested field exists or try to create it (as it was created prior to starting the crawler), and that it will allow me to assign values to the nested field (again, without creating it). I haven't yet written the JavaScript that takes the path and creates the nested field values Searchkit is looking for.

My initial thought is that you first allow the nested field to be defined and populated via the importer without modifying the original documents (I have over 6M documents that will need to be crawled and I don't want to have to change them to get them to import correctly). You may want to add this functionality after first getting the tagger to work with a nested field (this may already work). If it does already work, it will solve most use cases, since I would imagine nested data for documents is usually derived from the content/metadata (you would only need to allow the proper mappings to be created in the committer). On the other hand, I can think of another potential use case: say, having the original document and then another document (perhaps with the same name and a different extension) that contains the nested data (either in raw or JSON format, as in suggestion 2). But I am not sure the value would be worth the implementation effort.

I will keep you posted this evening on how it goes. Please let me know if the above doesn't make sense or is not advisable. Thanks.

jmrichardson commented 7 years ago

Unfortunately, I am unable to assign values to the nested field using ScriptTagger. I have tried multiple ways, such as:

        <tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
          <script><![CDATA[
      metadata.setString('scope.level','2');
          ]]></script>
        </tagger>

I am not sure if it's because it believes it has to create a field that already exists in the index (created manually), or if it is because of the "dot" notation you mentioned earlier. Here is a snippet of the log:

INFO  [AbstractCrawler] WM Search: Crawler finishing: committing documents.
INFO  [AbstractFileQueueCommitter] Committing 14 files
INFO  [ElasticsearchCommitter] Sending 10 commit operations to Elasticsearch.
INFO  [AbstractCrawler] WM Search: Crawler executed in 2 seconds.
FATAL [JobSuite] Fatal error occured in job: WM Search
INFO  [JobSuite] Running WM Search: END (Thu Sep 21 16:04:28 EDT 2017)
FATAL [JobSuite] Job suite execution failed: WM Search
java.lang.NoSuchMethodError: org.json.JSONArray.iterator()Ljava/util/Iterator;
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.extractResponseErrors(ElasticsearchCommitter.java:493)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.handleResponse(ElasticsearchCommitter.java:469)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:442)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
        at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:387)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:273)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:227)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:183)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.fs.FilesystemCollector.main(FilesystemCollector.java:76)

I am really hoping we can get this to work. Please let me know what you need from me. I am not a Java developer but am happy to help in any way I can. Thanks again.

jmrichardson commented 7 years ago

After re-reading and better understanding your original suggestions (my apologies for not fully understanding the architecture), I believe what you are suggesting in suggestion 2 is that the user creates the appropriate JSON string to add the values for nested fields (perhaps with ScriptTagger). This is actually a great way to solve this problem. In fact, I was using "fscrawler" prior to learning of this product, and that is similar to what I had to do: I had to intercept the JSON string (between ES and fscrawler) and modify it to include the field data. In this case, however, I could just use ScriptTagger to create the JSON string, which could be returned for integration into the overall JSON string.

essiembre commented 7 years ago

A new snapshot version of the Elasticsearch Committer has been created with a solution.

You can now add an optional <jsonFieldsPattern> to your committer config. This takes a regular expression identifying one or more fields containing a JSON object as opposed to a regular String. In your case:

    ...
    <jsonFieldsPattern>scope</jsonFieldsPattern>
    ...

The challenge then becomes creating and storing a JSON structure in a field called "scope". Here is an example of what the "scope" field value should look like (assuming multiple values in this example):

[
    {
        "ancestors":"ancestors1",
        "order":11,
        "value":"value1"
    },
    {
        "ancestors":"ancestors2",
        "level": 2,
        "order":22,
        "value":"value2"
    },
    {
        "ancestors":"ancestors3",
        "level": 3,
        "value":"value3"
    }
  ]
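
Conceptually, a pattern-based option like this could behave as in the sketch below: fields whose names match the regular expression are parsed as JSON, and everything else is sent as a plain string. This is only an illustration of the idea, not the committer's actual implementation. The function name `buildDocument` is hypothetical, and whether the real committer matches the whole field name (as assumed here) is not confirmed in this thread.

```javascript
// Hypothetical sketch: fields whose names fully match the pattern are
// parsed as JSON; all other fields are kept as plain strings.
function buildDocument(fields, jsonFieldsPattern) {
  const regex = new RegExp(`^(?:${jsonFieldsPattern})$`);
  const doc = {};
  for (const [name, value] of Object.entries(fields)) {
    doc[name] = regex.test(name) ? JSON.parse(value) : value;
  }
  return doc;
}

const doc = buildDocument(
  {
    title: "Example",
    scope: '[{"ancestors":"ancestors1","order":11,"value":"value1"}]'
  },
  "scope"
);
console.log(Array.isArray(doc.scope)); // true
console.log(doc.scope[0].order);       // 11
console.log(typeof doc.title);         // string
```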

It worked in my own testing. Please give it a try and confirm.

jmrichardson commented 7 years ago

Hooray! Wonderful! Thank you so much :) I will be testing this out as well as the other issue ASAP.

I assume that the field mappings will be created/managed automatically? Or will I need to create the mappings?

Will keep you posted. Thanks again!

essiembre commented 7 years ago

You can try, but I think you may have to create the mappings in ES beforehand.

jmrichardson commented 7 years ago

Ok, sounds good

essiembre commented 7 years ago

If you want to be able to define the mappings in the XML configuration somehow, you can make this your next feature request. :-)

jmrichardson commented 7 years ago

Thank you, I was able to generate the JSON string using ScriptTagger and assign it to the nested "scope" field. Works great :)
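
For readers wanting to do something similar, here is a hedged sketch of how such a JSON string might be derived from a file reference. The function name `buildScope`, the `rootPrefix` parameter, and the file name in the example are all made up for illustration. The code is written in plain ES5 style so it can be tried in Node.js; it may need adapting to the script engine used by ScriptTagger.

```javascript
// Hypothetical helper: derives hierarchical "scope" entries from a file
// reference. rootPrefix (the leading part of the path to ignore) is an
// assumption; adjust it to your own crawl root.
function buildScope(reference, rootPrefix) {
  var relative = reference.indexOf(rootPrefix) === 0
    ? reference.substring(rootPrefix.length)
    : reference;
  // Keep folder names only: drop empty segments and the trailing file name.
  var segments = relative.split("/").filter(function (s) {
    return s.length > 0;
  });
  var folders = segments.slice(0, -1);
  return folders.map(function (value, i) {
    return { level: i + 1, value: value, ancestors: folders.slice(0, i) };
  });
}

var scope = buildScope(
  "file:///data/Clients/ACCOUNTING/2015 Reimbursement Forms/Form.pdf",
  "file:///data/"
);
var scopeJson = JSON.stringify(scope);
console.log(scopeJson);
// Inside a ScriptTagger script, the string could then be stored with:
//   metadata.setString('scope', scopeJson);
```

Each folder becomes one entry whose `ancestors` array lists the folders above it, which is the shape a hierarchical menu needs.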

jmrichardson commented 7 years ago

Hi,

It appears that this parameter no longer works.

<jsonFieldsPattern>scope</jsonFieldsPattern>

I am getting these errors:

{
    "_index": "wmsearch",
    "_type": "Documents",
    "_id": "file:///xxxxxxx",
    "status": 400,
    "error": {
        "type": "illegal_argument_exception",
        "reason": "object mapping [scope] can't be changed from nested to non-nested"
    }
},

In looking at the commit directory for a document meta file (xxxxxx-add.meta), it looks like the scope field is being created as nested:

collector.filesize = 191987
crawl_date = 2017-11-08 14:33
document.generatedTitle = Type of Event
document.contentFamily = pdf
collector.lastmodified = 1420215081000
document.filename = Lexington Reimbursement Form 2015
title =
document.reference = file:///data/Clients/ACCOUNTING/2015 Reimbursement Forms/Lexington Reimbursement Form 2015.pdf
scope = [{\"level\":1, \"value\":\"Clients\", \"ancestors\":[]},{\"level\":2, \"value\":\"ACCOUNTING\", \"ancestors\":[\"Clients\"]},{\"level\":3, \"value\":\"2015 Reimbursement Forms\", \"ancestors\":[\"Clients\",\"ACCOUNTING\"]}]
snip =  xxxxx....

Here is my committer section of the config:

      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <nodes>http://localhost:9200</nodes>
        <indexName>wmsearch</indexName>
        <queueDir>/home/es/elastic/ingest/norconex/workdir-clients/commit</queueDir>
        <jsonFieldsPattern>scope</jsonFieldsPattern>
        <connectionTimeout>5 minutes</connectionTimeout>
        <socketTimeout>5 minutes</socketTimeout>
        <maxRetryTimeout>5 minutes</maxRetryTimeout>
        <typeName>Documents</typeName>
        <queueSize>1000</queueSize>
        <commitBatchSize>50</commitBatchSize>
        <maxRetries>1</maxRetries>
      </committer>

I haven't re-indexed in a while, and I am not sure whether I have done something to cause this or whether the snapshot was perhaps built without this feature. I have also verified that my ES mapping for the scope field is of type nested. I am not sure what to try next. Thanks for your help.

jmrichardson commented 7 years ago

I just upgraded to latest snapshot and it works now. :)