Norconex / committer-cloudsearch

Amazon CloudSearch implementation of Norconex Committer.
https://opensource.norconex.com/committers/cloudsearch/
Apache License 2.0

Newbie question: Index field configuration for CloudSearch? #1

Closed pcolmer closed 7 years ago

pcolmer commented 8 years ago

I'm intending to use the HTTP collector and have that configured to use the CloudSearch committer. What should I do in order to configure the index fields for CloudSearch?

I can see that CloudSearch has some suggested index field configurations for HTML, PDF, etc., but the suggested fields differ between document types. So, if I'm crawling a web site and come across a PDF, what does the CloudSearch committer do?

essiembre commented 8 years ago

Thanks for your interest in the HTTP Collector.
Committers, for the most part, simply send document content and metadata fields to your repository (CloudSearch in your case) as they receive them. A committer is the bridge between the HTTP Collector (the crawler) and your search engine: it sends the data over, but it does not itself do any crawling or extraction of information.

Several fields are naturally added to each document being crawled, such as HTTP header fields, document properties, and any other fields you add yourself by manipulating documents with the Importer module (a configurable part of the HTTP Collector). By default, those are all sent to CloudSearch, which is probably far too many fields. You can easily keep only the fields you want indexed by using a KeepOnlyTagger as a post-parse handler in the importer section of your collector configuration.

You can also rename the remaining fields to match the ones you have defined in CloudSearch with the RenameTagger, as in the sketch below.
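
Here is a minimal sketch of what such a post-parse configuration might look like (the fromField/toField values and the kept field names are only placeholders for whatever you define in CloudSearch):

<importer>
  <postParseHandlers>
    <!-- Rename extracted fields to the field names defined in CloudSearch. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
      <rename fromField="document.contentType" toField="content_type" overwrite="true" />
    </tagger>
    <!-- Keep only the fields you want indexed; all others are dropped. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fields>content, content_type, title</fields>
    </tagger>
  </postParseHandlers>
</importer>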

If you want an idea of what fields are found during crawling before you define them all in CloudSearch, I encourage you to first use the FileSystem Committer, as described in the configuration examples shipping with the HTTP Collector. You can then read the generated output files that would normally be sent to CloudSearch. Alternatively, you can have a look at the DebugTagger to have field information printed out for easy troubleshooting.
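
A FileSystem Committer configuration can be as simple as this sketch (the class name and option are from memory, so double-check them against the Committer Core documentation for your version):

<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
  <!-- Directory where crawled content and fields are written as files for inspection. -->
  <directory>/path/to/committed-files</directory>
</committer>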

Any clearer?

pcolmer commented 8 years ago

Thanks for your reply.

I've taken your advice and started testing with the FileSystem Committer, although I'm finding the directory structure (date and time) makes it trickier to figure out what is actually being committed. I can't find any documentation for that committer to see whether that is adjustable.

With the CloudSearch committer, does the committer create the fields at the remote end if they don't already exist, or do they just get "dropped"? I'm asking so I can understand whether or not I have to create index fields in CloudSearch before doing any crawling, so that the data is accurately captured.

Thank you for the great tools.

essiembre commented 8 years ago

Whatever files are present when using the FileSystem Committer represent what would have been sent using any other committer.

As for CloudSearch and its handling of fields, it depends on how you have configured it. You may be able to have CloudSearch create new fields "on the fly" (something for you to investigate), but in many cases you will want to pre-define your fields in CloudSearch. Not only does this let you control which fields exist, it also lets you optimize each field the way it should be (e.g. date fields are of date type, numeric fields are of numeric type, etc.).

So my recommendation is to create the fields you need for your project, and configure the HTTP Collector so that the fields it extracts match the ones defined in CloudSearch.
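
If it helps, index fields can also be pre-defined with the AWS CLI rather than the console. Something like this should work (untested here; "mydomain" and the field names are placeholders, so please verify against the AWS CLI documentation):

# Define a text field and a date field on the CloudSearch domain, then trigger re-indexing.
aws cloudsearch define-index-field --domain-name mydomain --name title --type text
aws cloudsearch define-index-field --domain-name mydomain --name published_on --type date
aws cloudsearch index-documents --domain-name mydomain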

pcolmer commented 8 years ago

Hi

I've started experimenting with Amazon's suggested fields for HTML but I'm hitting a problem when running the code after configuring it for the CloudSearch committer:

DEBUG [ConfigurationUtil] Could not instantiate object from configuration for "crawlerDefaults -> committer".
java.lang.UnsupportedOperationException: Target reference field is always "id" and cannot be changed.
    at com.norconex.committer.cloudsearch.CloudSearchCommitter.setTargetReferenceField(CloudSearchCommitter.java:214)
    at com.norconex.committer.core.AbstractMappedCommitter.loadFromXML(AbstractMappedCommitter.java:372)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:205)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:338)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:291)
    at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadFromXML(AbstractCrawlerConfig.java:348)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfig(CrawlerConfigLoader.java:123)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:72)
    at com.norconex.collector.core.AbstractCollectorConfig.loadFromXML(AbstractCollectorConfig.java:183)
    at com.norconex.collector.core.CollectorConfigLoader.loadCollectorConfig(CollectorConfigLoader.java:78)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:76)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)

I don't have much in the config for CloudSearch:

<committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
  <documentEndpoint>url.cloudsearch.amazonaws.com</documentEndpoint>
  <accessKey>something here</accessKey>
  <secretKey>something here</secretKey>
  <sourceReferenceField>document.reference</sourceReferenceField>
</committer>

I'm assuming that the UnsupportedOperationException is what is triggering the "Could not instantiate object" error, but I don't know what to do to stop the exception from happening.

Thanks.

essiembre commented 8 years ago

Congrats on finding the first bug in the CloudSearch committer! :-) I just made a snapshot release with a fix. You can extract the zip, simply replace committer-cloudsearch-1.0.0.jar with the new committer-cloudsearch-1.0.1-SNAPSHOT.jar, and let me know if that resolves your issue.

pcolmer commented 8 years ago

Hi

The snapshot has fixed the exception. However, I'm not seeing the committer being called in the logs. I've attached a snippet of the log and you can see the various activities, but CloudSearch isn't one of them :(

Thanks. Log.txt

essiembre commented 8 years ago

It is there: whenever you see DOCUMENT_COMMITTED_ADD, it means a document was sent to your committer for addition. So if the CloudSearch committer is the one you are using, it is being called. Are you not seeing documents added to CloudSearch?

pcolmer commented 8 years ago

Correct - I'm not seeing documents being added to CloudSearch. I suspect I've got my CloudSearch configuration wrong somewhere. I was hoping that ramping up the Norconex logging to DEBUG everywhere might help me figure out where, but even setting DOCUMENT_COMMITTED_ADD to DEBUG just confirms that the document is passed on to CloudSearch; it doesn't report any unsuccessful status, which is what I was hoping for.

So, as it stands, the Norconex software thinks everything is OK but CloudSearch is reporting 0 documents and there doesn't seem to be any CloudSearch log to dig a little deeper there.

I'll see if I can do some HTTP debugging.

essiembre commented 8 years ago

I checked the code, and if you have not modified the log4j.properties, the default log level for the committer should be INFO. As such, you should see logging statements in the log file, like this one:

"Sending 10 documents to AWS CloudSearch for addition/deletion."

Since you are not getting those in your logs, I suspect your committer config is not being picked up at all. Can you attach your full config?
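
If you want to dig further, you can also raise the log level for the committer classes in log4j.properties (a sketch assuming the stock log4j setup that ships with the collector):

# Increase logging detail for the CloudSearch committer classes.
log4j.logger.com.norconex.committer.cloudsearch=DEBUG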

essiembre commented 8 years ago

Another test you can do is to put a fake documentEndpoint or keys... it should definitely fail in that case. If it does not fail with errors, then your committer section is likely misconfigured.

essiembre commented 8 years ago

FYI, 1.0.1 (stable) was just released with the fix for the exception you reported.

pcolmer commented 8 years ago

Hi

I've attached the various files that go together to make the config.

Linaro files.zip

I've tried running it with the wrong keys and I'm not getting errors.

Thank you for your help.

Philip

essiembre commented 8 years ago

I tried to reproduce with your config and I can see the committer gets invoked properly. I am not sure why it does not in your case.

To confirm that, at a minimum, the CloudSearch committer is being recognized, you should see its version printed in the first few lines of your log file, like this:

INFO  [AbstractCollector] Version: "Collector" version is x.x.x.
INFO  [AbstractCollector] Version: "Collector Core" version is x.x.x.
INFO  [AbstractCollector] Version: "Importer" version is x.x.x.
INFO  [AbstractCollector] Version: "JEF" version is x.x.x.
INFO  [AbstractCollector] Version: "Committer Core" version is x.x.x.
INFO  [AbstractCollector] Version: "CloudSearchCommitter" version is x.x.x.

Can you please attach your log file in case there is something suspicious to be found?

pcolmer commented 8 years ago

Log file attached. Please note that I have replaced the document end point, access key and secret key. I have also double-checked that I'm running the latest version of the committer and associated JAR files but the version number for the committer is NOT being displayed.

Wiki_32_Crawler.zip

essiembre commented 8 years ago

I may have found the cause from your log. Did you send all of it? Because I see near the end:

INFO - Wiki Crawler: 2% completed (127 processed/5300 total)

It turns out the committer queues documents before sending them in batches. If you do not override the default, it will queue documents to be sent and only send them once there are 1000 in the queue. So I recommend you try again with a smaller queue size. You can add this to your committer config (replacing 10 with whatever value you prefer):

      <queueSize>10</queueSize>
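
For instance, placed in the committer config you posted earlier (endpoint and credentials are placeholders), it would look like:

<committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
  <documentEndpoint>url.cloudsearch.amazonaws.com</documentEndpoint>
  <accessKey>something here</accessKey>
  <secretKey>something here</secretKey>
  <!-- Send to CloudSearch every 10 queued documents instead of the default 1000. -->
  <queueSize>10</queueSize>
</committer>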

Let me know if that makes a difference.

pcolmer commented 8 years ago

Thank you :) That certainly makes a difference in the behaviour although, unfortunately, I still have zero documents ... but at least I'm seeing errors now and I'm getting closer. I am now in a position where I can start to work on the fields that I want the collector to pass over to the committer and, hopefully, I'll soon have something to search.

essiembre commented 8 years ago

Were you able to add documents to CloudSearch? Can we close this issue?

FYI, a new release of the CloudSearch Committer was just made. It was updated to use the latest AWS CloudSearch libraries. It now also ships with an install script that helps copy the JARs for you and resolve any conflicting versions.

pcolmer commented 8 years ago

My apologies - I was at a conference last week, I'm off sick from work at the moment, and I'm on holiday next week, so my productivity has gone down.

Can we please leave this open for now? I think I know what I need to do but it will take some time for me to get the various settings correct. Hopefully, though, I'll be able to share a more complete configuration for CloudSearch Committer that works with the AWS default index configuration.

Thanks.

pcolmer commented 8 years ago

Just for the record, and in case it helps anyone else, the default index fields for CloudSearch for HTTP "documents" are:

CloudSearch insists that field names are lowercase.

I'm trying to figure the best mapping from the crawler to the committer ...

pcolmer commented 8 years ago

BTW, love the new install script. Makes life a LOT easier. Thank you for doing that.

essiembre commented 8 years ago

Thanks for your feedback. The CloudSearch predefined configurations and their default fields are just there to help you get started, and by no means a standard you have to follow. You can easily ignore or remove them and create only the fields you really need for your project. But if you want to keep using the default CloudSearch fields, a combination of RenameTagger and KeepOnlyTagger should get you there.

pcolmer commented 8 years ago

Sorry about this but I'm still struggling. As you've suggested, I'm using RenameTagger and KeepOnlyTagger but AWS is complaining about a whole bunch of fields that the KeepOnlyTagger should have removed:

DEBUG [AmazonHttpClient] Received error response: com.amazonaws.services.cloudsearchdomain.model.DocumentServiceException: { ["Field name (collector.content-type) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 1; document_id https://wiki.linaro.org/FrontPage)","Field name (document.contentFamily) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 1; document_id https://wiki.linaro.org/FrontPage)","Field name (Server) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 1; document_id https://wiki.linaro.org/FrontPage)","Field name (Content-Location) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 1; document_id https://wiki.linaro.org/FrontPage)","Field name (Cf-Railgun) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 1; document_id https://wiki.linaro.org/FrontPage)","Field name (dc:title) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 1; document_id https://wiki.linaro.org/FrontPage)","Field name (collector.is-crawl-new) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 1; document_id https://wiki.linaro.org/FrontPage)","Field name (Content-Encoding) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 1; document_id https://wiki.linaro.org/FrontPage)","Field name (Content-Type-Hint) must match the regex [a-z0-9][a-z0-9_]{0,63}$ <snip>

The logging from KeepOnlyTagger seems to confirm that it is removing the tags I don't want:

DEBUG [KeepOnlyTagger] Removed metadata fields "collector.content-type,document.contentFamily,Server,CF-RAY,Connection,Last-Modified,Date,document.reference,CF-Cache-Status,Cache-Control,collector.is-crawl-new,Content-Disposition,Vary,collector.depth,Expires,Content-Length,Content-Type" from https://wiki.linaro.org/FrontPage?action=AttachFile&do=get&target=downloads.png

Here is the importer config:

      <importer>
        <preParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="document.contentEncoding" toField="content_encoding" overwrite="true" />
            <rename fromField="document.contentType" toField="content_type" overwrite="true" />
            <rename fromField="CONTENT" toField="content" overwrite="true" />
          </tagger>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>content_encoding, content_type, title, content</fields>
          </tagger>
          <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
              logContent="true" >
          </tagger>
        </preParseHandlers>
      </importer>

I've tried both preParseHandlers and postParseHandlers (post initially, then switching to pre when I got the error) but it doesn't seem to solve the problem.

pcolmer commented 8 years ago

@essiembre Any thoughts on the problem I'm having with the tags per my last comment?

Thanks.

essiembre commented 8 years ago

Sorry for the late reply. The likely cause is that you remove unwanted fields "before" you parse the document, and parsing a document is where many fields are discovered. I recommend you instead move your KeepOnlyTagger to be the last handler in a <postParseHandlers> section.

pcolmer commented 7 years ago

I had previously tried using postParseHandlers. So, here is the config now:

  <importer>
    <preParseHandlers>
      <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
        <rename fromField="document.contentEncoding" toField="content_encoding" overwrite="true" />
        <rename fromField="document.contentType" toField="content_type" overwrite="true" />
        <rename fromField="CONTENT" toField="content" overwrite="true" />
      </tagger>
    </preParseHandlers>
    <postParseHandlers>
      <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
        <fields>content_encoding, content_type, title, content</fields>
      </tagger>
      <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
          logContent="false" >
      </tagger>
    </postParseHandlers>
  </importer>

and here is a snippet of the debug output:

Wiki Crawler: 2016-10-03 10:58:10 DEBUG - Creating new input stream from memory cache.
Wiki Crawler: 2016-10-03 10:58:10 DEBUG - Removed metadata fields "Transfer-Encoding,collector.content-type,document.contentFamily,X-Parsed-By,Server,CF-RAY,collector.content-encoding,Content-Location,Connection,Date,document.reference,Cf-Railgun,dc:title,collector.is-crawl-new,Content-Encoding,Content-Type-Hint,Vary,collector.depth,robots,Content-Type" from https://wiki.linaro.org/LHG
Wiki Crawler: 2016-10-03 10:58:10 DEBUG - content_encoding=utf-8
Wiki Crawler: 2016-10-03 10:58:10 DEBUG - title=LHG - Linaro Wiki
Wiki Crawler: 2016-10-03 10:58:10 DEBUG - content_type=text/html
Wiki Crawler: 2016-10-03 10:58:10 INFO -         DOCUMENT_IMPORTED: https://wiki.linaro.org/LHG

So, it would seem that the postParseHandlers are working as I want them to. However, when it comes to submitting the content to Amazon, it still goes wrong:

"Field name (collector.content-type) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (Transfer-Encoding) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (SITE) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (document.contentFamily) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (X-Parsed-By) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (Server) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (CF-RAY) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (collector.content-encoding) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (Content-Location) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (Connection) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (document.contentEncoding) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (Date) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (Cf-Railgun) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (dc:title) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (collector.is-crawl-new) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (document.contentType) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (Content-Encoding) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (Content-Type-Hint) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (Vary) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (collector.depth) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field "robots" does not exist in domain configuration (near operation with index 2; document_id https://wiki.linaro.org/LHG)"
"Field name (Content-Type) must match the regex [a-z0-9][a-z0-9_]{0,63}$ (near operation with index 2; document_id https://wiki.linaro.org/LHG)"

I'm clearly misunderstanding something fundamental here, but I'm struggling to see what. I've configured the importer to keep only the fields I want, so why is the committer passing on fields that should have been removed?

essiembre commented 7 years ago

Just to make sure it is not trying to resend old docs, can you delete your "workdir"? If you have specified a "queueDir" for your committer, please delete that as well. This will ensure you start fresh. Let me know if that makes a difference.

pcolmer commented 7 years ago

Hi.

I've been completely deleting the workdir each time I run this, unfortunately. As far as I can tell, no queueDir is configured.

essiembre commented 7 years ago

If you have not configured one, you should find a "committer-queue" folder somewhere. Make sure to delete it. You may also want to specify a queueDir explicitly; if you have multiple crawlers defined in a single collector, make sure each uses a different path. If deleting it does not change anything, can you share your full config?
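
If you do set one explicitly, a rough sketch of what that could look like (the path is just an example, and queueDir is the standard committer option as I recall it):

<committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
  <!-- documentEndpoint, accessKey, secretKey, queueSize as in your existing config -->
  <!-- Explicit queue location; give each crawler its own directory. -->
  <queueDir>/path/to/wiki-crawler/committer-queue</queueDir>
</committer>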

pcolmer commented 7 years ago

Thanks for the pointer about the "committer-queue" folder. I have now found it, deleted it and now everything is working. Thank you for your patience.

essiembre commented 7 years ago

Great, thanks for confirming!