Norconex / committer-azuresearch

Implementation of Norconex Committer for Microsoft Azure Search.
https://opensource.norconex.com/committers/azuresearch/
Apache License 2.0

Error when committing #1

Closed jpcoder2 closed 6 years ago

jpcoder2 commented 7 years ago

Getting the following error when attempting to commit the items to the Azure index. Looks like it is maybe complaining about what is being put in the id field, but I don't know where this is coming from. Help? :)

ERROR [JobSuite] Execution failed for job: Norconex Minimum Test Page com.norconex.committer.core.CommitterException: Document field cannot have one or more characters other than letters, numbers and underscores: document.reference

essiembre commented 7 years ago

There are a bunch of fields the Collectors will grab and try to send. If you want to see what they all are, you can temporarily use the FileSystemCommitter (https://www.norconex.com/collectors/committer-core/latest/apidocs/com/norconex/committer/core/impl/FileSystemCommitter.html), or use the Importer DebugTagger (https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/tagger/impl/DebugTagger.html) as a pre- or post-handler.

You have a few options to get around this issue. For instance, you can send only the fields you are interested in (ones without dots) using KeepOnlyTagger (https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/tagger/impl/KeepOnlyTagger.html). You can also rename the fields to remove the dots with RenameTagger (https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/tagger/impl/RenameTagger.html).
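
For example, a minimal sketch of the renaming approach (the field mapping shown is illustrative; the full working example further down in this thread uses the same idea):

<importer>
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
      <!-- Rename dotted fields to dot-free equivalents. -->
      <rename fromField="document.reference" toField="reference"/>
    </tagger>
  </postParseHandlers>
</importer>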

jpcoder2 commented 7 years ago

What would you normally use when doing an http crawl? I would think the id would be the url. Or at least the url would have to be stored somewhere in the index.

essiembre commented 7 years ago

Yes, that is correct and what the committer should do. The Collectors use document.reference to store the "id". When it comes to the Azure Committer, it automatically converts it to id. If you are getting this error, it suggests you are keeping the original field. Is it possible you have keep set to true here: <sourceReferenceField keep="true">...</sourceReferenceField>? Make sure it is false (not kept).
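
For reference, this is roughly how that option sits in the committer config (a sketch based on the snippet above, not a full config):

<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
  ...
  <!-- keep="false" means document.reference is not sent as its own field
       once its value has been mapped to the Azure "id" field. -->
  <sourceReferenceField keep="false">document.reference</sourceReferenceField>
  ...
</committer>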

jpcoder2 commented 7 years ago

No, I have not set keep to true or anything, so it should default to false, correct?

jpcoder2 commented 7 years ago

Are there supposed to be other fields in the index besides id and content? It looks like the error would not allow a URL in it, and I would think you would need the URL somewhere.

essiembre commented 7 years ago

You control what fields are sent in the end. Please share your Committer config, and also try the FileSystemCommitter mentioned earlier to get all fields that would be sent to Azure. The id should be set automatically, so with KeepOnlyTagger you can keep only the few fields you want.
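
A minimal sketch of temporarily swapping in the FileSystemCommitter to dump outgoing documents to disk (the directory value is illustrative):

<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
  <!-- Each committed document and its fields are written under this
       directory so you can inspect exactly what would be sent. -->
  <directory>./committer-inspect</directory>
</committer>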

jpcoder2 commented 7 years ago

    <committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
        <endpoint>https://jptestsearch.search.windows.net</endpoint>
        <apiKey>MYKEY</apiKey>
        <indexName>norconexha</indexName>
        <targetReferenceField>id</targetReferenceField>
        <targetContentField>content</targetContentField>
    </committer>

Do the fields need to be created first or will Norconex create the index fields in Azure as necessary? Also, I'm not sure what you are talking about with the KeepOnlyTagger. How would I use that?

essiembre commented 7 years ago

By default, all fields discovered are sent to your Committer. The Collector adds some fields, in addition to fields extracted from documents. So there are likely too many fields for what you want. The KeepOnlyTagger can be used this way inside your <crawler ...> section:

<importer>
    <postParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="INFO" />
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>myfield1, myfield2, etc</fields>
        </tagger>
    </postParseHandlers>
</importer>

I have also added an example of DebugTagger so you can list all discovered fields.

jpcoder2 commented 7 years ago

It looks like by default it is trying to map document.reference to the id field in the Azure index. The document.reference field seems to have the crawled URL in it, but I don't think you can put a URL in the Azure index id field. Do you have an example of this committer working with the HTTP crawler? It seems like the URL would need to get mapped to another index field (I still don't see how to do that) and id would need some kind of value (an integer or other simple unique value) mapped to it.

essiembre commented 7 years ago

Have you tried my last suggestion? With KeepOnlyTagger, just specify a field or two to see if it gets through. For instance, what happens if you only keep "title"?

jpcoder2 commented 7 years ago

I set the KeepOnlyTagger fields to only "title", but I still get the same error as initially reported above, where it is trying to put document.reference (the URL) into an index column (I'm assuming id) that does not allow anything other than letters, numbers, and underscores.

jpcoder2 commented 7 years ago

Do you have an example config of this Azure committer working with an http crawled data source?

essiembre commented 7 years ago

You should no longer be getting document.reference. It may be remnants from your previous crawl attempts. Have you tried wiping out your working directory and the committer queue directory? If not, the crawler will attempt to send previously unsuccessful documents again.

Will look for a sample when I get a chance.

essiembre commented 7 years ago

The following quick test worked just fine for me. I could see the documents added to Azure without issues.

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="testcollector">

  #set($workdir = "./workdir/azure-test")

  <progressDir>$workdir/progress</progressDir>
  <logsDir>$workdir/logs</logsDir>

  <crawlers>
    <crawler id="testcrawler">

      <userAgent>Identify yourself</userAgent>

      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url><![CDATA[https://en.wikipedia.org]]></url>
      </startURLs>

      <workDir>$workdir</workDir>
      <maxDepth>1</maxDepth>
      <maxDocuments>5</maxDocuments>
      <numThreads>1</numThreads>
      <robotsTxt ignore="false" />
      <robotsMeta ignore="true" />
      <sitemapResolverFactory ignore="true"/>
      <delay default="100" />

      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>document.reference, document.contentFamily, document.contentType, content</fields>
          </tagger>            
          <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="document.reference" toField="reference"/>
            <rename fromField="document.contentFamily" toField="contentFamily"/>
            <rename fromField="document.contentType" toField="contentType"/>
          </tagger>            
        </postParseHandlers>
      </importer>      

      <committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
          <endpoint>https://XXXXXXX.search.windows.net</endpoint>
          <apiKey>XXXXXXXXXXXXXXXXXXXXXXXXXXXX</apiKey>
          <indexName>XXXXXXXX</indexName>
          <disableReferenceEncoding>false</disableReferenceEncoding>
          <ignoreValidationErrors>false</ignoreValidationErrors>
          <ignoreResponseErrors>false</ignoreResponseErrors>
          <queueDir>${workdir}/committer-queue</queueDir>
      </committer>      

    </crawler>
  </crawlers>

</httpcollector>

I created the reference, contentFamily, and contentType fields beforehand on my Azure Search index.

jpcoder2 commented 7 years ago

Wow, that's quite a lot of extra config from the defaults. :) Thanks for that.

I tried this and it still does not work, but now I'm getting a different error. I have DEBUG level turned on for the committer log. Is there something else I can turn on to get more details here?

ERROR [JobSuite] Execution failed for job: Norconex Minimum Test Page
com.norconex.committer.core.CommitterException: Could not commit JSON batch to Azure Search.
        at com.norconex.committer.azuresearch.AzureSearchCommitter.commitBatch(AzureSearchCommitter.java:411)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
        at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:159)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:233)
        at com.norconex.committer.azuresearch.AzureSearchCommitter.commit(AzureSearchCommitter.java:331)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:270)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:226)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:189)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to jptestsearch.search.windows.net:443 [jptestsearch.search.windows.net/13.65.194.139] failed: Connection timed out: connect
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:158)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at com.norconex.committer.azuresearch.AzureSearchCommitter.commitBatch(AzureSearchCommitter.java:403)
        ... 14 more
Caused by: java.net.ConnectException: Connection timed out: connect
        at java.net.DualStackPlainSocketImpl.connect0(Native Method)
        at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
        at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
        at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
        at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
        at java.net.PlainSocketImpl.connect(Unknown Source)
        at java.net.SocksSocketImpl.connect(Unknown Source)
        at java.net.Socket.connect(Unknown Source)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:337)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:141)
        ... 25 more

essiembre commented 7 years ago

The relevant portion is

Connect to jptestsearch.search.windows.net:443 [jptestsearch.search.windows.net/13.65.194.139] failed:  Connection timed out: connect

Looks like your Azure Search is not accessible from where you are crawling. Can you access it using "wget", "telnet", or some other commands to confirm you can access it from that server? Maybe you are using a proxy, or maybe access has to be granted from Azure?

jpcoder2 commented 7 years ago

I am using a proxy. I have proxy settings set up in the config for the crawl and they are working. Are there separate proxy settings for the committer?

essiembre commented 7 years ago

That is the issue then. No, unfortunately, the proxy settings used for crawling are not applied to the committer connections. I will make this a feature request for the next significant release.

In the meantime, you should be able to make it work by setting the proxy settings on the JVM by modifying the launch script and adding this after "java":

java -Dhttp.proxyHost=yourProxyHost -Dhttp.proxyPort=999 ...

jpcoder2 commented 7 years ago

I just tried that and it got the same connection error. It doesn't look like it even attempted to use the proxy.

essiembre commented 7 years ago

Please try the same JVM options (as defined here) with this new snapshot.

jpcoder2 commented 7 years ago

Ok, it's getting closer. :) It is attempting to use the proxy now, but the proxy is returning a 407 Proxy Authentication Required error. Unlike the HTTP crawler, which is working, this does not seem to be using the current user's Windows authentication to make the proxy connection.

jpcoder2 commented 7 years ago

Any way to make it use the same proxy config and network access mechanism as the crawler, since that seems to work?

essiembre commented 7 years ago

Because Committers are independent of the collectors that use them (e.g., they might be used with the Filesystem Collector, which has no HTTP connections), and because web crawling and committing can happen on different networks (with different proxy requirements), the Committer cannot safely rely on the HTTP Collector connection settings.

We can add the same proxy options though. I'll mark this as a feature request to add support for proxy in the same way it is done for the HTTP Collector HttpClient configuration.

jpcoder2 commented 7 years ago

Right, and that is what I meant: not actually using the same part of the config, but using the same type of parameters and the same HTTP connection mechanism, since the crawler does work correctly with the proxy. Any idea what kind of timeframe we would be looking at here? I was looking at this for a project that will go in by the end of the year. If it will take longer than that, I might need to look elsewhere, build my own, or fix it myself.

essiembre commented 7 years ago

I usually don't give timelines here, but I will likely have a new release for you to test within a day or two.

jpcoder2 commented 7 years ago

That would be awesome! Thanks!

essiembre commented 7 years ago

Try the new 1.1.0-SNAPSHOT. You will find the same proxy configuration options, described here.

Some dependencies were updated as well, so I suggest you use the install script so you do not miss any of them.
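
Assuming the options mirror the HTTP Collector's proxy settings (as the previous comment suggests), the committer config would look roughly like this; the tag names and values here are illustrative, so check the linked javadoc for the authoritative list:

<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
  ...
  <!-- Proxy settings (names assumed to match the HTTP Collector's). -->
  <proxyHost>myproxy.example.com</proxyHost>
  <proxyPort>8080</proxyPort>
  <proxyUsername>myuser</proxyUsername>
  <proxyPassword>mypassword</proxyPassword>
  ...
</committer>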

jpcoder2 commented 7 years ago

I'm not sure I really follow you here. First, the snapshot link you have above seems to only show a 1.1.0 snapshot, not 2.1.0. I did download it and install it with the install script though. After that, I'm not sure what I need to do. The "described here" link above does not seem to show any info about proxy configurations. I tried putting the config in the section just like for the http crawler but that did not seem to work. What configuration should be necessary now?

Thanks again for taking a look at this.

essiembre commented 7 years ago

You are right, I should have written 1.1.0-SNAPSHOT, which I have corrected. The "here" link has the updated javadoc with configuration usage, which lists the proxy options. You may need to do a SHIFT-reload on your browser if you do not see them (or clear your browser cache).

jpcoder2 commented 7 years ago

Ha, I thought I tried refreshing but I guess I didn't. :) Ok, I see the config now. Looks like it is basically the same config as the crawler's, but without the parent element. I tried the same settings I was using for the crawler, and it does show that it is using the proxy, but I'm getting an error that "Username may not be null". The crawler proxy config does not require the username and uses the current Windows auth. Could this do the same? It would be much more convenient than having to hard-code credentials in the config.

essiembre commented 7 years ago

Just made a new snapshot release of the Committer that should now behave the same. Please confirm.

jpcoder2 commented 7 years ago

Ok, now got this:

java.lang.NullPointerException
        at java.util.concurrent.ConcurrentHashMap.put(Unknown Source)
        at org.apache.http.impl.client.BasicCredentialsProvider.setCredentials(BasicCredentialsProvider.java:61)
        at com.norconex.commons.lang.net.ProxySettings.createCredentialsProvider(ProxySettings.java:147)
        at com.norconex.committer.azuresearch.AzureSearchCommitter.buildHttpClient(AzureSearchCommitter.java:646)
        at com.norconex.committer.azuresearch.AzureSearchCommitter.nullSafeHttpClient(AzureSearchCommitter.java:634)
        at com.norconex.committer.azuresearch.AzureSearchCommitter.commitBatch(AzureSearchCommitter.java:416)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
        at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:159)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:233)
        at com.norconex.committer.azuresearch.AzureSearchCommitter.commit(AzureSearchCommitter.java:387)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:270)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:226)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:189)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)

essiembre commented 7 years ago

OK, I just made a new snapshot release fixing the NPE. Please confirm.

Thanks for being a good tester on that one.

jpcoder2 commented 7 years ago

Ok, looks like auth isn't being sent at all now. FYI, I'll have to test more tomorrow.

com.norconex.committer.core.CommitterException: Invalid HTTP response: "HTTP/1.1 407 Proxy Authentication Required ( Forefront TMG requires authorization to fulfill the request. Access to the Web Proxy filter is denied. )". Azure Response:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>

<title>The page cannot be displayed</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />

<style>
body {

essiembre commented 7 years ago

Yes, but you are not setting the username and password, right? In that case they will not be sent. I do not know if HttpClient will try to grab Windows credentials as a fallback in such a case. You can try passing the credentials, and you can encrypt the password if that is a concern.

jpcoder2 commented 7 years ago

But the proxy settings for the crawl do not require the username and password and are passing the current user without a problem. What is the difference? I'd rather not hard code auth values here, encrypted or not, since that would cause a maintenance problem.

essiembre commented 7 years ago

I can't tell what makes the difference. The code for the crawler is here. If you can spot what makes it different in regards to using your proxy, let me know.

jpcoder2 commented 7 years ago

I'm not a Java programmer so I may be reading this wrong, but it looks like the crawler uses its own createCredentialsProvider, which seems to return null if no username is given. The committer instead seems to use the createCredentialsProvider method from the com.norconex.commons.lang.net.ProxySettings class. I have not been able to find this class, so I'm not sure what could be different there. If you can tell me the location of this code I might be able to research further.

essiembre commented 7 years ago

That code is here. I may have a second look myself, but without the same environment I can't reproduce to troubleshoot further.

jpcoder2 commented 7 years ago

Even though the code is quite different, I'm not seeing a functional difference, but it's hard to tell without stepping through the code or more logging. Would it be possible to add more logging to the proxy settings and HTTP connection code?

jpcoder2 commented 7 years ago

Or if you want to tell me how I could debug it I might be able to do that also. I haven't done any java dev in many years so I don't know what tools are best.

jpcoder2 commented 7 years ago

You know what, I think the problem may be that the site I'm crawling does not need proxy authentication (it's a company site, just accessed through the proxy) while the Azure site does need authentication, so neither may actually be passing Windows auth automatically. Let me do a little more research.

essiembre commented 7 years ago

That would explain it. In any case, if you want more logging for connection issues, you can change the log level to DEBUG or even TRACE. I would start by doing it for Apache, changing this line in the log4j.properties file:

log4j.logger.org.apache=DEBUG
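
To also get more detail from the committer itself, adding a logger for its package should work the same way (logger name inferred from the package names in the stack traces above):

log4j.logger.org.apache=DEBUG
log4j.logger.com.norconex.committer=DEBUG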

But if you expect Windows credentials to be passed, it suggests a proxy using NTLM. Apache released a version of its HttpClient library that is more recent than the one shipped with the HTTP Collector and that adds support for NTLM. More info: https://issues.apache.org/jira/browse/HTTPCLIENT-1779

You may try upgrading that "httpclient" library.

jpcoder2 commented 7 years ago

I tried upgrading to the 4.5.3 library and that alone did not seem to change anything. I did find this sample code that seems to state that Integrated Windows Auth can be used (although it doesn't talk about a proxy), but a special Windows version of the httpclient has to be used. What are your thoughts on this? Does it look like an option?

http://hc.apache.org/httpcomponents-client-4.4.x/httpclient-win/examples/org/apache/http/examples/client/win/ClientWinAuth.java

jpcoder2 commented 7 years ago

FYI, changing log4j.logger.org.apache to DEBUG or TRACE seemed to have no effect on log output.

essiembre commented 7 years ago

I just tried and I got tons of new entries in the log. Did you change the log4j.properties located in the Collector install directory? Do you start it using the collector-http.bat script?

I will have a look at your provided link when I get a chance.

jpcoder2 commented 7 years ago

Thanks, but I had to drop the usage of Norconex. I found another option that works in my scenario and that I have more control over, since it's in C#. It's not as built out and flexible as Norconex, but I can mold it to what we need. Thanks for all your help though. Feel free to close this issue if no one else needs the proxy auth to work.

essiembre commented 7 years ago

No problem, I hope you'll come back when you need it.

It turns out integrating the Win Auth solution you found was fairly simple. I made a snapshot release that supports it with a new <useWindowsAuth>true</useWindowsAuth> flag. Even if you are no longer actively using it, you can help by testing it if you have a chance (I would try without proxy settings first).
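
In config form, that flag would look something like this (a sketch; other committer settings omitted):

<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
  ...
  <!-- Use the current Windows user's credentials (e.g., for NTLM proxies). -->
  <useWindowsAuth>true</useWindowsAuth>
  ...
</committer>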

Thanks for your valuable input.

essiembre commented 6 years ago

Win Auth is now part of the official release of Azure Search Committer 1.1.0.