There are a bunch of fields the Collectors will grab and try to send. If you want to see what they all are, you can temporarily use the FileSystemCommitter (https://www.norconex.com/collectors/committer-core/latest/apidocs/com/norconex/committer/core/impl/FileSystemCommitter.html), or use the Importer DebugTagger (https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/tagger/impl/DebugTagger.html) as a pre or post handler.
You have a few options to get around this issue. For instance, you can send only the fields you are interested in, the ones without dots, using KeepOnlyTagger (https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/tagger/impl/KeepOnlyTagger.html). You can also rename the fields to remove the dots with RenameTagger (https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/tagger/impl/RenameTagger.html).
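For example, a minimal post-parse sketch (the field names are placeholders for whatever dotted fields you want to keep or rename):
<importer>
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fields>title, description, document.reference</fields>
    </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
      <!-- Azure Search rejects dots in field names, so map them to underscores -->
      <rename fromField="document.reference" toField="document_reference"/>
    </tagger>
  </postParseHandlers>
</importer>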
What would you normally use when doing an HTTP crawl? I would think the id would be the URL, or at least the URL would have to be stored somewhere in the index.
Yes, that is correct and what the Committer should do. The Collectors use document.reference to store the "id", but the Azure Committer automatically converts it to id. If you are getting this error, it suggests you are keeping the original field. Is it possible you have keep="true" here: <sourceReferenceField keep="true">...</sourceReferenceField>? Make sure it is false (not kept).
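To illustrate (a sketch only, not your actual config), the setting in question would look like this in the Committer section:
<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
  <!-- keep="false" (the default) drops the source field once it has been mapped to the "id" field -->
  <sourceReferenceField keep="false">document.reference</sourceReferenceField>
</committer>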
No, I have not set keep to true or anything, so it should default to false, correct?
Are there supposed to be other fields in the index besides id and content? It looks like the error would not allow a URL in it, and I would think you would need the URL somewhere.
You control what fields are sent in the end. Please share your Committer config, and also try the FileSystemCommitter mentioned earlier to get all the fields that would be sent to Azure. The id should be set automatically, so with KeepOnlyTagger you can perhaps keep only the few fields you want.
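As a quick way to inspect what would be sent, you could temporarily swap your Azure committer for something along these lines (the directory path is just a placeholder), which writes every document and its fields to disk:
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
  <directory>./workdir/committer-inspect</directory>
</committer>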
<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
<endpoint>https://jptestsearch.search.windows.net</endpoint>
<apiKey>MYKEY</apiKey>
<indexName>norconexha</indexName>
<targetReferenceField>id</targetReferenceField>
<targetContentField>content</targetContentField>
</committer>
Do the fields need to be created first or will Norconex create the index fields in Azure as necessary? Also, I'm not sure what you are talking about with the KeepOnlyTagger. How would I use that?
By default, all discovered fields are sent to your Committer. The Collector adds some fields of its own, in addition to the fields extracted from documents, so there are likely too many fields for what you want. The KeepOnlyTagger can be used this way inside your <crawler ...> section:
<importer>
<postParseHandlers>
<tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="INFO" />
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
<fields>myfield1, myfield2, etc</fields>
</tagger>
</postParseHandlers>
</importer>
I have also added an example of DebugTagger so you can list all discovered fields.
It looks like by default it is trying to map document.reference to the id field in the Azure index. The document.reference seems to contain the crawled URL, but I don't think you can put a URL in the Azure index id field. Do you have an example of this committer working with the HTTP crawler? It seems like the URL would need to get mapped to another index field (I still don't see how to do that yet) and id would need to get some kind of value (an integer or other simple unique value) mapped to it.
Have you tried my last suggestion? With KeepOnlyTagger, just specify a field or two to see if it gets through. For instance, what happens if you only keep "title"?
I set the KeepOnlyTagger fields to only "title" but I still get the same error as initially reported above, where it is trying to put document.reference (the URL) into an index column (I'm assuming id) that does not allow anything other than letters, numbers, and underscores.
Do you have an example config of this Azure committer working with an http crawled data source?
You should no longer be getting document.reference. It may be leftovers from your previous crawl attempts. Have you tried wiping out your working directory and the committer queue directory? If not, the crawler will attempt to send previously unsuccessful documents again.
Will look for a sample when I get a chance.
The following quick test worked just fine for me. I could see the documents added to Azure without issues.
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="testcollector">
#set($workdir = "./workdir/azure-test")
<progressDir>$workdir/progress</progressDir>
<logsDir>$workdir/logs</logsDir>
<crawlers>
<crawler id="testcrawler">
<userAgent>Identify yourself</userAgent>
<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
<url><![CDATA[https://en.wikipedia.org]]></url>
</startURLs>
<workDir>$workdir</workDir>
<maxDepth>1</maxDepth>
<maxDocuments>5</maxDocuments>
<numThreads>1</numThreads>
<robotsTxt ignore="false" />
<robotsMeta ignore="true" />
<sitemapResolverFactory ignore="true"/>
<delay default="100" />
<importer>
<postParseHandlers>
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
<fields>document.reference, document.contentFamily, document.contentType, content</fields>
</tagger>
<tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
<rename fromField="document.reference" toField="reference"/>
<rename fromField="document.contentFamily" toField="contentFamily"/>
<rename fromField="document.contentType" toField="contentType"/>
</tagger>
</postParseHandlers>
</importer>
<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
<endpoint>https://XXXXXXX.search.windows.net</endpoint>
<apiKey>XXXXXXXXXXXXXXXXXXXXXXXXXXXX</apiKey>
<indexName>XXXXXXXX</indexName>
<disableReferenceEncoding>false</disableReferenceEncoding>
<ignoreValidationErrors>false</ignoreValidationErrors>
<ignoreResponseErrors>false</ignoreResponseErrors>
<queueDir>${workdir}/committer-queue</queueDir>
</committer>
</crawler>
</crawlers>
</httpcollector>
I have created the reference, contentFamily, and contentType fields beforehand on my Azure Search index.
Wow, that's quite a lot of extra config from the defaults. :) Thanks for that.
I tried this and it still does not work, but now I'm getting a different error. I have DEBUG level turned on for the committer log. Is there something else I can turn on to get more details here?
ERROR [JobSuite] Execution failed for job: Norconex Minimum Test Page
com.norconex.committer.core.CommitterException: Could not commit JSON batch to Azure Search.
    at com.norconex.committer.azuresearch.AzureSearchCommitter.commitBatch(AzureSearchCommitter.java:411)
    at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
    at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:159)
    at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:233)
    at com.norconex.committer.azuresearch.AzureSearchCommitter.commit(AzureSearchCommitter.java:331)
    at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:270)
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:226)
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:189)
    at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
    at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
    at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
    at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
    at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to jptestsearch.search.windows.net:443 [jptestsearch.search.windows.net/13.65.194.139] failed: Connection timed out: connect
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:158)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at com.norconex.committer.azuresearch.AzureSearchCommitter.commitBatch(AzureSearchCommitter.java:403)
    ... 14 more
Caused by: java.net.ConnectException: Connection timed out: connect
    at java.net.DualStackPlainSocketImpl.connect0(Native Method)
    at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
    at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
    at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
    at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
    at java.net.PlainSocketImpl.connect(Unknown Source)
    at java.net.SocksSocketImpl.connect(Unknown Source)
    at java.net.Socket.connect(Unknown Source)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:337)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:141)
    ... 25 more
The relevant portion is
Connect to jptestsearch.search.windows.net:443 [jptestsearch.search.windows.net/13.65.194.139] failed: Connection timed out: connect
Looks like your Azure Search is not accessible from where you are crawling. Can you access it using "wget", "telnet", or some other commands to confirm you can access it from that server? Maybe you are using a proxy, or maybe access has to be granted from Azure?
I am using a proxy. I have proxy settings set up in the config for the crawl and they are working. Are there separate proxy settings for the committer?
That is the issue then. No, unfortunately, the proxy settings used for crawling are not applied to the committer connections. I will make this a feature request for the next significant release.
In the meantime, you should be able to make it work by setting the proxy settings on the JVM by modifying the launch script and adding this after "java":
java -Dhttp.proxyHost=yourProxyHost -Dhttp.proxyPort=999 ...
I just tried that and it got the same connection error. It doesn't look like it even attempted to use the proxy.
Ok, it's getting closer. :) It is attempting to use the proxy now, but the proxy is returning a 407 Proxy Authentication Required error. Unlike the HTTP crawler, which is working, this does not seem to be using the current user's Windows authentication to make the proxy connection.
Is there any way to make it use the same proxy config and network access mechanism as the crawler, since that seems to work?
Because Committers are independent of the collectors that use them (e.g. they might be used with the Filesystem Collector, which has no HTTP connections), and because web crawling and committing can be on different networks (with different proxy requirements), they cannot safely rely on the HTTP Collector connection settings.
We can add the same proxy options though. I'll mark this as a feature request to add support for proxy in the same way it is done for the HTTP Collector HttpClient configuration.
Right, and that is what I meant. Not actually use the same part of the config, but for it to use the same type of parameters and the same HTTP connection mechanism, since the crawler does work correctly with the proxy. Any idea what kind of timeframe we would be looking at here? I was looking at this for a project that will go in by the end of the year. If it will be longer than that I might need to look elsewhere, build my own, or fix it myself.
I usually don't give timelines here, but I will likely have a new release for you to test within a day or two.
That would be awesome! Thanks!
Try the new 1.1.0-SNAPSHOT. You will find the same proxy configuration options, described here.
Some dependencies were updated as well, so I suggest you use the install script so you do not miss any of them.
I'm not sure I really follow you here. First, the snapshot link you have above seems to only show a 1.1.0 snapshot, not 2.1.0. I did download it and install it with the install script though. After that, I'm not sure what I need to do. The "described here" link above does not seem to show any info about proxy configurations. I tried putting the
Thanks again for taking a look at this.
You are right, I should have written 1.1.0-SNAPSHOT, which I have corrected. The "here" link has the updated javadoc with configuration usage, which lists the proxy options. You may need to do a SHIFT-reload on your browser if you do not see them (or clear your browser cache).
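In short, the proxy options sit directly under the committer element, roughly like this (a sketch only, with placeholder values; see the javadoc for the exact list of supported tags):
<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
  <endpoint>https://XXXXXXX.search.windows.net</endpoint>
  <apiKey>XXXXXXXXXXXXXXXXXXXXXXXXXXXX</apiKey>
  <indexName>XXXXXXXX</indexName>
  <proxyHost>yourProxyHost</proxyHost>
  <proxyPort>999</proxyPort>
  <!-- optional, only if your proxy requires explicit credentials -->
  <proxyUsername>user</proxyUsername>
  <proxyPassword>password</proxyPassword>
</committer>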
Ha, I thought I tried refreshing but I guess I didn't. :) Ok, I see the config now. Looks like it is basically the same config as the
Just made a new snapshot release of the Committer that should now behave the same. Please confirm.
Ok, now got this:
java.lang.NullPointerException
    at java.util.concurrent.ConcurrentHashMap.put(Unknown Source)
    at org.apache.http.impl.client.BasicCredentialsProvider.setCredentials(BasicCredentialsProvider.java:61)
    at com.norconex.commons.lang.net.ProxySettings.createCredentialsProvider(ProxySettings.java:147)
    at com.norconex.committer.azuresearch.AzureSearchCommitter.buildHttpClient(AzureSearchCommitter.java:646)
    at com.norconex.committer.azuresearch.AzureSearchCommitter.nullSafeHttpClient(AzureSearchCommitter.java:634)
    at com.norconex.committer.azuresearch.AzureSearchCommitter.commitBatch(AzureSearchCommitter.java:416)
    at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
    at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:159)
    at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:233)
    at com.norconex.committer.azuresearch.AzureSearchCommitter.commit(AzureSearchCommitter.java:387)
    at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:270)
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:226)
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:189)
    at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
    at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
    at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
    at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
    at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
OK, I just made a new snapshot release fixing the NPE. Please confirm.
Thanks for being a good tester on that one.
Ok, looks like auth isn't being sent at all now. FYI, I'll have to test more tomorrow.
com.norconex.committer.core.CommitterException: Invalid HTTP response: "HTTP/1.1 407 Proxy Authentication Required ( Forefront TMG requires authorization to fulfill the request. Access to the Web Proxy filter is denied. )". Azure Response:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>The page cannot be displayed</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<style>
body {
Yes, but you are not setting the username and password, right? In which case it will not be sent. I do not know if HttpClient will try to grab Windows credentials as a fallback in such case. You can try to pass the credentials and you can encrypt the password if that's a concern.
But the proxy settings for the crawl do not require the username and password and are passing the current user without a problem. What is the difference? I'd rather not hard code auth values here, encrypted or not, since that would cause a maintenance problem.
I can't tell what makes the difference. The code for the crawler is here. If you can spot what makes it different with regard to using your proxy, let me know.
I'm not a Java programmer so I may be reading this wrong, but it looks like the crawler is using its own createCredentialsProvider, which seems to return null if no username is given. The committer instead seems to use a createCredentialsProvider method from a com.norconex.commons.lang.net.ProxySettings class. I have not been able to find this class, so I'm not sure what could be different there. If you can tell me the location of this code I might be able to research further.
That code is here. I may have a second look myself, but without the same environment I can't reproduce to troubleshoot further.
Even though the code is quite different, I'm not seeing a functional difference, but it's hard to tell without stepping through the code or more logging. Would it be possible to add more logging to the proxy setup and HTTP connection code?
Or if you want to tell me how I could debug it, I might be able to do that also. I haven't done any Java development in many years so I don't know what tools are best.
You know what, I think the problem may be that the site I'm crawling does not need proxy authentication (it's a company site, just accessed through the proxy) while the Azure site does need it, so neither may actually be passing Windows auth automatically. Let me do a little more research.
That would explain it. In any case, if you want more logging for connection issues, you can change the log level to DEBUG or even TRACE. I would start by doing it for Apache, changing this line in the log4j.properties file:
log4j.logger.org.apache=DEBUG
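If that is still not verbose enough, HttpClient also has dedicated logging categories you can raise individually (standard HttpClient 4.x logger names, assuming the stock log4j.properties layout):
log4j.logger.org.apache.http=DEBUG
log4j.logger.org.apache.http.wire=DEBUG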
But if you expect Windows credentials to be passed, it suggests a proxy using NTLM. Apache released a version of its HttpClient library that is more recent than the one shipped with the HTTP Collector and that adds support for NTLM. More info: https://issues.apache.org/jira/browse/HTTPCLIENT-1779
You may try upgrading that "httpclient" library.
I tried upgrading to the 4.5.3 library and that alone did not seem to change anything. I did find this sample code that seems to state that Integrated Windows Auth can be used (although it doesn't talk about a proxy), but a special Windows version of the HttpClient has to be used. What are your thoughts on this? Does it look like an option?
FYI, changing log4j.logger.org.apache to DEBUG or TRACE seemed to have no effect on log output.
I just tried and I got tons of new entries in the log. Did you change the log4j.properties located in the Collector install directory? Do you start it using the collector-http.bat script?
I will have a look at your provided link when I get a chance.
Thanks, but I had to drop the use of Norconex. I found another option that works in my scenario and that I have more control over, since it's in C#. It's not as built out and flexible as Norconex, but I can mold it to what we need. Thanks for all your help though. Feel free to close this issue if no one else needs the proxy auth to work.
No problem, I hope you'll come back when you need it.
It turns out integrating the Win Auth solution you found was fairly simple. I made a snapshot release that supports it with a new <useWindowsAuth>true</useWindowsAuth> flag. Even if you are no longer actively using it, you can help by testing it if you have a chance (I would try without proxy settings first).
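In context, it sits directly under the committer element, e.g. (endpoint, key, and index values are placeholders; other options omitted):
<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
  <endpoint>https://XXXXXXX.search.windows.net</endpoint>
  <apiKey>XXXXXXXXXXXXXXXXXXXXXXXXXXXX</apiKey>
  <indexName>XXXXXXXX</indexName>
  <!-- use the current Windows user's credentials for the proxy connection -->
  <useWindowsAuth>true</useWindowsAuth>
</committer>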
Thanks for your valuable input.
Win Auth is now part of the official release of Azure Search Committer 1.1.0.
Getting the following error when attempting to commit the items to the Azure index. Looks like it is maybe complaining about what is being put in the id field, but I don't know where this is coming from. Help? :)
ERROR [JobSuite] Execution failed for job: Norconex Minimum Test Page
com.norconex.committer.core.CommitterException: Document field cannot have one or more characters other than letters, numbers and underscores: document.reference