Closed bobertb closed 7 years ago
From your log, it seems the page was rejected for some reason (REJECTED_BAD_STATUS). You can change the log level in the log4j.properties to DEBUG to find out more information about the rejection. E.g.:
log4j.logger.CrawlerEvent.REJECTED_BAD_STATUS=DEBUG
I tried accessing http://libro.coacd.org/index.cfm myself but the site cannot be resolved. Is this a valid URL?
It is rejected with 401 Unauthorized, and it is an internal intranet, so you will not be able to access it.
The Collector is not responding to the 401 with authentication at all, just ACK and FIN ACK. log4j.properties are all at DEBUG.
DEBUG [QueueReferenceStage] Queued for processing: http://libro.coacd.org/index.cfm
INFO [HttpCrawler] 1 start URLs identified.
INFO [CrawlerEventManager] CRAWLER_STARTED (Subject: com.norconex.collector.http.crawler.HttpCrawler@7ceb3185)
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: Crawling references...
DEBUG [AbstractCrawler] Libro.coacd.org Test Page 1: Crawler thread #1 started.
DEBUG [AbstractCrawler] Libro.coacd.org Test Page 1: Crawler thread #2 started.
DEBUG [AbstractCrawler] Libro.coacd.org Test Page 1: Processing reference: http://libro.coacd.org/index.cfm
DEBUG [GenericDocumentFetcher] Fetching document: http://libro.coacd.org/index.cfm
DEBUG [GenericDocumentFetcher] Encoded URI: http://libro.coacd.org/index.cfm
DEBUG [GenericDocumentFetcher] Unsupported HTTP Response: HTTP/1.1 401 Unauthorized
INFO [CrawlerEventManager] REJECTED_BAD_STATUS: http://libro.coacd.org/index.cfm (Subject: HttpFetchResponse [crawlState=BAD_STATUS, statusCode=401, reasonPhrase=Unauthorized])
DEBUG [Pipeline] Pipeline execution stopped at stage: com.norconex.collector.http.pipeline.importer.DocumentFetcherStage@73579aad
DEBUG [FileJobStatusStore] Created status file: /opt/collector/./libro-output/complex/progress/latest/status/Libro.coacd.org_32_Test_32_Page_32_1Libro.coacd.org_32_Test_32_Page_32_1.job
DEBUG [FileJobStatusStore] Writing status file: /opt/collector/./libro-output/complex/progress/latest/status/Libro.coacd.org_32_Test_32_Page_32_1Libro.coacd.org_32_Test_32_Page_32_1.job
DEBUG [FileJobStatusStore] Writing status file: /opt/collector/./libro-output/complex/progress/latest/status/Libro.coacd.org_32_Test_32_Page_32_1__Libro.coacd.org_32_Test_32_Page_32_1.job
DEBUG [AbstractCrawler] Libro.coacd.org Test Page 1: 00:00:00.215 to process: http://libro.coacd.org/index.cfm
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: Deleting orphan references (if any)...
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: Deleted 0 orphan references...
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: Crawler finishing: committing documents.
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: 1 reference(s) processed.
DEBUG [AbstractCrawler] Libro.coacd.org Test Page 1: Removing empty directories
INFO [CrawlerEventManager] CRAWLER_FINISHED (Subject: com.norconex.collector.http.crawler.HttpCrawler@7ceb3185)
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: Crawler completed.
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: Crawler executed in 1 second.
DEBUG [FileJobStatusStore] Writing status file: /opt/collector/./libro-output/complex/progress/latest/status/Libro.coacd.org_32_Test_32_Page_32_1__Libro.coacd.org_32_Test_32_Page_32_1.job
INFO [JobSuite] Running Libro.coacd.org Test Page 1: END (Mon Nov 07 10:57:00 CST 2016)
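When a crawl log is longer than this one, it can help to pull out only the rejected fetches and their status codes. The following is a minimal sketch, assuming the REJECTED_BAD_STATUS event lines keep the format shown in the log above; the function name `rejected_fetches` is made up for illustration and is not part of the Collector.

```python
import re

# Pattern based on the REJECTED_BAD_STATUS line in the log above.
# Assumption: other Collector versions may format this event differently.
EVENT = re.compile(
    r"REJECTED_BAD_STATUS: (?P<url>\S+) \(Subject: HttpFetchResponse "
    r"\[crawlState=(?P<state>\w+), statusCode=(?P<code>\d+), "
    r"reasonPhrase=(?P<reason>[^\]]*)\]\)"
)


def rejected_fetches(log_text):
    """Return a (url, status_code, reason) tuple for each rejected fetch."""
    return [
        (m["url"], int(m["code"]), m["reason"])
        for m in EVENT.finditer(log_text)
    ]


sample = (
    "INFO [CrawlerEventManager] REJECTED_BAD_STATUS: "
    "http://libro.coacd.org/index.cfm (Subject: HttpFetchResponse "
    "[crawlState=BAD_STATUS, statusCode=401, reasonPhrase=Unauthorized])"
)
print(rejected_fetches(sample))
```

A 401 in this list points at authentication rather than at the URL itself, which matches what the packet capture shows.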
Here is what Wireshark sees; the Collector is not responding to the 401 Unauthorized correctly.
No. Time Source Destination Protocol Length Info 104953 745.165453 collector.ip libro.ip TCP 74 55882→80 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=862392138 TSecr=0 WS=128
Frame 104953: 74 bytes on wire (592 bits), 74 bytes captured (592 bits) on interface 0 Ethernet II, Src: CiscoInc_24:51:c1 (40:55:39:24:51:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34) Internet Protocol Version 4, Src: collector.ip, Dst: libro.ip Transmission Control Protocol, Src Port: 55882, Dst Port: 80, Seq: 0, Len: 0 Source Port: 55882 Destination Port: 80 [Stream index: 1028] [TCP Segment Len: 0] Sequence number: 0 (relative sequence number) Acknowledgment number: 0 Header Length: 40 bytes Flags: 0x002 (SYN) Window size value: 29200 [Calculated window size: 29200] Checksum: 0x10a7 [unverified] [Checksum Status: Unverified] Urgent pointer: 0 Options: (20 bytes), Maximum segment size, SACK permitted, Timestamps, No-Operation (NOP), Window scale
No. Time Source Destination Protocol Length Info 104954 745.165570 libro.ip collector.ip TCP 74 80→55882 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1 TSval=2298792493 TSecr=862392138
Frame 104954: 74 bytes on wire (592 bits), 74 bytes captured (592 bits) on interface 0 Ethernet II, Src: Vmware_91:60:34 (00:50:56:91:60:34), Dst: All-HSRP-routers_de (00:00:0c:07:ac:de) Internet Protocol Version 4, Src: libro.ip, Dst: collector.ip Transmission Control Protocol, Src Port: 80, Dst Port: 55882, Seq: 0, Ack: 1, Len: 0 Source Port: 80 Destination Port: 55882 [Stream index: 1028] [TCP Segment Len: 0] Sequence number: 0 (relative sequence number) Acknowledgment number: 1 (relative ack number) Header Length: 40 bytes Flags: 0x012 (SYN, ACK) Window size value: 8192 [Calculated window size: 8192] Checksum: 0x3bab [unverified] [Checksum Status: Unverified] Urgent pointer: 0 Options: (20 bytes), Maximum segment size, No-Operation (NOP), Window scale, SACK permitted, Timestamps [SEQ/ACK analysis]
No. Time Source Destination Protocol Length Info 104955 745.166639 collector.ip libro.ip TCP 66 55882→80 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=862392139 TSecr=2298792493
Frame 104955: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0 Ethernet II, Src: CiscoInc_24:51:c1 (40:55:39:24:51:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34) Internet Protocol Version 4, Src: collector.ip, Dst: libro.ip Transmission Control Protocol, Src Port: 55882, Dst Port: 80, Seq: 1, Ack: 1, Len: 0 Source Port: 55882 Destination Port: 80 [Stream index: 1028] [TCP Segment Len: 0] Sequence number: 1 (relative sequence number) Acknowledgment number: 1 (relative ack number) Header Length: 32 bytes Flags: 0x010 (ACK) Window size value: 229 [Calculated window size: 29312] [Window size scaling factor: 128] Checksum: 0x8992 [unverified] [Checksum Status: Unverified] Urgent pointer: 0 Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps [SEQ/ACK analysis]
No. Time Source Destination Protocol Length Info 104956 745.179676 collector.ip libro.ip HTTP 213 GET /index.cfm HTTP/1.1
Frame 104956: 213 bytes on wire (1704 bits), 213 bytes captured (1704 bits) on interface 0 Ethernet II, Src: CiscoInc_24:51:c1 (40:55:39:24:51:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34) Internet Protocol Version 4, Src: collector.ip, Dst: libro.ip Transmission Control Protocol, Src Port: 55882, Dst Port: 80, Seq: 1, Ack: 1, Len: 147 Source Port: 55882 Destination Port: 80 [Stream index: 1028] [TCP Segment Len: 147] Sequence number: 1 (relative sequence number) [Next sequence number: 148 (relative sequence number)] Acknowledgment number: 1 (relative ack number) Header Length: 32 bytes Flags: 0x018 (PSH, ACK) Window size value: 229 [Calculated window size: 29312] [Window size scaling factor: 128] Checksum: 0xf637 [unverified] [Checksum Status: Unverified] Urgent pointer: 0 Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps [SEQ/ACK analysis] Hypertext Transfer Protocol
No. Time Source Destination Protocol Length Info 104957 745.179958 libro.ip collector.ip HTTP 274 HTTP/1.1 401 Unauthorized
Frame 104957: 274 bytes on wire (2192 bits), 274 bytes captured (2192 bits) on interface 0 Ethernet II, Src: Vmware_91:60:34 (00:50:56:91:60:34), Dst: All-HSRP-routers_de (00:00:0c:07:ac:de) Internet Protocol Version 4, Src: libro.ip, Dst: collector.ip Transmission Control Protocol, Src Port: 80, Dst Port: 55882, Seq: 1, Ack: 148, Len: 208 Source Port: 80 Destination Port: 55882 [Stream index: 1028] [TCP Segment Len: 208] Sequence number: 1 (relative sequence number) [Next sequence number: 209 (relative sequence number)] Acknowledgment number: 148 (relative ack number) Header Length: 32 bytes Flags: 0x018 (PSH, ACK) Window size value: 514 [Calculated window size: 131584] [Window size scaling factor: 256] Checksum: 0x4569 [unverified] [Checksum Status: Unverified] Urgent pointer: 0 Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps [SEQ/ACK analysis] Hypertext Transfer Protocol
No. Time Source Destination Protocol Length Info 104958 745.181166 collector.ip libro.ip TCP 66 55882→80 [ACK] Seq=148 Ack=209 Win=30336 Len=0 TSval=862392154 TSecr=2298792495
Frame 104958: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0 Ethernet II, Src: CiscoInc_24:51:c1 (40:55:39:24:51:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34) Internet Protocol Version 4, Src: collector.ip, Dst: libro.ip Transmission Control Protocol, Src Port: 55882, Dst Port: 80, Seq: 148, Ack: 209, Len: 0 Source Port: 55882 Destination Port: 80 [Stream index: 1028] [TCP Segment Len: 0] Sequence number: 148 (relative sequence number) Acknowledgment number: 209 (relative ack number) Header Length: 32 bytes Flags: 0x010 (ACK) Window size value: 237 [Calculated window size: 30336] [Window size scaling factor: 128] Checksum: 0x8816 [unverified] [Checksum Status: Unverified] Urgent pointer: 0 Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps [SEQ/ACK analysis]
No. Time Source Destination Protocol Length Info 104961 745.262477 collector.ip libro.ip TCP 66 55882→80 [FIN, ACK] Seq=148 Ack=209 Win=30336 Len=0 TSval=862392234 TSecr=2298792495
Frame 104961: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0 Ethernet II, Src: CiscoInc_24:51:c1 (40:55:39:24:51:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34) Internet Protocol Version 4, Src: collector.ip, Dst: libro.ip Transmission Control Protocol, Src Port: 55882, Dst Port: 80, Seq: 148, Ack: 209, Len: 0 Source Port: 55882 Destination Port: 80 [Stream index: 1028] [TCP Segment Len: 0] Sequence number: 148 (relative sequence number) Acknowledgment number: 209 (relative ack number) Header Length: 32 bytes Flags: 0x011 (FIN, ACK) Window size value: 237 [Calculated window size: 30336] [Window size scaling factor: 128] Checksum: 0x87c5 [unverified] [Checksum Status: Unverified] Urgent pointer: 0 Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
No. Time Source Destination Protocol Length Info 104962 745.262574 libro.ip collector.ip TCP 66 80→55882 [FIN, ACK] Seq=209 Ack=149 Win=131584 Len=0 TSval=2298792503 TSecr=862392234
Frame 104962: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0 Ethernet II, Src: Vmware_91:60:34 (00:50:56:91:60:34), Dst: All-HSRP-routers_de (00:00:0c:07:ac:de) Internet Protocol Version 4, Src: libro.ip, Dst: collector.ip Transmission Control Protocol, Src Port: 80, Dst Port: 55882, Seq: 209, Ack: 149, Len: 0 Source Port: 80 Destination Port: 55882 [Stream index: 1028] [TCP Segment Len: 0] Sequence number: 209 (relative sequence number) Acknowledgment number: 149 (relative ack number) Header Length: 32 bytes Flags: 0x011 (FIN, ACK) Window size value: 514 [Calculated window size: 131584] [Window size scaling factor: 256] Checksum: 0x86a7 [unverified] [Checksum Status: Unverified] Urgent pointer: 0 Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps [SEQ/ACK analysis]
No. Time Source Destination Protocol Length Info 104963 745.263987 collector.ip libro.ip TCP 66 55882→80 [ACK] Seq=149 Ack=210 Win=30336 Len=0 TSval=862392236 TSecr=2298792503
Frame 104963: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0 Ethernet II, Src: CiscoInc_24:51:c1 (40:55:39:24:51:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34) Internet Protocol Version 4, Src: collector.ip, Dst: libro.ip Transmission Control Protocol, Src Port: 55882, Dst Port: 80, Seq: 149, Ack: 210, Len: 0 Source Port: 55882 Destination Port: 80 [Stream index: 1028] [TCP Segment Len: 0] Sequence number: 149 (relative sequence number) Acknowledgment number: 210 (relative ack number) Header Length: 32 bytes Flags: 0x010 (ACK) Window size value: 237 [Calculated window size: 30336] [Window size scaling factor: 128] Checksum: 0x87ba [unverified] [Checksum Status: Unverified] Urgent pointer: 0 Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps [SEQ/ACK analysis]
Thanks, Bob
Is your site really using BASIC authentication? That is, the browser-supplied popup? I just tried this method on a test site and it works fine. Not all sites require you to specify the host/name/port so you may want to try without them to see if it makes a difference. So just try with this:
<httpClientFactory>
<authMethod>basic</authMethod>
<authUsername>user</authUsername>
<authPassword>password</authPassword>
</httpClientFactory>
Yes, it uses Basic and/or NTLM, which I can verify via wget and curl. Yes, the browser will prompt you for a login. I used the settings you posted above, with the same result. The Collector is not responding to the HTTP/1.1 401 Unauthorized packet it gets from the web server at all, other than to ACK and FIN/ACK. It does not negotiate or respond correctly. Here are the SYN, the GET, and the 401. I stripped out all the ACKs, but it is clearly not sending any auth info in the GET.
No. Time Source Destination Protocol Length Info 15928 17.675463 COLLECTOR.IP LIBRO.IP TCP 74 55940→80 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=877900418 TSecr=0 WS=128
Frame 15928: 74 bytes on wire (592 bits), 74 bytes captured (592 bits) on interface 0 Ethernet II, Src: CiscoInc_23:91:c1 (40:55:39:23:91:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34) Internet Protocol Version 4, Src: COLLECTOR.IP, Dst: LIBRO.IP Transmission Control Protocol, Src Port: 55940, Dst Port: 80, Seq: 0, Len: 0 Source Port: 55940 Destination Port: 80 [Stream index: 120] [TCP Segment Len: 0] Sequence number: 0 (relative sequence number) Acknowledgment number: 0 Header Length: 40 bytes Flags: 0x002 (SYN)
No. Time Source Destination Protocol Length Info 15931 17.683053 COLLECTOR.IP LIBRO.IP HTTP 213 GET /index.cfm HTTP/1.1
Frame 15931: 213 bytes on wire (1704 bits), 213 bytes captured (1704 bits) on interface 0 Ethernet II, Src: CiscoInc_23:91:c1 (40:55:39:23:91:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34) Internet Protocol Version 4, Src: COLLECTOR.IP, Dst: LIBRO.IP Transmission Control Protocol, Src Port: 55940, Dst Port: 80, Seq: 1, Ack: 1, Len: 147 Source Port: 55940 Destination Port: 80 [Stream index: 120] [TCP Segment Len: 147] Sequence number: 1 (relative sequence number) [Next sequence number: 148 (relative sequence number)] Acknowledgment number: 1 (relative ack number) Header Length: 32 bytes Flags: 0x018 (PSH, ACK)
No. Time Source Destination Protocol Length Info 15932 17.694951 LIBRO.IP COLLECTOR.IP HTTP 274 HTTP/1.1 401 Unauthorized
Frame 15932: 274 bytes on wire (2192 bits), 274 bytes captured (2192 bits) on interface 0 Ethernet II, Src: Vmware_91:60:34 (00:50:56:91:60:34), Dst: All-HSRP-routers_de (00:00:0c:07:ac:de) Internet Protocol Version 4, Src: LIBRO.IP, Dst: COLLECTOR.IP Transmission Control Protocol, Src Port: 80, Dst Port: 55940, Seq: 1, Ack: 148, Len: 208 Source Port: 80 Destination Port: 55940 [Stream index: 120] [TCP Segment Len: 208] Sequence number: 1 (relative sequence number) [Next sequence number: 209 (relative sequence number)] Acknowledgment number: 148 (relative ack number) Header Length: 32 bytes Flags: 0x018 (PSH, ACK)
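For comparison, this is roughly what a client answering a Basic challenge would add to its next GET. It is a minimal sketch of the standard Basic scheme (base64 of `username:password`), not of what the Collector does internally; the helper name `basic_auth_header` and the credentials are placeholders.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header line for HTTP Basic authentication."""
    # Basic credentials are "username:password", base64-encoded.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Authorization: Basic {token}"


# Placeholder credentials, matching the sample config earlier in the thread.
print(basic_auth_header("user", "password"))
```

With challenge-response authentication, a first GET without credentials is normal. The thing to look for in the capture is whether a second GET carrying an Authorization header follows the 401; in the traces above, none does.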
From what you pasted, it looks like you are not using BASIC authentication, but rather NTLM:
WWW-Authenticate: NTLM\r\n
As described in GenericHttpClientFactory documentation, NTLM requires that you also specify these two:
<authWorkstation>...</authWorkstation>
<authDomain>...</authDomain>
I hope this will work for you. Without being able to reproduce, it is hard to help further. HTTP Collector uses Apache HttpClient to perform authentication. You may find more insights on the Apache website (e.g. https://hc.apache.org/httpcomponents-client-ga/ntlm.html). You can also contact Norconex to get hands-on assistance with your intranet.
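To check which schemes a server actually offers, you can save the raw 401 response headers (from Wireshark or from curl's verbose output) and list the WWW-Authenticate lines. The sketch below is an illustration only: the function name `offered_schemes` is made up, and it assumes one scheme per WWW-Authenticate header line, so it will not split several challenges packed into a single header.

```python
def offered_schemes(raw_headers):
    """List the auth schemes named in WWW-Authenticate header lines."""
    schemes = []
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        # Header names are case-insensitive; skip lines without a colon.
        if sep and name.strip().lower() == "www-authenticate":
            parts = value.split()
            if parts:
                # The scheme is the first token, e.g. "NTLM" or "Basic".
                schemes.append(parts[0])
    return schemes


response = (
    "HTTP/1.1 401 Unauthorized\r\n"
    "WWW-Authenticate: NTLM\r\n"
    "Content-Length: 0\r\n"
)
print(offered_schemes(response))
```

If the list contains only NTLM, an `authMethod` of `basic` has nothing to answer, which would be consistent with the Collector simply closing the connection.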
Even though Basic is enabled and working, I am happy with getting NTLM to work!
<httpClientFactory>
<authMethod>ntlm</authMethod>
<authUsername>[domain user]</authUsername>
<authPassword>[domain pwd]</authPassword>
<authRealm>[iis server url]</authRealm>
<authWorkstation>[collector ip]</authWorkstation>
<authDomain>[fqdn of domain]</authDomain>
</httpClientFactory>
Thanks,
Yay! Now off to figure out how to inject this into Solr! Any hints?
Bob
Great! For use with Solr, download and install the Solr Committer. Follow the install instructions on the site.
I am attempting to use the httpClientFactory in a complex-config.xml, and I am not seeing any attempt by the crawler to authenticate with the server. I have Wireshark running, and it is just doing the following:
crawler => site SYN
site => crawler SYN
crawler => site ACK
crawler => site GET
site => crawler 401
crawler => site ACK
crawler => site FIN,ACK
site => crawler FIN,ACK
crawler => site ACK
<httpClientFactory>