Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

httpClientFactory auth basic is not attempting to auth ? #307

Closed bobertb closed 7 years ago

bobertb commented 7 years ago

I am attempting to use the httpClientFactory in a complex-config.xml and I am not seeing any attempt by the crawler to authenticate at the server. I have Wireshark running, and it is just doing the following:

crawler => site SYN
site => crawler SYN
crawler => site ACK
crawler => site GET
site => crawler 401
crawler => site ACK
crawler => site FIN,ACK
site => crawler FIN,ACK
crawler => site ACK

<httpClientFactory">

basic user password domain.org 80 domain.org

[root@localhost collector]# ./collector-http.sh -a start -c libro/complex/complex-config.xml
INFO [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,caseSensitive=false]
INFO [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=http://libro\.coacd\.org/.*]
INFO [AbstractCollectorConfig] Configuration loaded: id=Norconex Complex Collector; logsDir=./libro-output/complex/logs; progressDir=./libro-output/complex/progress
INFO [JobSuite] JEF work directory is: ./libro-output/complex/progress
INFO [JobSuite] JEF log manager is : FileLogManager
INFO [JobSuite] JEF job status store is : FileJobStatusStore
INFO [AbstractCollector] Suite of 1 crawler jobs created.
INFO [JobSuite] Initialization...
INFO [JobSuite] Previous execution detected.
INFO [JobSuite] Backing up previous execution status and log files.
INFO [JobSuite] Starting execution.
INFO [AbstractCollector] Version: Norconex HTTP Collector 2.6.0 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Collector Core 1.6.0 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Importer 2.6.0 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex JEF 4.0.7 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Committer Core 2.0.5 (Norconex Inc.)
INFO [JobSuite] Running Norconex Complex Libro Test Page 1: BEGIN (Fri Nov 04 14:04:48 CDT 2016)
INFO [HttpCrawler] Norconex Complex Libro Test Page 1: RobotsTxt support: false
INFO [HttpCrawler] Norconex Complex Libro Test Page 1: RobotsMeta support: false
INFO [HttpCrawler] Norconex Complex Libro Test Page 1: Sitemap support: false
INFO [HttpCrawler] Norconex Complex Libro Test Page 1: Canonical links support: true
INFO [HttpCrawler] Norconex Complex Libro Test Page 1: User-Agent: APL-Norconex-Collector Agent
INFO [SitemapStore] Norconex Complex Libro Test Page 1: Initializing sitemap store...
INFO [SitemapStore] Norconex Complex Libro Test Page 1: Done initializing sitemap store.
INFO [HttpCrawler] 1 start URLs identified.
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] Norconex Complex Libro Test Page 1: Crawling references...
INFO [CrawlerEventManager] REJECTED_BAD_STATUS: http://libro.coacd.org/index.cfm
INFO [AbstractCrawler] Norconex Complex Libro Test Page 1: Deleting orphan references (if any)...
INFO [AbstractCrawler] Norconex Complex Libro Test Page 1: Deleted 0 orphan references...
INFO [AbstractCrawler] Norconex Complex Libro Test Page 1: Crawler finishing: committing documents.
INFO [AbstractCrawler] Norconex Complex Libro Test Page 1: 1 reference(s) processed.
INFO [CrawlerEventManager] CRAWLER_FINISHED
INFO [AbstractCrawler] Norconex Complex Libro Test Page 1: Crawler completed.
INFO [AbstractCrawler] Norconex Complex Libro Test Page 1: Crawler executed in 1 second.
INFO [JobSuite] Running Norconex Complex Libro Test Page 1: END (Fri Nov 04 14:04:48 CDT 2016)
[root@localhost collector]#

Any help appreciated.

Bob

essiembre commented 7 years ago

From your log, it seems the page was rejected for some reason (REJECTED_BAD_STATUS). You can change the log level in the log4j.properties to DEBUG to find out more information about the rejection. E.g.:

log4j.logger.CrawlerEvent.REJECTED_BAD_STATUS=DEBUG

I tried accessing http://libro.coacd.org/index.cfm myself but the site cannot be resolved. Is this a valid URL?

bobertb commented 7 years ago

It is rejected with 401 Unauthorized, and it is an internal intranet, so you will not be able to access it.
The Collector is not responding to the 401 with authentication at all, just ACK and FIN ACK. log4j.properties are all at DEBUG.

DEBUG [QueueReferenceStage] Queued for processing: http://libro.coacd.org/index.cfm
INFO [HttpCrawler] 1 start URLs identified.
INFO [CrawlerEventManager] CRAWLER_STARTED (Subject: com.norconex.collector.http.crawler.HttpCrawler@7ceb3185)
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: Crawling references...
DEBUG [AbstractCrawler] Libro.coacd.org Test Page 1: Crawler thread #1 started.
DEBUG [AbstractCrawler] Libro.coacd.org Test Page 1: Crawler thread #2 started.
DEBUG [AbstractCrawler] Libro.coacd.org Test Page 1: Processing reference: http://libro.coacd.org/index.cfm
DEBUG [GenericDocumentFetcher] Fetching document: http://libro.coacd.org/index.cfm
DEBUG [GenericDocumentFetcher] Encoded URI: http://libro.coacd.org/index.cfm
DEBUG [GenericDocumentFetcher] Unsupported HTTP Response: HTTP/1.1 401 Unauthorized
INFO [CrawlerEventManager] REJECTED_BAD_STATUS: http://libro.coacd.org/index.cfm (Subject: HttpFetchResponse [crawlState=BAD_STATUS, statusCode=401, reasonPhrase=Unauthorized])
DEBUG [Pipeline] Pipeline execution stopped at stage: com.norconex.collector.http.pipeline.importer.DocumentFetcherStage@73579aad
DEBUG [FileJobStatusStore] Created status file: /opt/collector/./libro-output/complex/progress/latest/status/Libro.coacd.org_32_Test_32_Page_32_1Libro.coacd.org_32_Test_32_Page_32_1.job
DEBUG [FileJobStatusStore] Writing status file: /opt/collector/./libro-output/complex/progress/latest/status/Libro.coacd.org_32_Test_32_Page_32_1Libro.coacd.org_32_Test_32_Page_32_1.job
DEBUG [FileJobStatusStore] Writing status file: /opt/collector/./libro-output/complex/progress/latest/status/Libro.coacd.org_32_Test_32_Page_32_1__Libro.coacd.org_32_Test_32_Page_32_1.job
DEBUG [AbstractCrawler] Libro.coacd.org Test Page 1: 00:00:00.215 to process: http://libro.coacd.org/index.cfm
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: Deleting orphan references (if any)...
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: Deleted 0 orphan references...
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: Crawler finishing: committing documents.
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: 1 reference(s) processed.
DEBUG [AbstractCrawler] Libro.coacd.org Test Page 1: Removing empty directories
INFO [CrawlerEventManager] CRAWLER_FINISHED (Subject: com.norconex.collector.http.crawler.HttpCrawler@7ceb3185)
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: Crawler completed.
INFO [AbstractCrawler] Libro.coacd.org Test Page 1: Crawler executed in 1 second.
DEBUG [FileJobStatusStore] Writing status file: /opt/collector/./libro-output/complex/progress/latest/status/Libro.coacd.org_32_Test_32_Page_32_1__Libro.coacd.org_32_Test_32_Page_32_1.job
INFO [JobSuite] Running Libro.coacd.org Test Page 1: END (Mon Nov 07 10:57:00 CST 2016)
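
When a DEBUG log gets this long, it can help to pull out only the rejection events and their status codes. The following is a minimal sketch, assuming the log line format shown in this thread; the function name `rejected_statuses` and the sample string are illustrative and not part of Norconex.

```python
import re

# Pattern for the REJECTED_BAD_STATUS event as it appears in the DEBUG log above.
# The format is taken from this thread only; other versions may log differently.
EVENT = re.compile(
    r"REJECTED_BAD_STATUS: (?P<url>\S+) "
    r"\(Subject: HttpFetchResponse \[crawlState=(?P<state>\w+), "
    r"statusCode=(?P<status>\d+), reasonPhrase=(?P<reason>[^\]]*)\]\)"
)


def rejected_statuses(log_text):
    """Return (url, status code, reason phrase) for each rejection event found."""
    return [
        (m["url"], int(m["status"]), m["reason"])
        for m in EVENT.finditer(log_text)
    ]


SAMPLE = (
    "INFO [CrawlerEventManager] REJECTED_BAD_STATUS: "
    "http://libro.coacd.org/index.cfm (Subject: HttpFetchResponse "
    "[crawlState=BAD_STATUS, statusCode=401, reasonPhrase=Unauthorized])"
)

print(rejected_statuses(SAMPLE))
```

Here the only rejection is the 401 on the start URL, which matches what the packet capture shows.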

Here is what Wireshark sees; the collector is not responding to the 401 Unauthorized correctly.

No. Time Source Destination Protocol Length Info
104953 745.165453 collector.ip libro.ip TCP 74 55882→80 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=862392138 TSecr=0 WS=128

Frame 104953: 74 bytes on wire (592 bits), 74 bytes captured (592 bits) on interface 0
Ethernet II, Src: CiscoInc_24:51:c1 (40:55:39:24:51:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34)
Internet Protocol Version 4, Src: collector.ip, Dst: libro.ip
Transmission Control Protocol, Src Port: 55882, Dst Port: 80, Seq: 0, Len: 0
    Source Port: 55882
    Destination Port: 80
    [Stream index: 1028]
    [TCP Segment Len: 0]
    Sequence number: 0 (relative sequence number)
    Acknowledgment number: 0
    Header Length: 40 bytes
    Flags: 0x002 (SYN)
    Window size value: 29200
    [Calculated window size: 29200]
    Checksum: 0x10a7 [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (20 bytes), Maximum segment size, SACK permitted, Timestamps, No-Operation (NOP), Window scale

No. Time Source Destination Protocol Length Info
104954 745.165570 libro.ip collector.ip TCP 74 80→55882 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1 TSval=2298792493 TSecr=862392138

Frame 104954: 74 bytes on wire (592 bits), 74 bytes captured (592 bits) on interface 0
Ethernet II, Src: Vmware_91:60:34 (00:50:56:91:60:34), Dst: All-HSRP-routers_de (00:00:0c:07:ac:de)
Internet Protocol Version 4, Src: libro.ip, Dst: collector.ip
Transmission Control Protocol, Src Port: 80, Dst Port: 55882, Seq: 0, Ack: 1, Len: 0
    Source Port: 80
    Destination Port: 55882
    [Stream index: 1028]
    [TCP Segment Len: 0]
    Sequence number: 0 (relative sequence number)
    Acknowledgment number: 1 (relative ack number)
    Header Length: 40 bytes
    Flags: 0x012 (SYN, ACK)
    Window size value: 8192
    [Calculated window size: 8192]
    Checksum: 0x3bab [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (20 bytes), Maximum segment size, No-Operation (NOP), Window scale, SACK permitted, Timestamps
    [SEQ/ACK analysis]

No. Time Source Destination Protocol Length Info
104955 745.166639 collector.ip libro.ip TCP 66 55882→80 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=862392139 TSecr=2298792493

Frame 104955: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0
Ethernet II, Src: CiscoInc_24:51:c1 (40:55:39:24:51:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34)
Internet Protocol Version 4, Src: collector.ip, Dst: libro.ip
Transmission Control Protocol, Src Port: 55882, Dst Port: 80, Seq: 1, Ack: 1, Len: 0
    Source Port: 55882
    Destination Port: 80
    [Stream index: 1028]
    [TCP Segment Len: 0]
    Sequence number: 1 (relative sequence number)
    Acknowledgment number: 1 (relative ack number)
    Header Length: 32 bytes
    Flags: 0x010 (ACK)
    Window size value: 229
    [Calculated window size: 29312]
    [Window size scaling factor: 128]
    Checksum: 0x8992 [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
    [SEQ/ACK analysis]

No. Time Source Destination Protocol Length Info
104956 745.179676 collector.ip libro.ip HTTP 213 GET /index.cfm HTTP/1.1

Frame 104956: 213 bytes on wire (1704 bits), 213 bytes captured (1704 bits) on interface 0
Ethernet II, Src: CiscoInc_24:51:c1 (40:55:39:24:51:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34)
Internet Protocol Version 4, Src: collector.ip, Dst: libro.ip
Transmission Control Protocol, Src Port: 55882, Dst Port: 80, Seq: 1, Ack: 1, Len: 147
    Source Port: 55882
    Destination Port: 80
    [Stream index: 1028]
    [TCP Segment Len: 147]
    Sequence number: 1 (relative sequence number)
    [Next sequence number: 148 (relative sequence number)]
    Acknowledgment number: 1 (relative ack number)
    Header Length: 32 bytes
    Flags: 0x018 (PSH, ACK)
    Window size value: 229
    [Calculated window size: 29312]
    [Window size scaling factor: 128]
    Checksum: 0xf637 [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
    [SEQ/ACK analysis]
Hypertext Transfer Protocol

No. Time Source Destination Protocol Length Info
104957 745.179958 libro.ip collector.ip HTTP 274 HTTP/1.1 401 Unauthorized

Frame 104957: 274 bytes on wire (2192 bits), 274 bytes captured (2192 bits) on interface 0
Ethernet II, Src: Vmware_91:60:34 (00:50:56:91:60:34), Dst: All-HSRP-routers_de (00:00:0c:07:ac:de)
Internet Protocol Version 4, Src: libro.ip, Dst: collector.ip
Transmission Control Protocol, Src Port: 80, Dst Port: 55882, Seq: 1, Ack: 148, Len: 208
    Source Port: 80
    Destination Port: 55882
    [Stream index: 1028]
    [TCP Segment Len: 208]
    Sequence number: 1 (relative sequence number)
    [Next sequence number: 209 (relative sequence number)]
    Acknowledgment number: 148 (relative ack number)
    Header Length: 32 bytes
    Flags: 0x018 (PSH, ACK)
    Window size value: 514
    [Calculated window size: 131584]
    [Window size scaling factor: 256]
    Checksum: 0x4569 [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
    [SEQ/ACK analysis]
Hypertext Transfer Protocol

No. Time Source Destination Protocol Length Info
104958 745.181166 collector.ip libro.ip TCP 66 55882→80 [ACK] Seq=148 Ack=209 Win=30336 Len=0 TSval=862392154 TSecr=2298792495

Frame 104958: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0
Ethernet II, Src: CiscoInc_24:51:c1 (40:55:39:24:51:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34)
Internet Protocol Version 4, Src: collector.ip, Dst: libro.ip
Transmission Control Protocol, Src Port: 55882, Dst Port: 80, Seq: 148, Ack: 209, Len: 0
    Source Port: 55882
    Destination Port: 80
    [Stream index: 1028]
    [TCP Segment Len: 0]
    Sequence number: 148 (relative sequence number)
    Acknowledgment number: 209 (relative ack number)
    Header Length: 32 bytes
    Flags: 0x010 (ACK)
    Window size value: 237
    [Calculated window size: 30336]
    [Window size scaling factor: 128]
    Checksum: 0x8816 [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
    [SEQ/ACK analysis]

No. Time Source Destination Protocol Length Info
104961 745.262477 collector.ip libro.ip TCP 66 55882→80 [FIN, ACK] Seq=148 Ack=209 Win=30336 Len=0 TSval=862392234 TSecr=2298792495

Frame 104961: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0
Ethernet II, Src: CiscoInc_24:51:c1 (40:55:39:24:51:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34)
Internet Protocol Version 4, Src: collector.ip, Dst: libro.ip
Transmission Control Protocol, Src Port: 55882, Dst Port: 80, Seq: 148, Ack: 209, Len: 0
    Source Port: 55882
    Destination Port: 80
    [Stream index: 1028]
    [TCP Segment Len: 0]
    Sequence number: 148 (relative sequence number)
    Acknowledgment number: 209 (relative ack number)
    Header Length: 32 bytes
    Flags: 0x011 (FIN, ACK)
    Window size value: 237
    [Calculated window size: 30336]
    [Window size scaling factor: 128]
    Checksum: 0x87c5 [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps

No. Time Source Destination Protocol Length Info
104962 745.262574 libro.ip collector.ip TCP 66 80→55882 [FIN, ACK] Seq=209 Ack=149 Win=131584 Len=0 TSval=2298792503 TSecr=862392234

Frame 104962: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0
Ethernet II, Src: Vmware_91:60:34 (00:50:56:91:60:34), Dst: All-HSRP-routers_de (00:00:0c:07:ac:de)
Internet Protocol Version 4, Src: libro.ip, Dst: collector.ip
Transmission Control Protocol, Src Port: 80, Dst Port: 55882, Seq: 209, Ack: 149, Len: 0
    Source Port: 80
    Destination Port: 55882
    [Stream index: 1028]
    [TCP Segment Len: 0]
    Sequence number: 209 (relative sequence number)
    Acknowledgment number: 149 (relative ack number)
    Header Length: 32 bytes
    Flags: 0x011 (FIN, ACK)
    Window size value: 514
    [Calculated window size: 131584]
    [Window size scaling factor: 256]
    Checksum: 0x86a7 [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
    [SEQ/ACK analysis]

No. Time Source Destination Protocol Length Info
104963 745.263987 collector.ip libro.ip TCP 66 55882→80 [ACK] Seq=149 Ack=210 Win=30336 Len=0 TSval=862392236 TSecr=2298792503

Frame 104963: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0
Ethernet II, Src: CiscoInc_24:51:c1 (40:55:39:24:51:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34)
Internet Protocol Version 4, Src: collector.ip, Dst: libro.ip
Transmission Control Protocol, Src Port: 55882, Dst Port: 80, Seq: 149, Ack: 210, Len: 0
    Source Port: 55882
    Destination Port: 80
    [Stream index: 1028]
    [TCP Segment Len: 0]
    Sequence number: 149 (relative sequence number)
    Acknowledgment number: 210 (relative ack number)
    Header Length: 32 bytes
    Flags: 0x010 (ACK)
    Window size value: 237
    [Calculated window size: 30336]
    [Window size scaling factor: 128]
    Checksum: 0x87ba [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
    [SEQ/ACK analysis]

Thanks, Bob

essiembre commented 7 years ago

Is your site really using BASIC authentication? That is, the browser-supplied popup? I just tried this method on a test site and it works fine. Not all sites require you to specify the host/name/port so you may want to try without them to see if it makes a difference. So just try with this:

<httpClientFactory>
  <authMethod>basic</authMethod>
  <authUsername>user</authUsername>
  <authPassword>password</authPassword>
</httpClientFactory>
bobertb commented 7 years ago

Yes, it uses Basic and/or NTLM, which I can verify via wget and curl. Yes, the browser will prompt you for a login. I used your settings above. Same result. The collector does not respond to the HTTP/1.1 401 Unauthorized packet it gets from the web server at all, other than with ACK and FIN/ACK. It does not negotiate or respond correctly. Here are the SYN, GET, and 401 packets. I stripped out all the ACKs, but it is clearly not sending any auth info in the GET.

No. Time Source Destination Protocol Length Info
15928 17.675463 COLLECTOR.IP LIBRO.IP TCP 74 55940→80 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=877900418 TSecr=0 WS=128

Frame 15928: 74 bytes on wire (592 bits), 74 bytes captured (592 bits) on interface 0
Ethernet II, Src: CiscoInc_23:91:c1 (40:55:39:23:91:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34)
Internet Protocol Version 4, Src: COLLECTOR.IP, Dst: LIBRO.IP
Transmission Control Protocol, Src Port: 55940, Dst Port: 80, Seq: 0, Len: 0
    Source Port: 55940
    Destination Port: 80
    [Stream index: 120]
    [TCP Segment Len: 0]
    Sequence number: 0 (relative sequence number)
    Acknowledgment number: 0
    Header Length: 40 bytes
    Flags: 0x002 (SYN)
        000. .... .... = Reserved: Not set
        ...0 .... .... = Nonce: Not set
        .... 0... .... = Congestion Window Reduced (CWR): Not set
        .... .0.. .... = ECN-Echo: Not set
        .... ..0. .... = Urgent: Not set
        .... ...0 .... = Acknowledgment: Not set
        .... .... 0... = Push: Not set
        .... .... .0.. = Reset: Not set
        .... .... ..1. = Syn: Set
            [Expert Info (Chat/Sequence): Connection establish request (SYN): server port 80]
                [Connection establish request (SYN): server port 80]
                [Severity level: Chat]
                [Group: Sequence]
        .... .... ...0 = Fin: Not set
        [TCP Flags: ··········S·]
    Window size value: 29200
    [Calculated window size: 29200]
    Checksum: 0xf8be [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (20 bytes), Maximum segment size, SACK permitted, Timestamps, No-Operation (NOP), Window scale
        Maximum segment size: 1460 bytes
        TCP SACK Permitted Option: True
        Timestamps: TSval 877900418, TSecr 0
        No-Operation (NOP)
        Window scale: 7 (multiply by 128)
            Kind: Window Scale (3)
            Length: 3
            Shift count: 7
            [Multiplier: 128]

No. Time Source Destination Protocol Length Info
15931 17.683053 COLLECTOR.IP LIBRO.IP HTTP 213 GET /index.cfm HTTP/1.1

Frame 15931: 213 bytes on wire (1704 bits), 213 bytes captured (1704 bits) on interface 0
Ethernet II, Src: CiscoInc_23:91:c1 (40:55:39:23:91:c1), Dst: Vmware_91:60:34 (00:50:56:91:60:34)
Internet Protocol Version 4, Src: COLLECTOR.IP, Dst: LIBRO.IP
Transmission Control Protocol, Src Port: 55940, Dst Port: 80, Seq: 1, Ack: 1, Len: 147
    Source Port: 55940
    Destination Port: 80
    [Stream index: 120]
    [TCP Segment Len: 147]
    Sequence number: 1 (relative sequence number)
    [Next sequence number: 148 (relative sequence number)]
    Acknowledgment number: 1 (relative ack number)
    Header Length: 32 bytes
    Flags: 0x018 (PSH, ACK)
        000. .... .... = Reserved: Not set
        ...0 .... .... = Nonce: Not set
        .... 0... .... = Congestion Window Reduced (CWR): Not set
        .... .0.. .... = ECN-Echo: Not set
        .... ..0. .... = Urgent: Not set
        .... ...1 .... = Acknowledgment: Set
        .... .... 1... = Push: Set
        .... .... .0.. = Reset: Not set
        .... .... ..0. = Syn: Not set
        .... .... ...0 = Fin: Not set
        [TCP Flags: ·······AP···]
    Window size value: 229
    [Calculated window size: 29312]
    [Window size scaling factor: 128]
    Checksum: 0xbaae [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
        No-Operation (NOP)
        No-Operation (NOP)
        Timestamps: TSval 877900425, TSecr 2300343256
    [SEQ/ACK analysis]
Hypertext Transfer Protocol
    GET /index.cfm HTTP/1.1\r\n
        [Expert Info (Chat/Sequence): GET /index.cfm HTTP/1.1\r\n]
            [GET /index.cfm HTTP/1.1\r\n]
            [Severity level: Chat]
            [Group: Sequence]
        Request Method: GET
        Request URI: /index.cfm
        Request Version: HTTP/1.1
    Host: libro.coacd.org\r\n
    Connection: Keep-Alive\r\n
    User-Agent: APL-Norconex-Collector Agent\r\n
    Accept-Encoding: gzip,deflate\r\n
    \r\n
    [Full request URI: http://libro.coacd.org/index.cfm]
    [HTTP request 1/1]
    [Response in frame: 15932]

No. Time Source Destination Protocol Length Info
15932 17.694951 LIBRO.IP COLLECTOR.IP HTTP 274 HTTP/1.1 401 Unauthorized

Frame 15932: 274 bytes on wire (2192 bits), 274 bytes captured (2192 bits) on interface 0
Ethernet II, Src: Vmware_91:60:34 (00:50:56:91:60:34), Dst: All-HSRP-routers_de (00:00:0c:07:ac:de)
Internet Protocol Version 4, Src: LIBRO.IP, Dst: COLLECTOR.IP
Transmission Control Protocol, Src Port: 80, Dst Port: 55940, Seq: 1, Ack: 148, Len: 208
    Source Port: 80
    Destination Port: 55940
    [Stream index: 120]
    [TCP Segment Len: 208]
    Sequence number: 1 (relative sequence number)
    [Next sequence number: 209 (relative sequence number)]
    Acknowledgment number: 148 (relative ack number)
    Header Length: 32 bytes
    Flags: 0x018 (PSH, ACK)
        000. .... .... = Reserved: Not set
        ...0 .... .... = Nonce: Not set
        .... 0... .... = Congestion Window Reduced (CWR): Not set
        .... .0.. .... = ECN-Echo: Not set
        .... ..0. .... = Urgent: Not set
        .... ...1 .... = Acknowledgment: Set
        .... .... 1... = Push: Set
        .... .... .0.. = Reset: Not set
        .... .... ..0. = Syn: Not set
        .... .... ...0 = Fin: Not set
        [TCP Flags: ·······AP···]
    Window size value: 514
    [Calculated window size: 131584]
    [Window size scaling factor: 256]
    Checksum: 0x0adf [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
        No-Operation (NOP)
        No-Operation (NOP)
        Timestamps: TSval 2300343258, TSecr 877900425
    [SEQ/ACK analysis]
Hypertext Transfer Protocol
    HTTP/1.1 401 Unauthorized\r\n
        [Expert Info (Chat/Sequence): HTTP/1.1 401 Unauthorized\r\n]
            [HTTP/1.1 401 Unauthorized\r\n]
            [Severity level: Chat]
            [Group: Sequence]
        Request Version: HTTP/1.1
        Status Code: 401
        Response Phrase: Unauthorized
    Server: Microsoft-IIS/7.5\r\n
    WWW-Authenticate: NTLM\r\n
    WWW-Authenticate: Basic realm="libro.coacd.org"\r\n
    X-Powered-By: ASP.NET\r\n
    Date: Mon, 07 Nov 2016 21:15:29 GMT\r\n
    Content-Length: 0\r\n
    \r\n
    [HTTP response 1/1]
    [Time since request: 0.011898000 seconds]
    [Request in frame: 15931]
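
For comparison with the captured GET, which carries no Authorization header, a request answering a Basic challenge would include a header built as in the sketch below. This is only an illustration of the Basic scheme (base64 of `username:password`); the helper name `basic_authorization_header` is hypothetical, and `user` / `password` are the placeholders used in this thread, not real credentials.

```python
import base64


def basic_authorization_header(username, password):
    """Build the request header a client sends in answer to a Basic challenge."""
    # Basic credentials are "username:password", base64-encoded.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Authorization: Basic " + token.decode("ascii")


# Placeholder credentials taken from the config example in this thread.
print(basic_authorization_header("user", "password"))
# prints "Authorization: Basic dXNlcjpwYXNzd29yZA=="
```

A line of this shape is what one would look for in the request headers of a retried GET in the capture.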
essiembre commented 7 years ago

From what you pasted, it looks like you are not using BASIC authentication, but rather NTLM:

WWW-Authenticate: NTLM\r\n

As described in the GenericHttpClientFactory documentation, NTLM requires that you also specify these two:

      <authWorkstation>...</authWorkstation>
      <authDomain>...</authDomain>

I hope this will work for you. Without being able to reproduce the problem, it is hard to help further. HTTP Collector uses Apache HttpClient to perform authentication. You may find more insights on the Apache website (e.g. https://hc.apache.org/httpcomponents-client-ga/ntlm.html). You can also contact Norconex to get hands-on assistance on your intranet.
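
To see which schemes a server offers, the `WWW-Authenticate` lines of the 401 response can be listed directly. The following is a minimal sketch using only the Python standard library; the header bytes are copied from the capture in this thread, and the function name `offered_auth_schemes` is illustrative.

```python
import io
from http.client import parse_headers

# Response headers copied from the captured 401 (status line omitted).
RAW_HEADERS = (
    b"Server: Microsoft-IIS/7.5\r\n"
    b"WWW-Authenticate: NTLM\r\n"
    b'WWW-Authenticate: Basic realm="libro.coacd.org"\r\n'
    b"X-Powered-By: ASP.NET\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)


def offered_auth_schemes(raw_headers):
    """Return the scheme name of each WWW-Authenticate challenge, in order."""
    message = parse_headers(io.BytesIO(raw_headers))
    challenges = message.get_all("WWW-Authenticate") or []
    # The scheme is the first whitespace-delimited token of each challenge.
    return [challenge.split(None, 1)[0] for challenge in challenges]


print(offered_auth_schemes(RAW_HEADERS))
# prints ['NTLM', 'Basic']
```

For this capture the server lists NTLM first and Basic second, so both schemes are on offer.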

bobertb commented 7 years ago

Even though Basic is enabled and working, I am happy with getting NTLM to work!!

<httpClientFactory>
  <authMethod>ntlm</authMethod>
  <authUsername>[domain user]</authUsername>
  <authPassword>[domain pwd]</authPassword>
  <authRealm>[iis server url]</authRealm>
  <authWorkstation>[collector ip]</authWorkstation>
  <authDomain>[fqdn of domain]</authDomain>
</httpClientFactory>

Thanks,

Yay!! Now off to figure out how to inject this into Solr!!! Any hints?

Bob

essiembre commented 7 years ago

Great! For use with Solr, download and install the Solr Committer. Follow the install instructions on the site.