Closed beviah closed 5 years ago
I think the issue is related to HTTPS: the first answer you get is exactly what is returned when you request the page via HTTP instead of HTTPS.
$ curl -I -X GET 'http://www.analyticalcannabis.com/news'
HTTP/1.1 301 Moved Permanently
Cache-Control: private
Content-Type: text/html; charset=utf-8
Location: https://www.analyticalcannabis.com/news
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
X-XSS-Protection: 1
AMP-Access-Control-Allow-Source-Origin: http://localhost:62625
X-Powered-By: ASP.NET
Date: Thu, 18 Apr 2019 06:22:45 GMT
Connection: close
Content-Length: 156
Note that this is quite common (many sites automatically redirect insecure URLs to secure ones). What is strange to me is that the WARC-Target-URI is actually HTTPS, so if the crawler correctly requests the HTTPS page, I'm not sure why that content is returned. What I'm afraid of is that the protocol part is stripped from the URL and then a plain HTTP request is made…
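The symptom described here, a 301 whose Location is the very URL that was supposedly requested, can be checked mechanically. The following is only a minimal sketch, assuming the WARC-Target-URI and the Location header have already been extracted as strings; the function name `is_self_redirect` is hypothetical and is not part of BUbiNG.

```python
from urllib.parse import urljoin, urlsplit

# Default ports, used so that "https://host/" and "https://host:443/" compare equal.
DEFAULT_PORTS = {"http": 80, "https": 443}


def _key(url: str) -> tuple:
    """Reduce a URL to the parts that identify the requested resource."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port, parts.path or "/", parts.query)


def is_self_redirect(target_uri: str, location: str) -> bool:
    """Return True if a redirect points back at the URL that was requested."""
    # Location may be relative, so resolve it against the requested URL first.
    resolved = urljoin(target_uri, location)
    return _key(target_uri) == _key(resolved)
```

A server that redirects an HTTPS URL to itself is unusual, so a `True` result for a stored 3xx response suggests that the request actually sent differed from the URL recorded (for example in scheme or port).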
My memory is quite weak; @vigna can probably shed some light on how the crawler handles HTTPS.
That was my first suspicion, but I checked and HTTPS is in the original seeds, so maybe some (de)normalization or reconstruction of URLs is happening. Checking the code, though, I couldn't find anything that changes the scheme.
It reminds me of a bug that I thought had been solved a long time ago (but you should check).
IIRC the problem was the port used for the actual query, which wasn't 443 but 80. You can check by enabling the logs in org.apache.http.*; at least you'll see the actual requests (what is stored in the WARC is not necessarily the request).
So basically in BUbiNG's code, the request is reconstructed from scheme+authority and path + the cached IP of the host. At some point the host wasn't even transmitted in the request, so there was a bug for multi-host servers.
Guillaume
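To illustrate the kind of reconstruction described above, here is a minimal sketch (not BUbiNG's actual code) of deriving connection parameters from a URL. It shows the two points mentioned: when the URL has no explicit port, the port must follow the scheme, and the Host header must still carry the host name even though the socket connects to a cached IP. The function `connection_params` and its `cached_ip` argument are hypothetical names chosen for illustration.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def connection_params(url: str, cached_ip: str) -> dict:
    """Derive what is needed to issue a request for `url` via a cached IP."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS or parts.hostname is None:
        raise ValueError(f"unsupported or malformed URL: {url!r}")

    # The port comes from the URL if present, otherwise from the scheme.
    # Always falling back to 80 here would reproduce the bug discussed above.
    default_port = DEFAULT_PORTS[parts.scheme]
    port = parts.port if parts.port is not None else default_port

    # The Host header names the site, not the IP; the port is included only
    # when it is not the default for the scheme.
    host_header = parts.hostname if port == default_port else f"{parts.hostname}:{port}"

    request_target = parts.path or "/"
    if parts.query:
        request_target += "?" + parts.query

    return {
        "connect_to": (cached_ip, port),
        "use_tls": parts.scheme == "https",
        "host_header": host_header,
        "request_target": request_target,
    }
```

For `https://www.analyticalcannabis.com/news` this yields port 443 with TLS enabled, which matches the behaviour reported later in the thread for the fixed version.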
So, it's a real issue. Even with https, the request is actually issued as http. Yes, it reminds me of that bug, which I thought we had squashed :(.
No, sorry. I was starting the crawl with an old jar. With the current version there's no such problem: the port used is 443, as it should be, and the page is downloaded.
Oooook. I completely forgot that we had a Maven entry. Lately I've been updating just the download section of the LAW website. 🙈
My guess is, you're using the Maven version.
Can you try the 0.9.14 version from law.di.unimi.it? I'll update Maven ASAP.
I used binaries from here: http://law.di.unimi.it/software/download/
Maybe related: https://github.com/LAW-Unimi/BUbiNG/blob/master/src/it/unimi/di/law/bubing/frontier/Frontier.java#L457
At the same time, from what I've seen in the code, final redirect destinations are not added to the frontier. Is there any specific reason behind this decision?
And to clarify again, redirects happen not only on HTTPS pages, but also on legitimate HTTP pages that are accessible through wget.
So... the solution is actually simpler than it seems. Since September 2017 (!) I completely forgot to push out new versions as bug fixes (such as Guillaume's) poured in. It's really embarrassing.
I pushed out 0.9.15 to law.di.unimi.it and Maven. Sorry.
BTW, final redirect destinations are added to the frontier.
Thank you guys for the rapid responses :)
I explored a few possible issues I could think of:
But the problem seems to appear in all variations: I have found URLs of all types in both DNS scenarios, and with and without JCE (which seems to be enabled by default in newer JDKs).
I.e.:
WARC/1.0
WARC-Record-ID:
WARC-Date: 2019-04-18T03:36:37Z
WARC-Target-URI: https://www.analyticalcannabis.com/news
WARC-Type: response
Content-Type: application/http;msgtype=response
WARC-Payload-Digest: bubing:9486e8cdc971e26d6e3c042d0b35f3bb
BUbiNG-Guessed-Charset: utf-8
Content-Length: 546
HTTP/1.1 301 Moved Permanently
Cache-Control: private
Content-Type: text/html; charset=utf-8
Location: https://www.analyticalcannabis.com/news
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
X-XSS-Protection: 1
AMP-Access-Control-Allow-Source-Origin: http://localhost:62625
X-Powered-By: ASP.NET
Date: Thu, 18 Apr 2019 03:36:37 GMT
Connection: close
Content-Length: 156
Object moved to here.
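For checking many records like this one, the three fields that matter (target URI, HTTP status, Location) can be pulled out with a few lines of code. This is only a rough sketch: it assumes a single uncompressed response record available as text with one header per line, as in a raw WARC file, and it does not distinguish headers from body lines that happen to look like headers. The function name `summarize_response_record` is hypothetical.

```python
def summarize_response_record(record: str) -> dict:
    """Extract WARC-Target-URI, HTTP status code, and Location from a record."""
    target = status = location = None
    for raw_line in record.splitlines():
        line = raw_line.strip()
        # The HTTP status line looks like "HTTP/1.1 301 Moved Permanently".
        if status is None and line.startswith("HTTP/"):
            fields = line.split()
            if len(fields) >= 2 and fields[1].isdigit():
                status = int(fields[1])
            continue
        # Header lines are "Name: value"; split on the first colon only,
        # so that URLs in the value are left intact.
        name, sep, value = line.partition(":")
        if not sep:
            continue
        name = name.strip().lower()
        if name == "warc-target-uri" and target is None:
            target = value.strip()
        elif name == "location" and location is None:
            location = value.strip()
    return {"target": target, "status": status, "location": location}
```

For the record above, the result has status 301 and identical `target` and `location` values, which is the pattern that points at the request having been sent differently from what the WARC header records.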