cubiclesoft / ultimate-web-scraper

A PHP library/toolkit designed to handle all of your web scraping needs under a MIT or LGPL license. Also has web server and WebSocket server classes for building custom servers.

error when trying to access https, problem with SSL connection #38

Closed: syltrinket closed this issue 1 year ago

syltrinket commented 1 year ago

At my organization, we work a lot with prisoners in Texas, and because they are transferred fairly frequently, it helps to be able to click a button in the contract tracking software we use to retrieve a person's current location.

I used to do this with cURL, but something changed on the site that now prevents a connection. This is an example of a random person's page: https://inmate.tdcj.texas.gov/InmateSearch/viewDetail.action?sid=08195689. We just grab the "current facility" name. However, now I can't even access the main page with cURL: https://www.tdcj.texas.gov/.

I'm using PHP 7.4, and this is the code I am trying with ultimate-web-scraper, from the examples:

<?php
  define('__MEMROOT__', 'xxxxxxxxxxxxxxxxxxxxx/'); // get the root to the members directory
  require_once(__MEMROOT__ . "uws/http.php");
  require_once(__MEMROOT__ . "uws/web_browser.php");

  $sslopts = HTTP::GetSafeSSLOpts(true, "modern"); // not sure what else might need to be set
  $sslopts["capture_peer_cert"] = true;
  function CertCheckCallback($type, $cert, $opts) {
    var_dump($type);
    var_dump($cert);
    return true;
  }

  //$url = "https://inmate.tdcj.texas.gov/InmateSearch/viewDetail.action?sid=08195689"; // sample person's page
  $url = "https://www.tdcj.texas.gov/"; // home page
  $web = new WebBrowser();
  $options = array(
    "sslopts" => $sslopts,
    "peer_cert_callback" => "CertCheckCallback",
    "peer_cert_callback_opts" => false,
    "postvars" => array(            // I know this is not right, but could not get to function without this;
      "id" => 12345,                   //  have not looked at how POST options work yet
      "firstname" => "John",      // the pages we access do not require any login info
      "lastname" => "Smith"
    )
  );
  $result = $web->Process($url, $options);

  if (!$result["success"]) {
    echo "Error retrieving URL.  " . $result["error"] . "\n";
    exit();
  }

  if ($result["response"]["code"] != 200) {
    echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
    exit();
  }

  // Get the final URL after redirects.
  $baseurl = $result["url"];

  echo $baseurl;
  return;
?>

The response I get is: "Error retrieving URL. Unable to retrieve content. Unable to establish a connection to 'ssl://www.tdcj.texas.gov:443'."

I'm sorry if this is a really basic question, but I'm not very familiar with scraping or how to address problems like this. I have little need to collect masses of data, just small tasks that make things easier for our volunteers.

thanks for any pointers you might be able to provide!

cubiclesoft commented 1 year ago

According to SSL Labs, that web server is presenting an incomplete certificate chain:

This server's certificate chain is incomplete. Grade capped to B.

So what is happening is that your web browser sees an incomplete certificate chain, downloads the missing intermediate, and then validates the chain up to the root cert. This library, cURL, and PHP itself won't do that. They actually have a misconfigured web server, which might be intentional (i.e. to prevent web scrapers from functioning).

The quick fix is to disable peer verification of the certificate chain:

<?php
    $rootpath = str_replace("\\", "/", dirname(__FILE__));

    require_once $rootpath . "/support/http.php";
    require_once $rootpath . "/support/web_browser.php";

    $url = "https://www.tdcj.texas.gov/";
    $web = new WebBrowser();

    $options = array(
        "sslopts" => HTTP::GetSafeSSLOpts(),
        "debug" => true
    );

    // This is, in general, a bad idea.
    $options["sslopts"]["verify_peer"] = false;

    $result = $web->Process($url, $options);
    var_dump($result);
?>

To fix this the "right way," you need to download both the root and the intermediate certificates for the domain as a single PEM file and then reference the PEM file in your code.

I've uploaded the appropriate file here as exported from Firefox: tdcj-texas-gov-chain.zip

<?php
    $rootpath = str_replace("\\", "/", dirname(__FILE__));

    require_once $rootpath . "/support/http.php";
    require_once $rootpath . "/support/web_browser.php";

    $url = "https://www.tdcj.texas.gov/";
    $web = new WebBrowser();

    $options = array(
        "sslopts" => HTTP::GetSafeSSLOpts($rootpath . "/tdcj-texas-gov-chain.pem"),
        "debug" => true
    );

    $result = $web->Process($url, $options);
    var_dump($result);
?>
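If you want to see exactly what chain the server is actually sending, something like this rough sketch using plain PHP streams (not part of the toolkit) will capture and dump the presented chain as PEM. If the intermediate never appears in the output, you know you have to supply it yourself:

<?php
    // Connect with verification disabled purely to capture whatever chain
    // the server actually presents, then dump each certificate as PEM.
    $host = "www.tdcj.texas.gov";

    $context = stream_context_create(array(
        "ssl" => array(
            "capture_peer_cert_chain" => true,
            "verify_peer" => false,
            "verify_peer_name" => false
        )
    ));

    $fp = stream_socket_client("ssl://" . $host . ":443", $errno, $errstr, 10, STREAM_CLIENT_CONNECT, $context);
    if ($fp === false)  exit("Connect failed:  [" . $errno . "] " . $errstr . "\n");

    $params = stream_context_get_params($context);
    foreach ($params["options"]["ssl"]["peer_certificate_chain"] as $cert)
    {
        openssl_x509_export($cert, $pem);

        echo $pem;
    }
?>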

I was stumped for a while on this one. Diagnosing SSL/TLS issues is difficult. Turning on debugging will usually dump some useful error output. SSL Labs is useful for certain problems. But it is safe to say that it is the fault of whoever is managing that system. Law enforcement and judicial entities are notorious for not being particularly transparent. They put up digital walls to block access to important information. In most cases, those digital walls actually directly violate the written law. This could simply be a server misconfiguration - it happens from time to time - but given the tendencies of judicial entities, it is just as likely an attempt to be intentionally malicious to outside entities, especially media outlets.

You can do similar things with cURL (disable peer verification, use alternate certificate chains, etc.). I prefer my own toolkit as it is the Swiss army knife of web scraping software and implements the HTTP protocol directly with lots of debugging support. PHP's built-in facilities for diagnosing SSL/TLS issues are a bit weak regardless of what you use.
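For comparison, a rough cURL equivalent might look like this (untested sketch; CURLOPT_CAINFO points at the same combined root + intermediate PEM):

<?php
    $ch = curl_init("https://www.tdcj.texas.gov/");

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CAINFO, dirname(__FILE__) . "/tdcj-texas-gov-chain.pem");

    // Or the quick-and-dirty route (generally a bad idea):
    // curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

    $data = curl_exec($ch);
    if ($data === false)  echo "cURL error:  " . curl_error($ch) . "\n";

    curl_close($ch);
?>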

syltrinket commented 1 year ago

First, thanks for the quick response! And I agree it is likely the prison system intentionally munging things up to make it difficult to get data. They do so many things to screw with monitoring efforts and advocacy.

Embarrassing typo: we don't have any "contracts" with prisons, only inside contacts.

However, I'm still having the same issue: "unable to establish a connection." If this worked for you but not for me, does that mean it could be something related to my setup? We are running on a HostGator shared business plan. I can also make some changes to PHP if that might be an issue.

I first tried your PEM file, then used your info to learn how (I think) to build my own (thanks, as I had no idea how to do that) and tried different iterations. I built mine from Chrome (Version 114.0.5735.248, 64-bit) and tried different variations (with and without the site cert, and in root=>intermediate=>site order and reversed), but all gave the same "unable to establish connection" error. I also tried Firefox Portable and got the same message.

This is the code I'm using (based on what was posted above):

<?php
  define('__MEMROOT__', 'xxxxxxxxxxxxxxxx/');
  require_once(__MEMROOT__ . "uws/http.php");
  require_once(__MEMROOT__ . "uws/web_browser.php");

  $url = "https://inmate.tdcj.texas.gov/InmateSearch/viewDetail.action?sid=04323857";
  //$url = "https://www.tdcj.texas.gov/";
  //$url = "https://google.com/";

  $web = new WebBrowser();

  $options = array(
    "sslopts" => HTTP::GetSafeSSLOpts(__MEMROOT__ . "uws/tdcj-texas-gov-chain.pem"), // this is the provided PEM file
    //"sslopts" => HTTP::GetSafeSSLOpts(__MEMROOT__ . "uws/inmate-tdcj-chain.pem"), // this is the PEM file I made
    "debug" => true
  );

  //$options["sslopts"]["verify_peer"] = false;

  $result = $web->Process($url, $options);
  var_dump($result);
  return;  
?>

And this is the output, formatted to be easier to read:

array(5) { 
  ["success"]=> bool(false) 
  ["error"]=> string(99) "Unable to retrieve content. Unable to establish a connection to 'ssl://inmate.tdcj.texas.gov:443'." 
  ["info"]=> array(10) {
    ["success"]=> bool(false) 
    ["error"]=> string(70) "Unable to establish a connection to 'ssl://inmate.tdcj.texas.gov:443'." 
    ["info"]=> string(26) "Connection timed out (110)" 
    ["errorcode"]=> string(14) "connect_failed" 
    ["url"]=> string(73) "https://inmate.tdcj.texas.gov/InmateSearch/viewDetail.action?sid=04323857" 
    ["options"]=> array(4) { 
      ["headers"]=> array(4) { 
        ["Accept"]=> string(37) "text/html, application/xhtml+xml, */*" 
        ["Accept-Language"]=> string(14) "en-us,en;q=0.5" 
        ["Cache-Control"]=> string(9) "max-age=0" 
        ["User-Agent"]=> string(78) "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0" 
      } 
      ["sslopts"]=> array(7) { 
        ["ciphers"]=> string(587) "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA256:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS" 
        ["disable_compression"]=> bool(true) 
        ["allow_self_signed"]=> bool(false) 
        ["verify_peer"]=> bool(true) 
        ["verify_depth"]=> int(5) 
        ["SNI_enabled"]=> bool(true) 
        ["cafile"]=> string(56) "/xxxxxxxxxxxx/uws/tdcj-texas-gov-chain.pem" 
      } 
      ["debug"]=> bool(true) 
      ["streamtimeout"]=> int(300) 
    } 
    ["firstreqts"]=> float(1690051470.7274) 
    ["numredirects"]=> int(0) 
    ["redirectts"]=> float(1690051470.7274) 
    ["totalrawsendsize"]=> int(0) 
  } 
  ["state"]=> array(18) { 
    ["async"]=> bool(false) 
    ["startts"]=> float(1690051470.7274) 
    ["redirectts"]=> float(1690051470.7274) 
    ["timeout"]=> bool(false) 
    ["tempoptions"]=> array(4) {
      ["sslopts"]=> array(7) { 
        ["ciphers"]=> string(587) "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA256:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS" 
        ["disable_compression"]=> bool(true) 
        ["allow_self_signed"]=> bool(false) 
        ["verify_peer"]=> bool(true) 
        ["verify_depth"]=> int(5) 
        ["SNI_enabled"]=> bool(true) 
        ["cafile"]=> string(56) "/xxxxxxxxxxxxxx/uws/tdcj-texas-gov-chain.pem" 
      } 
      ["debug"]=> bool(true) 
      ["streamtimeout"]=> int(300) 
      ["headers"]=> array(0) { } 
    } 
    ["httpopts"]=> array(1) { 
      ["headers"]=> array(0) { } 
    } 
    ["numfollow"]=> int(20) 
    ["numredirects"]=> int(0) 
    ["totalrawsendsize"]=> int(0) 
    ["profile"]=> string(4) "auto" 
    ["url"]=> string(73) "https://inmate.tdcj.texas.gov/InmateSearch/viewDetail.action?sid=04323857" 
    ["urlinfo"]=> array(11) { 
      ["scheme"]=> string(5) "https" 
      ["authority"]=> string(21) "inmate.tdcj.texas.gov" 
      ["login"]=> string(0) "" 
      ["loginusername"]=> string(0) "" 
      ["loginpassword"]=> string(0) "" 
      ["host"]=> string(21) "inmate.tdcj.texas.gov" 
      ["port"]=> string(0) "" 
      ["path"]=> string(31) "/InmateSearch/viewDetail.action" 
      ["query"]=> string(12) "sid=04323857" 
      ["queryvars"]=> array(1) { 
        ["sid"]=> array(1) { 
          [0]=> string(8) "04323857" 
        } 
      } 
      ["fragment"]=> string(0) "" 
    } 
    ["state"]=> string(10) "initialize" 
    ["httpstate"]=> bool(false) 
    ["result"]=> bool(false) 
    ["dothost"]=> string(22) ".inmate.tdcj.texas.gov" 
    ["cookiepath"]=> string(14) "/InmateSearch/" 
    ["options"]=> array(4) { 
      ["headers"]=> array(4) { 
        ["Accept"]=> string(37) "text/html, application/xhtml+xml, */*" 
        ["Accept-Language"]=> string(14) "en-us,en;q=0.5" 
        ["Cache-Control"]=> string(9) "max-age=0" 
        ["User-Agent"]=> string(78) "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0" 
      } 
      ["sslopts"]=> array(7) { 
        ["ciphers"]=> string(587) "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA256:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS" 
        ["disable_compression"]=> bool(true) 
        ["allow_self_signed"]=> bool(false) 
        ["verify_peer"]=> bool(true) 
        ["verify_depth"]=> int(5) 
        ["SNI_enabled"]=> bool(true) 
        ["cafile"]=> string(56) "/xxxxxxxxxxxxxx/uws/tdcj-texas-gov-chain.pem" 
      } 
      ["debug"]=> bool(true) 
      ["streamtimeout"]=> int(300) 
    } 
  } 
  ["errorcode"]=> string(15) "retrievewebpage" 
}

Sorry if it's something simple. I tried for a few hours to figure out what the issue might be and just was not getting anywhere. I did enjoy learning a bit about certs!

thanks!

cubiclesoft commented 1 year ago

You are getting a "connection timed out" response from the library. That most likely means the source IP address at HostGator is being blocked by a firewall somewhere that is choosing to drop TCP SYN packets instead of refusing the connection. Either the remote network (inmate.tdcj.texas.gov) is blocking the inbound connection from known IP ranges OR HostGator is blocking the outbound connection. Try running your scraper code locally on your own computer, since dynamic IPs rarely get blocked. You could also try a different hosting environment like DigitalOcean, OVH, or AWS, but I'd wager that they have put all major global datacenters on a blacklist and will only whitelist IPs on a case-by-case basis. It could be something else entirely, though. I've seen badly configured servers that don't handle certain cipher suites correctly (i.e. the connection is accepted but then times out during the handshake).
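One quick way to tell a network-level block apart from a TLS handshake problem (plain PHP sketch, not the toolkit) is to attempt a raw TCP connection first:

<?php
    // If plain TCP also times out, the block is at the network level
    // (dropped SYN packets) and no certificate tweaks will help.
    // If TCP connects but ssl:// fails, suspect the TLS handshake instead.
    $fp = stream_socket_client("tcp://inmate.tdcj.texas.gov:443", $errno, $errstr, 10);
    if ($fp === false)  echo "TCP connect failed:  [" . $errno . "] " . $errstr . "\n";
    else
    {
        echo "TCP connect OK.  The problem is further up the stack.\n";

        fclose($fp);
    }
?>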

Note that those domains resolve to two different IP addresses (i.e. different systems) and also do not respond to ping requests. The fact that those systems aren't responding to basic ping packets means their IT staff are not managing their network properly.
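That's easy to confirm from PHP (gethostbyname() returns the hostname unchanged when resolution fails):

<?php
    var_dump(gethostbyname("www.tdcj.texas.gov"));
    var_dump(gethostbyname("inmate.tdcj.texas.gov"));
?>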

syltrinket commented 1 year ago

Thanks so much! I'll try that. At one point, someone had been doing something similar to get the locations through a reverse proxy. That eventually stopped working, and I've been making do with whatever I could come up with since. I'll also try going through a proxy, but I'm guessing that if they are blocking IPs, then they could be doing something to block proxies as well.

Thank you for your help! And for the ultimate-web-scraper tool because that has helped me learn a bit more!

Update: yes, it does work fine from a local server using the default GetSafeSSLOpts settings.

cubiclesoft commented 1 year ago

Mostly complete lists of public proxy IPs, Tor exit nodes, and VPN IP ranges are pretty widely available in formats that can be used to update firewall rules automatically. So even if you manage to find a proxy that works today, it'll probably break shortly after being added to those lists assuming they are doing IP blocking.

The best option is to get a static IP in one of the aforementioned datacenters (e.g. DigitalOcean) and then contact the Texas government and ask nicely to whitelist that IP. HostGator is primarily shared hosting. If they are doing mass IP blocking, they won't whitelist shared hosting providers. Using a VPS provider carries the implication that you know a few things and intend to be a massive pain in the neck until they comply. If they refuse to whitelist a static IP but you have a valid use case and sufficient money, then get a lawyer and file a lawsuit to allow your access. Just filing the lawsuit will usually get them to back down and whitelist the IP unless they are idiots, want to look really bad publicly, and don't want to keep their jobs for very long. If you end up filing a lawsuit, make a big stink to your local media outlets, as they love reporting on that kind of government malfeasance - I'm pretty sure "enjoys embarrassing and humiliating elected and appointed officials" is somewhere in their job description.

If you don't have a lot of money/time/resources, there's always the DIY approach using your dynamic IP address at home. You push your home's dynamic IP address to the public web server so it knows where to reach your home network, and then your public-facing web application uses that information to talk to your computer at home (i.e. effectively a floating reverse proxy). The computer at home performs the web scraping on behalf of the public-facing server. This requires additional skills to set up: configuring a "port forward" on a router and keeping an always-on computer at home to handle inbound requests from the public web server. If you go this route, I recommend using a cheap, low-wattage mini PC (e.g. Beelink) or a spare laptop lying around.
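Here's a minimal sketch of the IP-push half of that setup. The endpoint URL and shared secret are made up for illustration, and ipify is just one of several public "what's my IP" services:

<?php
    // Runs on the always-on home machine (e.g. via cron every few minutes).
    $ip = trim(file_get_contents("https://api.ipify.org/"));

    $context = stream_context_create(array(
        "http" => array(
            "method" => "POST",
            "header" => "Content-Type: application/x-www-form-urlencoded",
            "content" => http_build_query(array("secret" => "CHANGE_ME", "ip" => $ip))
        )
    ));

    // The public-facing server stores the IP and forwards scrape requests to it.
    file_get_contents("https://public-server.example.org/update_home_ip.php", false, $context);
?>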

If your home ISP blocks most/all inbound connection traffic, then there's Remoted API Server:

https://github.com/cubiclesoft/remoted-api-server

That requires a LOT of specialized skills to set up, configure, and use, but it is specifically designed to completely circumvent all firewalls, is arguably a lot more secure than port forwarding, and can't be blocked by any existing technologies. I use Remoted API Server every day to successfully circumvent firewalls.

If their systems automatically block your home's dynamic IP address as it scrapes content, then just reset the router at home to get a new IP address. In short, they would have to mass block every IP address at your ISP in your area to stop you, which would immediately get the attention of local media outlets because no one else in the area would be able to access inmate information either. There are always workarounds to network blocks. Efforts to block web scrapers are an exercise in futility that only annoys people who already know all of the tricks. Government entities would be far better off providing their entire database in a bulk downloadable format (e.g. nightly ZIP file downloads) to alleviate the strain on their systems instead of attempting to block web scraping and utterly failing in the process.

syltrinket commented 1 year ago

I'm not sure this small convenience for our work is worth the effort, other than just to see if we can defeat their petty little interferences. The DIY approach with my home laptop is probably what I'll try, just to see if I can get it to function (a VPS is not really in our budget). I don't think Spectrum is blocking inbound connections.

You have provided way more than just a bug response, so thanks again for sharing your knowledge and helping me learn!