ThomasMertes / seed7

Source code of Seed7
GNU General Public License v2.0

Cannot download Rosetta Code URL #21

Closed: celtic-coder closed this issue 9 months ago

celtic-coder commented 9 months ago

Hi Thomas (@ThomasMertes),

This is a follow-up to issue https://github.com/ThomasMertes/seed7/issues/19. The following Seed7 code is not producing any output:

Complete Seed7 code

~~~
# https://rosettacode.org/wiki/Terminal_control/Display_an_extended_character
# https://www.rosettacode.org/wiki/Terminal_control/Display_an_extended_character
# See https://wheregoes.com/trace/20234415717/
# See "getHttps not downloading HTML" (https://github.com/ThomasMertes/seed7/issues/19)

$ include "seed7_05.s7i";
  include "gethttps.s7i";
  include "scanfile.s7i";  # open, getc, getwd

const proc: main is func
  local
    var string: RC_URL is "";
    var string: Page_HTML is "";
    var file: HTML_Output_File is STD_NULL;
  begin
    # No protocol (https://)
    # RC_URL := "web.whatsapp.com/";
    RC_URL := "www.rosettacode.org/wiki/Terminal_control/Display_an_extended_character";
    Page_HTML := getHttps(RC_URL);
    if Page_HTML <> "" then
      HTML_Output_File := open("Test-https-RC.html", "w");
      writeln("Writing to Output File...");
      write(HTML_Output_File, Page_HTML);
      close(HTML_Output_File);
    end if;
  end func;

# Note: Rex Swain Http Viewer (https://www.rexswain.com/httpview.html) has a problem with the URL (SSL error!)
~~~

As the commented-out line in the code shows, I also tried "getHttps" on a WhatsApp URL, and this worked without a problem.

What might the issue be with the Rosetta Code URL? Interestingly, as I noted in the code itself, Rex Swain's HTTP Viewer also has a problem with the site, but my Firefox browser does not have any problem. The folks at Rosetta Code have been having recent problems with their hosting provider (WikiTide), but this seems to be resolved, and is probably not the source of this specific problem.

Might the "getHttps" function be able to provide an HTTP Status code, for example, if no text information is returned from the function? I am working on an exercise with the Rosetta Code site, and it would be useful to use Seed7 for both extracting and processing the URLs in which I am interested.

Kind Regards, Liam

ThomasMertes commented 9 months ago

Hi Liam,

Sorry for the delay (see below). I added a line to check if getHttps was successful:

    Page_HTML := getHttps(RC_URL);
    writeln(length(Page_HTML));

When I did my first test this statement wrote 0. From that I concluded that www.rosettacode.org might use an encryption that tls.s7i does not support. I started investigating which encryption that would be.

The next day the test wrote 168380 without any changes in tls.s7i. This was strange but getHttps was able to read something. This test failed a few lines later with a RANGE_ERROR, because the downloaded page actually contains Unicode characters beyond '\255;' and a byte-file cannot support that. To fix that I added the include:

  include "utf8.s7i";

and I opened the file with openUtf8 instead of open:

      if Page_HTML <> "" then
        HTML_Output_File := openUtf8("Test-https-RC.html", "w");

With this change the file Test-https-RC.html is written with UTF-8 encoding. In contrast, a file opened with open accepts only Latin-1 characters (up to '\255;'), and an attempt to write a character beyond Latin-1 triggers a RANGE_ERROR.
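The difference between open and openUtf8 can be tried without any network access. The following is only a minimal sketch: the file names demo-latin1.txt and demo-utf8.txt and the variable names are made up for this example, and it assumes that the RANGE_ERROR is raised by write itself, so that a block with an exception handler can catch it.

```seed7
$ include "seed7_05.s7i";
  include "utf8.s7i";  # openUtf8

const proc: main is func
  local
    # The Euro sign (code point 8364) is beyond '\255;'.
    var string: sample is "Euro sign: \8364;";
    var file: outFile is STD_NULL;
  begin
    # A file opened with open is a byte file (Latin-1 only).
    outFile := open("demo-latin1.txt", "w");
    if outFile <> STD_NULL then
      block
        write(outFile, sample);
        writeln("byte file: write succeeded");
      exception
        catch RANGE_ERROR:
          writeln("byte file: RANGE_ERROR");
      end block;
      close(outFile);
    end if;

    # A file opened with openUtf8 encodes the characters as UTF-8.
    outFile := openUtf8("demo-utf8.txt", "w");
    if outFile <> STD_NULL then
      write(outFile, sample);
      close(outFile);
      writeln("UTF-8 file: write succeeded");
    end if;
  end func;
```

If the assumption holds, only the first write should fail, which mirrors what happened with the downloaded page.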

For the last 5 days I have been working on improvements to tls.s7i (they are still not finished). This caused the delay in my answer.

Tell me if the openUtf8 fix works for you. If not, you will need to wait for my improvements to tls.s7i.

Kind Regards, Thomas

celtic-coder commented 9 months ago

Hi Thomas (@ThomasMertes),

First, don't worry about the delay in responding to this issue. I will quote the phrase used regularly here in Ireland, "You are grand!" -- loosely translated in this situation as "You don't need to be bothered about that, at all". Thank you for taking those past five days to work on the tls.s7i include file.

Secondly, I can confirm that with the utf8.s7i include and openUtf8 I am able to download the URL without any problem. Interestingly, I got a value of 168379 for the length, rather than your 168380, but that is not of any consequence.
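For reference, the program with both changes applied (the utf8.s7i include and openUtf8) might look like the sketch below. The length output is the check suggested earlier in this thread. The test for STD_NULL and its message are additions for illustration and not part of the original listing. The result depends on the network and on the site, so no particular output is implied.

```seed7
$ include "seed7_05.s7i";
  include "gethttps.s7i";
  include "scanfile.s7i";
  include "utf8.s7i";  # openUtf8

const proc: main is func
  local
    var string: RC_URL is "";
    var string: Page_HTML is "";
    var file: HTML_Output_File is STD_NULL;
  begin
    # No protocol (https://)
    RC_URL := "www.rosettacode.org/wiki/Terminal_control/Display_an_extended_character";
    Page_HTML := getHttps(RC_URL);
    # A length of 0 means that nothing was downloaded.
    writeln(length(Page_HTML));
    if Page_HTML <> "" then
      # openUtf8 instead of open: the page contains characters beyond '\255;'.
      HTML_Output_File := openUtf8("Test-https-RC.html", "w");
      if HTML_Output_File <> STD_NULL then
        writeln("Writing to Output File...");
        write(HTML_Output_File, Page_HTML);
        close(HTML_Output_File);
      else
        # Illustrative addition: the output file could not be opened.
        writeln("Could not open the output file.");
      end if;
    end if;
  end func;
```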

Not only is the download identical to the one that I get when I use the cURL utility (as noted in issue https://github.com/ThomasMertes/seed7/issues/19), but getHttps also properly follows both 301 redirects from the plain "http" of "www.rosettacode.org/wiki/Terminal_control/Display_an_extended_character" to the "https" version of "rosettacode.org/wiki/Terminal_control/Display_an_extended_character".

Being able to use Seed7 for both downloading and then processing the URLs will streamline the project that I am currently working on. Thanks for your assistance with this endeavour!

Kind Regards, Liam