ThomasMertes / seed7

Source code of Seed7
GNU General Public License v2.0

getHttps not downloading HTML #19

Closed: celtic-coder closed this issue 11 months ago

celtic-coder commented 11 months ago

Hi Thomas (@ThomasMertes),

The following Seed7 code is not producing any output:

[Screenshot: Seed7 getHttps code listing]

However, the following cURL command works correctly:

curl.exe "https://example.com/" --compressed -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/116.0" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8" -H "Accept-Language: en-US,en;q=0.5" -H "Accept-Encoding: gzip, deflate, br" -H "DNT: 1" -H "Connection: keep-alive" -H "Upgrade-Insecure-Requests: 1" -H "Sec-Fetch-Dest: document" -H "Sec-Fetch-Mode: navigate" -H "Sec-Fetch-Site: cross-site" -H "Pragma: no-cache" -H "Cache-Control: no-cache" --output example.com.html

This is the cURL from https://curl.se/, though. The native cURL in Windows 10 fails with an error saying that the installed libcurl version does not support the "--compressed" option.

When I compile the program with the options s7c -tf -p and run it, it gives the following trace output:

-> main
-> 7243_getHttps
-> 4165_getHttpLocation
-> 394_isDigitString
<- 394_isDigitString
-> 394_isDigitString
<- 394_isDigitString
<- 4165_getHttpLocation
-> 7220_openHttps
-> 7171_openTlsSocket
-> 2177_openInetSocket
-> 2170_openSocket
<- 2170_openSocket
<- 2177_openInetSocket
-> 7165_openTlsSocket
<- 7165_openTlsSocket
<- 7171_openTlsSocket
<- 7220_openHttps
<- 7243_getHttps
<- main

This would seem to indicate that the program goes through the correct steps, starting and ending with getHttps. Also, the profile output gives:

usecs   calls   place                             name
31997   1       Download-HTML.sd7(4)              main
26997   1       /c/seed7/lib/gethttps.s7i(123)    getHttps
24010   1       /c/seed7/lib/gethttps.s7i(33)     openHttps
20005   1       /c/seed7/lib/tls.s7i(1864)        openTlsSocket
16002   1       /c/seed7/lib/socket.s7i(182)      openInetSocket
  998   1       /c/seed7/lib/gethttp.s7i(61)      getHttpLocation
    0   2       /c/seed7/lib/seed7_05.s7i(792)    isDigitString
    0   1       /c/seed7/lib/socket.s7i(133)      openSocket
    0   1       /c/seed7/lib/tls.s7i(1828)        openTlsSocket

Given that cURL works correctly, this may point to a problem with the getHttps function on my Windows 10 laptop. Are there any other steps I could take to troubleshoot this problem?

Kind Regards, Liam

ThomasMertes commented 11 months ago

Hi Liam,

Thank you for your report. It helped me to fix the wrong documentation of getHttps.

The function getHttps actually expects a location as parameter instead of a URL. A location is a URL without the leading https:// or http://.

I have corrected the wrong documentation in gethttps.s7i and gethttp.s7i.

So, if you use getHttps("example.com") instead of getHttps("https://example.com/"), it will probably work. At least on my computer I succeed with:

$ include "seed7_05.s7i";
  include "gethttps.s7i";

const proc: main is func
  local
    var string: Page_HTML is "";
  begin
    Page_HTML := getHttps("example.com");
    if Page_HTML <> "" then
      writeln(Page_HTML);
    end if;
  end func;

This writes the HTML of example.com. Now I could create a DOM from the HTML string with readHtml.

When I implemented getHttp it seemed ridiculous to specify HTTP twice (in the function name and in the URL), so I decided that http:// must be omitted from the URL. I just forgot to document this approach, and I also kept url as the parameter name. Now the parameter is named location and the documentation contains an explanation and some examples.
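For callers that start out with a full URL, a small helper can strip the scheme before the string is passed to getHttps. The following is only a sketch: stripScheme is an invented name, not part of the Seed7 library, and it relies on startsWith and string slicing from seed7_05.s7i:

```
$ include "seed7_05.s7i";

(* Hypothetical helper: remove a leading https:// or http://
   so that a full URL can be passed on as a location. *)
const func string: stripScheme (in string: url) is func
  result
    var string: location is "";
  begin
    location := url;
    if startsWith(url, "https://") then
      location := url[9 ..];   (* "https://" has 8 characters *)
    elsif startsWith(url, "http://") then
      location := url[8 ..];   (* "http://" has 7 characters *)
    end if;
  end func;

const proc: main is func
  begin
    writeln(stripScheme("https://example.com/"));  (* writes example.com/ *)
  end func;
```

With such a helper, getHttps(stripScheme("https://example.com/")) would accept either form of input.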

celtic-coder commented 11 months ago

Hi Thomas (@ThomasMertes),

Thanks for making the documentation changes! I can confirm that the HTML for example.com now downloads as expected.

Kind Regards, Liam