elementor / wp2static

WordPress static site generator for security, performance and cost benefits
https://wp2static.com
The Unlicense
1.39k stars 257 forks

Crawling from WP CLI - Mangled URL - Uncaught InvalidArgumentException: Unable to parse URI #908

Open stellarpower opened 8 months ago

stellarpower commented 8 months ago

Before creating an issue / filing a support request

Determining if it's an issue with Theme, Plugin, Environment or a bug in WP2Static

This is a difference in behaviour between the CLI and web interfaces to WP2Static. So, even if there are issues elsewhere, I believe this is most appropriate as a bug against WP2Static itself for the time being.

Describe the bug So far, WP2Static has worked without a hitch. I have installed from a ZIP file and only ever processed from the web UI.

This site is hosted on a local machine in a container; after exporting, the static site is sent up into the cloud for live hosting. In the settings, I set the "Deployment URL" to be simply /; this has allowed maximal flexibility with hosting downstream, where the static site can be viewed under multiple subdomains without any problems. The export process from the web UI works fine with this.

I took some time today to have a play with kicking off the process programmatically using the wp CLI tool. If I begin the export this way (wp wp2static crawl), the logs show it fetching all the pages okay. At some point after this, it seems an invalid URL replacement is performed, or in some other manner a totally mangled URL comes out. This then throws an exception and I get a backtrace in the logs.

If I then proceed to generate the export again from the web UI, I get a 500 error in the browser, the same as this one

If I delete the plugin and re-upload it from a zip to nuke my settings (is there a faster way to do this, BTW?), then go back and redo my settings, we are back to normal. Given the documentation seems to be a little out of date, it's possible I am not using the CLI tool properly. Ideally I'd like it to kick off a job with the exact same settings as currently configured in the UI, but perhaps I need to give it some more options. Otherwise, this seems to suggest that the CLI tool is missing or adding a step that mutates the state in the settings, so that web-based calls then fail too.

To Reproduce Steps to reproduce the behavior:

Environment (please complete the following information):

Log files (please complete the following information):

[04-Nov-2023 02:55:56 UTC] PHP Fatal error:  Uncaught InvalidArgumentException: Unable to parse URI: https://machine.domain:888http/machine.domain:888/wp-content/et-cache/1010/et-core-unified-1010.min.css in /var/www/html/sitename/wp-content/plugins/wp2static/vendor/leonstafford/wp2staticpsr7/src/Uri.php:72
Stack trace:
#0 /var/www/html/sitename/wp-content/plugins/wp2static/vendor/leonstafford/wp2staticpsr7/src/Request.php(42): WP2StaticGuzzleHttp\Psr7\Uri->__construct()
#1 /var/www/html/sitename/wp-content/plugins/wp2static/src/Crawler.php(136): WP2StaticGuzzleHttp\Psr7\Request->__construct()
#2 /var/www/html/sitename/wp-content/plugins/wp2static/vendor/leonstafford/wp2staticguzzle/src/Pool.php(56): WP2Static\Crawler->WP2Static\{closure}()
#3 [internal function]: WP2StaticGuzzleHttp\Pool::WP2StaticGuzzleHttp\{closure}()
#4 /var/www/html/sitename/wp-content/plugins/wp2static/vendor/leonstafford/wp2staticpromises/src/EachPromise.php(212): Generator->next()
#5 / in /var/www/html/sitename/wp-content/plugins/wp2static/vendor/leonstafford/wp2staticpsr7/src/Uri.php on line 72
[2023-11-04T02:12:43+00:00] Starting crawling
[2023-11-04T02:12:43+00:00] Using basic auth credentials to crawl
[2023-11-04T02:12:43+00:00] Starting to crawl detected URLs.
[2023-11-04T02:12:43+00:00] Using CrawlCache.
[2023-11-04T02:13:21+00:00] Crawling progress: 300 crawled, 300 skipped (cached).
[2023-11-04T02:13:25+00:00] Crawling progress: 600 crawled, 600 skipped (cached).
[2023-11-04T02:13:29+00:00] Crawling progress: 900 crawled, 900 skipped (cached).
[2023-11-04T02:13:32+00:00] Crawling progress: 1200 crawled, 1200 skipped (cached).
[2023-11-04T02:13:45+00:00] Crawling progress: 1500 crawled, 1500 skipped (cached).
[2023-11-04T02:13:51+00:00] Crawling progress: 1800 crawled, 1800 skipped (cached).
[2023-11-04T02:13:54+00:00] Crawling progress: 2100 crawled, 2100 skipped (cached).
[2023-11-04T02:13:58+00:00] Crawling progress: 2400 crawled, 2400 skipped (cached).
[2023-11-04T02:14:01+00:00] Crawling progress: 2700 crawled, 2700 skipped (cached).
[2023-11-04T02:14:05+00:00] Crawling progress: 3000 crawled, 3000 skipped (cached).
[2023-11-04T02:14:09+00:00] Crawling progress: 3300 crawled, 3300 skipped (cached).
[2023-11-04T02:14:12+00:00] Crawling progress: 3600 crawled, 3600 skipped (cached).
[2023-11-04T02:14:17+00:00] Crawling progress: 3900 crawled, 3900 skipped (cached).
[2023-11-04T02:14:22+00:00] Crawling progress: 4200 crawled, 4200 skipped (cached).
[2023-11-04T02:14:25+00:00] Crawling progress: 4500 crawled, 4500 skipped (cached).
[2023-11-04T02:14:28+00:00] Crawling progress: 4800 crawled, 4800 skipped (cached).
[2023-11-04T02:14:32+00:00] Crawling progress: 5100 crawled, 5100 skipped (cached).

stellarpower commented 1 month ago

Logging this as an error rather than failing here

patrickdk77 commented 1 month ago

The issue is that your site is configured for https, but some things are returning http URLs. WP2Static doesn't treat http and https as the same site, so http://example.com != https://example.com, and that causes this error.
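
To illustrate the kind of mismatch described above, here is a minimal sketch in Python. It is not WP2Static's actual code: the helper names are hypothetical, and it assumes the site URL is stripped with an exact, scheme-sensitive prefix match and then prepended again before the request is built. Under that assumption an http URL falls through unstripped and ends up glued onto the https site URL. The result is similar to, but not identical to, the URI in the log (which shows http/ rather than http://, so some further normalisation presumably happens along the way).

```python
SITE_URL = "https://machine.domain:888"  # site URL as configured (https)


def strip_site_url(url: str, site_url: str = SITE_URL) -> str:
    """Hypothetical helper: turn an absolute URL into a site-relative path
    using an exact, scheme-sensitive prefix match."""
    if url.startswith(site_url):
        return url[len(site_url):]
    # No match: the absolute URL falls through as if it were already a path.
    return url


def build_crawl_url(path: str, site_url: str = SITE_URL) -> str:
    """Hypothetical helper: prepend the site URL to a path before requesting it."""
    return site_url + path


def strip_site_url_any_scheme(url: str, site_url: str = SITE_URL) -> str:
    """Variant that treats the http:// and https:// forms of the host as equal."""
    host = site_url.split("://", 1)[1]
    for scheme in ("https://", "http://"):
        prefix = scheme + host
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


# An asset URL emitted with http even though the site is configured for https.
asset = "http://machine.domain:888/wp-content/et-cache/1010/et-core-unified-1010.min.css"

mangled = build_crawl_url(strip_site_url(asset))
fixed = build_crawl_url(strip_site_url_any_scheme(asset))

print(mangled)
print(fixed)
```

With the scheme-sensitive version the two URLs are concatenated; with the scheme-insensitive variant the asset resolves to a normal https URL on the same host.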

stellarpower commented 1 month ago

That makes sense, I guess. I don't know where I would find the plugin that is the culprit. The thing is behind a reverse proxy, so it should all be pointing to the ingress and never use unencrypted HTTP.

How come it's bunged on the end of the URL though? From memory, I thought I added some logging and that literal URL was what Guzzle was trying to parse.