elementor / wp2static

WordPress static site generator for security, performance and cost benefits
https://wp2static.com
The Unlicense

Incorrect detection of portion of URLs on some environments #582

Open claudiobrandt opened 4 years ago

claudiobrandt commented 4 years ago

This is in reference to URLs being detected but not included in the generated static site.

2759 URLs were detected and crawled, but only 1071 URLs were included in the generated site. The list stopped where the URLs changed from relative paths to full URIs. There is nothing in the PHP error log, and no records in either the local firewall's logs or Cloudflare's.

claudiobrandt commented 4 years ago

Hi Leon,

I now believe this issue has to do with Autoptimize's latest update, to version 2.7.3. I found this out because I needed to make a minor update to one site that I had already exported using WP2Static 7.0-7 and successfully deployed to Cloudflare Workers Sites, and it behaved exactly the same way my other site was behaving, stopping before the first Autoptimize cached file. I rolled AO back to version 2.7.2 and the whole process worked as before. You may want to have a look at AO's changelog to see what changes could be causing this, and whether this is a bug on their part or just a conflict between the two plugins.

Thanks!

john-shaffer commented 4 years ago

> You may want to have a look at AO's changelog

Is there a changelog? I don't see one.

claudiobrandt commented 4 years ago

A changelog for AO is available here: https://wordpress.org/plugins/autoptimize/#developers

However, as I said on the staticword.press forum, I'm not so sure anymore the issue has to do with AO, as only one of my 2 websites with WP2Static was able to regenerate the site correctly after rolling back to the previous version of AO. The error persists in the other site.

john-shaffer commented 4 years ago

> A changelog for AO is available here: https://wordpress.org/plugins/autoptimize/#developers

Thanks. Unfortunately, they might as well have said "2.7.3: Changed some things". Skimming the commit log, nothing stands out.

I don't use AO on any of my static sites as it consistently has issues. (Most of the sites have custom themes that don't need post-processing). Can you try generating the site with AO disabled to verify if it is related?

claudiobrandt commented 4 years ago

The issue now doesn't seem related to Autoptimize. I removed Autoptimize after purging its cache, and even removed the cache folder it uses. The issue remained. Over 2,500 URLs are crawled, fewer than 1,000 are processed, and the static site is broken because many JS and CSS files are missing.

The crawled list is processed up to the point where it starts listing full URLs instead of relative paths. Also, these URLs are http:// instead of https://. My sites are configured in wp-config.php to use https. I removed the http-to-https redirect from the .htaccess file, as well as the other security directives in it, but unfortunately the issue is still there.

JamesColeman-AH commented 3 years ago

I ran across this issue myself and found the cause. WordPress's is_ssl() function (https://github.com/WordPress/WordPress/blob/dfc3eeff10c81c52fff3825869f583763cad0c58/wp-includes/load.php#L1362) detects the site as not SSL if the HTTPS environment variable is not set. When it's not detecting SSL, the set_url_scheme() function (https://github.com/WordPress/WordPress/blob/35f6c356c12d161456c6ab8df6587f414dba2a51/wp-includes/link-template.php#L3659) automatically sets the scheme to http:// instead of https://.

To fix this, I'm adding the following before my wp-cli calls:

export HTTPS=on

Just thought I'd share this information. It may be a good idea to add this to the https://wp2static.com/developers/wp-cli/ docs.
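
For readers who want to see why the variable matters, here is a minimal sketch of the decision described above. It is a simplified paraphrase of the checks in is_ssl(), not the WordPress source; the names sketch_is_ssl() and sketch_scheme() are hypothetical, and the server array is passed in as a parameter so the snippet runs outside WordPress. It also assumes that PHP's CLI exposes exported environment variables through $_SERVER.

```php
<?php
// Simplified paraphrase of the checks in WordPress's is_ssl().
// sketch_is_ssl() is a hypothetical name, not a WordPress function.
function sketch_is_ssl( array $server ): bool {
    if ( isset( $server['HTTPS'] ) ) {
        // An explicit HTTPS value decides the outcome on its own.
        $https = strtolower( (string) $server['HTTPS'] );
        return $https === 'on' || $https === '1';
    }
    // Otherwise only port 443 counts as secure.
    return isset( $server['SERVER_PORT'] )
        && (string) $server['SERVER_PORT'] === '443';
}

// Hypothetical helper: the scheme that ends up on generated URLs.
function sketch_scheme( array $server ): string {
    return sketch_is_ssl( $server ) ? 'https' : 'http';
}

// A bare CLI run: neither HTTPS nor SERVER_PORT is present.
echo sketch_scheme( [] ), "\n";
// The same run after `export HTTPS=on` (assumed to appear in $_SERVER).
echo sketch_scheme( [ 'HTTPS' => 'on' ] ), "\n";
```

The first call yields http and the second https, which matches the behaviour reported here: without the variable, a command-line run has nothing that marks the request as secure.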

leonstafford commented 3 years ago

Great, thanks for sharing, @JamesColeman-AH!

I was editing something similar we have, recently introduced with the sitemap parsing functionality, IIRC, in /src/URLHelper.php:

class URLHelper {
    public static function isSecure() : bool {
        return ( ! empty( $_SERVER['HTTPS'] ) && $_SERVER['HTTPS'] !== 'off' ) ||
            $_SERVER['SERVER_PORT'] == 443;
    }

    /*
     * Returns the current full URL including querystring
     *
     * @return string
     */
    public static function getCurrent() : string {
        $scheme = self::isSecure() ? 'https' : 'http';
        $url = $scheme . '://' . $_SERVER['HTTP_HOST'];

        // Only include port number if needed
        if ( ! in_array( $_SERVER['SERVER_PORT'], [ 80, 443 ] ) ) {
            $url .= ':' . $_SERVER['SERVER_PORT'];
        }

        $url .= $_SERVER['REQUEST_URI'];

        return $url;
    }
}

This also feels brittle, and it should add a check against the WP Site URL as well. That one's in our codebase, at least.
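
One way such a check could look, as a rough sketch only: sketchIsSecure() is a hypothetical function, not part of WP2Static, and it takes the server array and the Site URL as parameters (in a plugin these would presumably come from $_SERVER and the WordPress Site URL option). It guards against SERVER_PORT being absent, as it can be under WP-CLI, and falls back to the scheme of the Site URL.

```php
<?php
// Hypothetical variant of isSecure(); inputs are passed in so the
// sketch runs standalone instead of reading $_SERVER and WordPress.
function sketchIsSecure( array $server, string $siteUrl ): bool {
    // An explicit, non-"off" HTTPS flag from the server counts as secure.
    if ( ! empty( $server['HTTPS'] )
        && strtolower( (string) $server['HTTPS'] ) !== 'off' ) {
        return true;
    }

    // SERVER_PORT may be missing entirely, so guard the lookup.
    if ( isset( $server['SERVER_PORT'] )
        && (int) $server['SERVER_PORT'] === 443 ) {
        return true;
    }

    // Fall back to the scheme of the configured Site URL.
    $scheme = parse_url( $siteUrl, PHP_URL_SCHEME );
    return is_string( $scheme ) && strtolower( $scheme ) === 'https';
}

// With an empty server array, the Site URL decides the result.
var_dump( sketchIsSecure( [], 'https://example.com/' ) );
var_dump( sketchIsSecure( [], 'http://example.com/' ) );
```

Letting the Site URL act as the last resort is a design assumption here; whether it should instead override the server values is a separate question.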

In your instance, what do you have set for WordPress's Site URL and if it's http, is your webserver rewriting to https?

WP2Static/Static HTML Output assume your WP Site URL is the same as the crawl URL. There is a little bit of code to help rewrite any links accidentally left with a mismatched protocol. There's an option to specify a custom port to help with crawling non-standard ports, but I've avoided adding an option to specify a completely different URL for crawling, as the extra logic will complicate my days 😸

So, I'd like to be able to just say:

Your WordPress development site needs to have WordPress' Site URL option defined as exactly the URL you use to access the site, i.e. if you access https://mydomain.com, then the Site URL option should match exactly, including the https protocol, not http://mydomain.com.
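
A quick way to test that rule, sketched under assumptions: sketchSiteUrlMismatch() is a hypothetical helper, not something in WP2Static or WordPress, and it compares only scheme, host and port, so differences in the path are ignored.

```php
<?php
// Hypothetical helper: returns the first URL part (scheme, host or port)
// on which the two URLs disagree, or null when all three match.
function sketchSiteUrlMismatch( string $siteUrl, string $accessUrl ): ?string {
    // parse_url() returns false for seriously malformed input.
    $site   = parse_url( $siteUrl ) ?: [];
    $access = parse_url( $accessUrl ) ?: [];

    foreach ( [ 'scheme', 'host', 'port' ] as $part ) {
        $a = strtolower( (string) ( $site[ $part ] ?? '' ) );
        $b = strtolower( (string) ( $access[ $part ] ?? '' ) );
        if ( $a !== $b ) {
            return $part;
        }
    }

    return null;
}

// The Site URL says http, but the site is reached over https.
var_dump( sketchSiteUrlMismatch( 'http://mydomain.com', 'https://mydomain.com' ) );
```

For the pair above the helper reports scheme, which is the kind of mismatch the sentence warns about.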

I'd like to hear more about your setup and better understand the issue before I add that line to the WP_CLI docs, but happy to once it's clear to me.

JamesColeman-AH commented 3 years ago

Our Site URL is set to https://example.com/, and when I was trying to use wp-cli to detect/crawl, I was getting errors where the DetectPluginAssets class was returning http URLs for plugins. After troubleshooting, that's where I found that I needed to export HTTPS=on.

leonstafford commented 3 years ago

Thanks @JamesColeman-AH I'll look into that class and some more checks for the URLs being returned in the wrong protocol.

I've added your note to the WP_CLI docs:

[Screenshot, 2020-12-09: the note in the WP_CLI docs]

fertek commented 1 year ago

For me the problem was in this code in the DetectPluginAssets.php file.

# The value returned by SiteInfo::getUrl( 'plugins' ) is http://example.com/wp-content/plugins/.

$detected_filename =
    str_replace(
        get_home_url(), # e.g. value: https://example.com
        '',
        $detected_filename # e.g. value: http://example.com/wp-content/plugins/contact-form-7/admin/css/styles.css
    );

Because the protocol does not match (https:// from get_home_url() vs http:// from SiteInfo::getUrl( 'plugins' )), the string http:/example.com/wp-content/plugins/contact-form-7/admin/css/styles.css is inserted into the wp2static_urls table instead of the correct /wp-content/plugins/contact-form-7/admin/css/styles.css.
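
A protocol-agnostic version of that replacement could look roughly like the sketch below. sketchToSitePath() is a hypothetical helper, not the code in DetectPluginAssets.php, and it assumes that the home URL and the detected URL differ only in http versus https.

```php
<?php
// Hypothetical helper: strip the site's origin from a detected URL
// while ignoring an http/https mismatch between the two.
function sketchToSitePath( string $homeUrl, string $detectedUrl ): string {
    // Drop the scheme from the home URL, leaving e.g. "example.com".
    $hostAndPath = (string) preg_replace( '#^https?://#i', '', rtrim( $homeUrl, '/' ) );

    // Match either scheme plus that host at the start of the detected
    // URL, and only when followed by "/" or the end of the string.
    $pattern = '#^https?://' . preg_quote( $hostAndPath, '#' ) . '(?=/|$)#i';

    // Fall back to the unchanged input if the regex engine reports an error.
    return preg_replace( $pattern, '', $detectedUrl, 1 ) ?? $detectedUrl;
}

$home     = 'https://example.com';
$detected = 'http://example.com/wp-content/plugins/contact-form-7/admin/css/styles.css';

// Plain str_replace() finds no match, so the URL comes back unchanged.
var_dump( str_replace( $home, '', $detected ) === $detected );
// The scheme-agnostic version reduces it to a site-relative path.
echo sketchToSitePath( $home, $detected ), "\n";
```

With the example values from this comment, the helper returns /wp-content/plugins/contact-form-7/admin/css/styles.css, while the plain str_replace() call leaves the detected URL untouched.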