elementor / wp2static

WordPress static site generator for security, performance and cost benefits
https://wp2static.com
The Unlicense
1.42k stars 266 forks source link

Multiple simultaneous site generations? #621

Closed Flynsarmy closed 3 years ago

Flynsarmy commented 4 years ago

When browsing to Admin - WP2Static - Run page and clicking Generate Static Site button, if I wait a bit, refresh the page and click the button again, multiple site generations are now taking place simultaneously. You can see this in the below log:

2020-08-03 15:47:01: Crawling progress: 5400 crawled, 1 skipped (cached). 2020-08-03 15:46:15: Crawling progress: 11100 crawled, 0 skipped (cached). 2020-08-03 15:45:17: Crawling progress: 5700 crawled, 1 skipped (cached). 2020-08-03 15:44:29: Crawling progress: 6300 crawled, 1 skipped (cached). 2020-08-03 15:43:36: Crawling progress: 5400 crawled, 1 skipped (cached). 2020-08-03 15:42:34: Crawling progress: 5100 crawled, 1 skipped (cached). 2020-08-03 15:41:34: Crawling progress: 10800 crawled, 0 skipped (cached). 2020-08-03 15:41:05: Crawling progress: 5400 crawled, 1 skipped (cached). 2020-08-03 15:40:05: Crawling progress: 6000 crawled, 1 skipped (cached). 2020-08-03 15:39:05: Crawling progress: 5100 crawled, 1 skipped (cached). 2020-08-03 15:38:40: Crawling progress: 4800 crawled, 0 skipped (cached). 2020-08-03 15:37:06: Crawling progress: 10500 crawled, 0 skipped (cached). 2020-08-03 15:36:38: Crawling progress: 5100 crawled, 1 skipped (cached). 2020-08-03 15:35:40: Crawling progress: 5700 crawled, 1 skipped (cached).

After pressing the button I didn't notice any jobs being created on the Jobs page, I don't know how to cancel a generation once it's begun (and for large sites like mine this could be a problem).

leonstafford commented 4 years ago

How to cancel in this scenario?

Restart php/webserver. Sucks, but until we get proper async background tasks in place, going to remain an issue.

Had thought about adding some other locking mechanism, but would require some resource consumption each cycle, either hitting DB or filesystem to check for lock (I assume?).

@john-shaffer feelings either way?

john-shaffer commented 4 years ago

I'm used to having proper threading and locking in the language's core, so I'm a little lost when it comes to PHP.

deliciousbrain's offload media plugin is fantastic, so their reputation is good, but this is insane to me:

If your site is behind BasicAuth, both async requests and background processes will fail to complete. This is because WP Background Processing relies on the WordPress HTTP API, which requires you to attach your BasicAuth credentials to requests.

Why the hell would you make an HTTP request to call a local function? I know WP does it because I've wasted days troubleshooting those issues, but WP does a lot of unspeakable things. It also uses WP Cron, which is unreliable in my experience. It uses transients for "locking", but transients can disappear at any time - https://github.com/deliciousbrains/wp-background-processing/issues/39. This looks like a "now you have two problems" situation.

I think that the biggest problem right now is simply the lack of chunking, particularly in the crawl job. If we could break crawl into chunks of 20-50 URLs, then we can save the start time in the DB and allow an appropriate amount of time. E.g., if we have chunks of 20 and set our curl timeout to 30 seconds, then after a little over 10 minutes we can assume that the job has died and start another one. Maybe add a chunk_start_time column to the jobs table, and perhaps crawled_time to the URLs table? It seems like that would also permit adding new URLs during the crawl, since our query will be something like SELECT * FROM wp_wp2static_urls WHERE crawled_time IS NULL OR crawled_time <= $time LIMIT 20. Chunking would also help with the issues people are having with large sites.

We can prevent simultaneous starts with DB-level locking. I prefer DB locks, because I don't know if we can count on locking behaving consistently across different OSes and file systems. I don't think that using the DB for locking will measurably affect performance, as it seems that generating the pages is about 99% of the workload.

I think our existing scheduling is fine. The best setup is to have a cron job running wp wp2static process_queue, but WP Cron can be used as well. That seems ideal.

However, I think that any revamp of the crawl queue should consider some way to crawl selectively. I'm not sure if that's terribly practical with WP's design, but having imperfect change detection is better than not being usable on large sites at all. This could be as simple as nulling out crawled_time when a post is changed.

leonstafford commented 4 years ago

Great food for thought, many thanks!

Re selective crawl queuing, I think the way Static HTML Output is doing it now will solve these for your case? ie, very minimal URL detection and then crawling all detected URLs from site, recursively/infinite levels. That's what my open (stale) branch in this repo is starting towards.

You mentioned in another issue in that it is still trying to crawl all URLs each time - does the current crawl caching not prevent this if used? ie, the issue your stating is with the extra check of each URL, regardless of whether it skips it due to caching or not? Maybe we can start a discussion on forum/new issue to address that specifically and I can rethink approach in here.

john-shaffer commented 4 years ago

The crawl caching just skips saving to disk. It still crawls every URL each time, which is why there's no real speed difference between having crawl caching on or off. For large sites, we need to skip hitting the URL at all.

I'm not familiar with the Static HTML Output source, but it looks like it adds URLS to the crawl queue when found, and deletes them as it crawls them. I assume it checks against a crawl cache so it doesn't re-add an already crawled URL? That should work fine too. I'm not sure if the churn from adding and deleting rows would slow it down noticeably. I need to make a huge site and do some speed tests.

john-shaffer commented 4 years ago

I'm exploring and testing some of this in #633

john-shaffer commented 3 years ago

Fixed in https://github.com/leonstafford/wp2static/commit/a83d56299f738da5046efb19f476c6c8ff15a11c due to the use of MySQL locks.