elementor / static-html-output

Static HTML Output Plugin for WordPress
https://statichtmloutput.com
The Unlicense

Crawl until exhausted, not just 2 levels #85

Closed leonstafford closed 4 years ago

leonstafford commented 4 years ago

As referenced in #81 and #79

These are issues around detecting too much (i.e., admin-view CSS files that are not used on the site) and detecting too little (i.e., WPML paginated URLs).

Ironically, normal full spidering was how this plugin used to function many years ago, more like SimplerStatic/SimplyStatic.

Not wanting to totally rearchitect this plugin (that's what WP2Static is for), a relatively simple adjustment should be made: keep crawling until all newly discovered internal URLs have been crawled.

Previously, shortcomings in the HTML/CSS parsing meant that not all assets were detected, which is where the greedy detection was a useful workaround, but it should no longer be needed.

This proposed change is not the ultimate in elegance/extensibility, but should provide a good improvement without requiring too much effort.

flow

Note: crawl progress will be harder to measure in % terms, so the display should shift to Detected vs Processed counts:

100 URLs Detected, 0 Processed
...
100 URLs Detected, 75 Processed
120 URLs Detected, 80 Processed
...
120 URLs Detected, 120 Processed

The crawler knows to continue only by the crawl_queue not being empty; once the queue is empty, the crawl is finished.
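
A minimal sketch of that loop, in Python rather than the plugin's own PHP, and not the plugin's actual code. The names `crawl_until_exhausted`, `fetch_links`, and the `SITE` dictionary are hypothetical stand-ins; `fetch_links` is assumed to return the internal URLs found on a page.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Tuple


def crawl_until_exhausted(
    initial_urls: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
) -> Iterator[Tuple[int, int]]:
    """Crawl until the queue is empty, yielding (detected, processed) counts."""
    crawl_queue = deque()  # URLs waiting to be crawled
    detected = set()       # every URL seen so far, crawled or not

    def enqueue(url: str) -> None:
        # Only queue URLs that have never been seen, so each is crawled once.
        if url not in detected:
            detected.add(url)
            crawl_queue.append(url)

    for url in initial_urls:
        enqueue(url)

    processed = 0
    yield len(detected), processed
    # The only stop condition: nothing left in the queue.
    while crawl_queue:
        url = crawl_queue.popleft()
        for link in fetch_links(url):
            enqueue(link)
        processed += 1
        yield len(detected), processed


# Hypothetical in-memory "site": page URL -> internal URLs found on it.
SITE = {
    "/": ["/blog/", "/style.css"],
    "/blog/": ["/", "/blog/page/2/"],
    "/style.css": ["/font.woff2"],
    "/blog/page/2/": [],
    "/font.woff2": [],
}

for detected_count, processed_count in crawl_until_exhausted(
    ["/"], lambda u: SITE.get(u, [])
):
    print(f"{detected_count} URLs Detected, {processed_count} Processed")
```

The detected total grows while the crawl runs, and the loop ends when the processed count catches up with it.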

Tasks

leonstafford commented 4 years ago

@crstauf a heads up on the progress indicator as mentioned in the notes here.

If we use a %-based progress indicator, it will keep fluctuating, because newly discovered URLs are added to the total while crawling.

Do you think something like above will be fine for showing progress?
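
To illustrate the fluctuation with the example counts from the issue description, here is a small sketch (the `percent_complete` helper is hypothetical, not part of the plugin):

```python
# (detected, processed) snapshots taken from the example in the issue text.
snapshots = [(100, 0), (100, 75), (120, 80), (120, 120)]


def percent_complete(detected: int, processed: int) -> float:
    """Naive percentage against a total that can grow mid-crawl."""
    return 100.0 * processed / detected if detected else 0.0


for detected, processed in snapshots:
    pct = percent_complete(detected, processed)
    print(f"{detected} URLs Detected, {processed} Processed ({pct:.1f}%)")
```

Between the second and third snapshots, more URLs have been processed (75 to 80), yet the percentage drops from 75.0% to 66.7% because the detected total grew from 100 to 120.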

There is also a new export log, with URL, note, and status columns, that you will be able to inspect while the export is running if you want more info on what's going on. The raw data looks like this (status 777 is, for now, denoting files excluded based on a new Excludes table of plugin + user-added exclusion patterns):

+----+----------------------------------------------------------------------------+----------------------------------------------------------+--------+
| id | url                                                                        | note                                                     | status |
+----+----------------------------------------------------------------------------+----------------------------------------------------------+--------+
|  1 | /                                                                          | initial_crawl_list                                       |    200 |
|  2 | /arrobase/                                                                 | initial_crawl_list                                       |    200 |
|  3 | /author-sitemap.xml                                                        | initial_crawl_list                                       |    404 |
|  4 | /blog/                                                                     | initial_crawl_list                                       |    200 |
|  5 | /category-sitemap.xml                                                      | initial_crawl_list                                       |    200 |
|  6 | /de-home/                                                                  | initial_crawl_list                                       |    200 |
|  7 | /favicon.ico                                                               | initial_crawl_list                                       |    404 |
|  8 | /french-home/                                                              | initial_crawl_list                                       |    200 |
|  9 | /page-sitemap.xml                                                          | initial_crawl_list                                       |    200 |
| 10 | /page-with-external-url-in-style/                                          | initial_crawl_list                                       |    200 |
| 11 | /post-sitemap.xml                                                          | initial_crawl_list                                       |    200 |
| 12 | /robots.txt                                                                | initial_crawl_list                                       |    404 |
| 13 | /sample-page/                                                              | initial_crawl_list                                       |    200 |
| 14 | /sassytest/                                                                | initial_crawl_list                                       |    200 |
| 15 | /sitemap.xml                                                               | initial_crawl_list                                       |    200 |
| 16 | /sitemap_index.xml                                                         | initial_crawl_list                                       |    200 |
| 17 | /comments/feed/                                                            | discovered on: /                                         |    200 |
| 18 | /feed/                                                                     | discovered on: /                                         |    200 |
| 19 | /wp-content/themes/twentytwenty/assets/js/index.js                         | discovered on: /                                         |    200 |
| 20 | /wp-content/themes/twentytwenty/print.css                                  | discovered on: /                                         |    200 |
| 21 | /wp-content/themes/twentytwenty/style.css                                  | discovered on: /                                         |    200 |
| 22 | /wp-content/uploads/2020/05/arrobase@image.jpg                             | discovered on: /                                         |    200 |
| 23 | /wp-includes/css/dist/block-library/style.min.css                          | discovered on: /                                         |    200 |
| 24 | /wp-json/                                                                  | discovered on: /                                         |    777 |
| 25 | /wp-json/oembed/1.0/embed                                                  | discovered on: /                                         |    777 |
| 26 | /sample-page/feed/                                                         | discovered on: /sample-page/                             |    200 |
| 27 | /wp-admin/                                                                 | discovered on: /sample-page/                             |    200 |
| 28 | /wp-admin/css/forms.min.css                                                | discovered on: /wp-admin/                                |    200 |
| 29 | /wp-admin/css/l10n.min.css                                                 | discovered on: /wp-admin/                                |    200 |
| 30 | /wp-admin/css/login.min.css                                                | discovered on: /wp-admin/                                |    200 |
| 31 | /wp-admin/js/password-strength-meter.min.js                                | discovered on: /wp-admin/                                |    200 |
| 32 | /wp-admin/js/user-profile.min.js                                           | discovered on: /wp-admin/                                |    200 |
| 33 | /wp-includes/css/buttons.min.css                                           | discovered on: /wp-admin/                                |    200 |
| 34 | /wp-includes/css/dashicons.min.css                                         | discovered on: /wp-admin/                                |    200 |
| 35 | /wp-includes/js/jquery/jquery-migrate.min.js                               | discovered on: /wp-admin/                                |    200 |
| 36 | /wp-includes/js/jquery/jquery.js                                           | discovered on: /wp-admin/                                |    200 |
| 37 | /wp-includes/js/underscore.min.js                                          | discovered on: /wp-admin/                                |    200 |
| 38 | /wp-includes/js/wp-util.min.js                                             | discovered on: /wp-admin/                                |    200 |
| 39 | /wp-includes/js/zxcvbn-async.min.js                                        | discovered on: /wp-admin/                                |    200 |
| 40 | /wp-content/themes/twentytwenty/assets/fonts/inter/Inter-italic-var.woff2  | discovered on: /wp-content/themes/twentytwenty/style.css |    200 |
| 41 | /wp-content/themes/twentytwenty/assets/fonts/inter/Inter-upright-var.woff2 | discovered on: /wp-content/themes/twentytwenty/style.css |    200 |
| 42 | /wp-admin/images/loading.gif                                               | discovered on: /wp-admin/css/forms.min.css               |    200 |
| 43 | /wp-admin/images/w-logo-blue.png                                           | discovered on: /wp-admin/css/login.min.css               |    200 |
| 44 | /wp-admin/images/wordpress-logo.svg                                        | discovered on: /wp-admin/css/login.min.css               |    200 |
| 45 | /wp-includes/fonts/dashicons.eot                                           | discovered on: /wp-includes/css/dashicons.min.css        |    200 |
| 46 | /wp-includes/fonts/dashicons.ttf                                           | discovered on: /wp-includes/css/dashicons.min.css        |    200 |
+----+----------------------------------------------------------------------------+----------------------------------------------------------+--------+
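
A rough sketch of how such a log row might be produced, again in Python and not taken from the plugin. `LogEntry`, `log_url`, and `fetch_status` are hypothetical names, and the glob-style pattern matching is an assumption; the thread does not say how the Excludes table patterns are matched or whether excluded URLs are still fetched.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, Iterable

EXCLUDED_STATUS = 777  # placeholder status used in the thread for excluded URLs


@dataclass
class LogEntry:
    url: str
    note: str
    status: int


def log_url(
    url: str,
    note: str,
    fetch_status: Callable[[str], int],
    exclude_patterns: Iterable[str],
) -> LogEntry:
    """Return a log row; excluded URLs get 777 and are not fetched (assumption)."""
    if any(fnmatchcase(url, pattern) for pattern in exclude_patterns):
        return LogEntry(url, note, EXCLUDED_STATUS)
    return LogEntry(url, note, fetch_status(url))


# Hypothetical exclusion pattern and a stub fetcher that always answers 200.
excludes = ["/wp-json/*"]
for url in ["/feed/", "/wp-json/", "/wp-json/oembed/1.0/embed"]:
    entry = log_url(url, "discovered on: /", lambda u: 200, excludes)
    print(entry.url, entry.note, entry.status)
```
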
crstauf commented 4 years ago

@leonstafford When I was first implementing the progress bar, I toyed with adjusting the progress bar total, but gave that up for some reason. It's possible that these changes now allow progress bar adjustment, or at least make it easier to implement.

If you're in a hurry to get this out, you can drop the progress bars, otherwise, I should be able to spend some time on this tonight and see what I can figure out.

Edit: I probably won’t get to it tonight: need rest. Should be done before Monday.

leonstafford commented 4 years ago

@crstauf no hurry, I'll have the UI showing some progress; it just won't be a progressive %, due to fluctuating totals as new URLs are discovered. Will show what I get there and maybe you can jazz it up for CLI. I should get this PR merged in today/tomorrow, then we'll have some testing/adjusting time before release.

crstauf commented 4 years ago

@leonstafford It does not appear that the changes implemented in the infinite-crawl branch have made their way into the CLI generate command. Is that intentional? I'm going to do my best to get things updated to use the new method, but I may get stuck and need support.

crstauf commented 4 years ago

@leonstafford CLI progress bar has been implemented in infinite-crawl branch on my fork; how do you want me to submit the changes?

"I'm going to do my best to get things updated to use the new method"

I think I got it figured... please check https://github.com/crstauf/static-html-output-plugin/commit/889ee5dc8f7966c3e3641b19121155ee1d12d1c2 for the adjustments.

leonstafford commented 4 years ago

@crstauf amazing! Thanks for dealing with the WIP branch!

If you're awake still and can PR that to the infinite-crawl branch, I'll merge it in, else I'll manually patch in from your branch, looks great!

leonstafford commented 4 years ago

@crstauf I'll then proceed today to work on the deployment side and can replicate your progress indicator there (I plan to change a few things to get deployment progress working in UI, so I wouldn't worry about it in CLI until that's done).

crstauf commented 4 years ago

@leonstafford I can submit a PR in ~2 hours. If you decide to do otherwise, lmk.

crstauf commented 4 years ago

@leonstafford Actually, just figured it out with Working Copy.

leonstafford commented 4 years ago

Perfect, thanks!

leonstafford commented 4 years ago

Closing this; a separate issue/PR is tracking CLI progress indicators: #99