Closed leonstafford closed 4 years ago
@crstauf a heads up on the progress indicator as mentioned in the notes here.
If we use a %-based progress indicator, it will keep fluctuating, because newly discovered URLs are added to the total while crawling.
Do you think something like above will be fine for showing progress?
There is also a new export log that can be inspected while the export is running if you want more info on what's going on. The raw data, with URL, note and status, looks like this (status 777 for now denotes files excluded based on a new Excludes table of plugin and user-added exclusion patterns):
+----+----------------------------------------------------------------------------+----------------------------------------------------------+--------+
| id | url | note | status |
+----+----------------------------------------------------------------------------+----------------------------------------------------------+--------+
| 1 | / | initial_crawl_list | 200 |
| 2 | /arrobase/ | initial_crawl_list | 200 |
| 3 | /author-sitemap.xml | initial_crawl_list | 404 |
| 4 | /blog/ | initial_crawl_list | 200 |
| 5 | /category-sitemap.xml | initial_crawl_list | 200 |
| 6 | /de-home/ | initial_crawl_list | 200 |
| 7 | /favicon.ico | initial_crawl_list | 404 |
| 8 | /french-home/ | initial_crawl_list | 200 |
| 9 | /page-sitemap.xml | initial_crawl_list | 200 |
| 10 | /page-with-external-url-in-style/ | initial_crawl_list | 200 |
| 11 | /post-sitemap.xml | initial_crawl_list | 200 |
| 12 | /robots.txt | initial_crawl_list | 404 |
| 13 | /sample-page/ | initial_crawl_list | 200 |
| 14 | /sassytest/ | initial_crawl_list | 200 |
| 15 | /sitemap.xml | initial_crawl_list | 200 |
| 16 | /sitemap_index.xml | initial_crawl_list | 200 |
| 17 | /comments/feed/ | discovered on: / | 200 |
| 18 | /feed/ | discovered on: / | 200 |
| 19 | /wp-content/themes/twentytwenty/assets/js/index.js | discovered on: / | 200 |
| 20 | /wp-content/themes/twentytwenty/print.css | discovered on: / | 200 |
| 21 | /wp-content/themes/twentytwenty/style.css | discovered on: / | 200 |
| 22 | /wp-content/uploads/2020/05/arrobase@image.jpg | discovered on: / | 200 |
| 23 | /wp-includes/css/dist/block-library/style.min.css | discovered on: / | 200 |
| 24 | /wp-json/ | discovered on: / | 777 |
| 25 | /wp-json/oembed/1.0/embed | discovered on: / | 777 |
| 26 | /sample-page/feed/ | discovered on: /sample-page/ | 200 |
| 27 | /wp-admin/ | discovered on: /sample-page/ | 200 |
| 28 | /wp-admin/css/forms.min.css | discovered on: /wp-admin/ | 200 |
| 29 | /wp-admin/css/l10n.min.css | discovered on: /wp-admin/ | 200 |
| 30 | /wp-admin/css/login.min.css | discovered on: /wp-admin/ | 200 |
| 31 | /wp-admin/js/password-strength-meter.min.js | discovered on: /wp-admin/ | 200 |
| 32 | /wp-admin/js/user-profile.min.js | discovered on: /wp-admin/ | 200 |
| 33 | /wp-includes/css/buttons.min.css | discovered on: /wp-admin/ | 200 |
| 34 | /wp-includes/css/dashicons.min.css | discovered on: /wp-admin/ | 200 |
| 35 | /wp-includes/js/jquery/jquery-migrate.min.js | discovered on: /wp-admin/ | 200 |
| 36 | /wp-includes/js/jquery/jquery.js | discovered on: /wp-admin/ | 200 |
| 37 | /wp-includes/js/underscore.min.js | discovered on: /wp-admin/ | 200 |
| 38 | /wp-includes/js/wp-util.min.js | discovered on: /wp-admin/ | 200 |
| 39 | /wp-includes/js/zxcvbn-async.min.js | discovered on: /wp-admin/ | 200 |
| 40 | /wp-content/themes/twentytwenty/assets/fonts/inter/Inter-italic-var.woff2 | discovered on: /wp-content/themes/twentytwenty/style.css | 200 |
| 41 | /wp-content/themes/twentytwenty/assets/fonts/inter/Inter-upright-var.woff2 | discovered on: /wp-content/themes/twentytwenty/style.css | 200 |
| 42 | /wp-admin/images/loading.gif | discovered on: /wp-admin/css/forms.min.css | 200 |
| 43 | /wp-admin/images/w-logo-blue.png | discovered on: /wp-admin/css/login.min.css | 200 |
| 44 | /wp-admin/images/wordpress-logo.svg | discovered on: /wp-admin/css/login.min.css | 200 |
| 45 | /wp-includes/fonts/dashicons.eot | discovered on: /wp-includes/css/dashicons.min.css | 200 |
| 46 | /wp-includes/fonts/dashicons.ttf | discovered on: /wp-includes/css/dashicons.min.css | 200 |
+----+----------------------------------------------------------------------------+----------------------------------------------------------+--------+
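As a rough illustration of how such a log could be inspected mid-export, here is a minimal Python sketch (Python is used only for illustration; the plugin itself is PHP). It assumes the log lives in a table named export_log, which is a hypothetical name since the thread does not give one, and it uses an in-memory SQLite database purely as a stand-in for the WordPress database. The sample rows are copied from the table above.

```python
import sqlite3

# Placeholder status the thread uses "for now" to mark excluded URLs.
EXCLUDED_STATUS = 777


def summarize_export_log(conn):
    """Return {status: row count} for the (hypothetical) export_log table."""
    cursor = conn.execute(
        "SELECT status, COUNT(*) FROM export_log GROUP BY status ORDER BY status"
    )
    return dict(cursor.fetchall())


# In-memory SQLite stands in for the real database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE export_log ("
    "id INTEGER PRIMARY KEY, url TEXT, note TEXT, status INTEGER)"
)
conn.executemany(
    "INSERT INTO export_log (url, note, status) VALUES (?, ?, ?)",
    [
        ("/", "initial_crawl_list", 200),
        ("/author-sitemap.xml", "initial_crawl_list", 404),
        ("/feed/", "discovered on: /", 200),
        ("/wp-json/", "discovered on: /", 777),
        ("/wp-json/oembed/1.0/embed", "discovered on: /", 777),
    ],
)

summary = summarize_export_log(conn)
excluded = summary.get(EXCLUDED_STATUS, 0)
print(summary, "excluded:", excluded)
```

Grouping by status is one quick way to see how many URLs were excluded or returned 404 while the export is still running.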
@leonstafford When I was first implementing the progress bar, I toyed with adjusting the progress bar total, but gave that up for some reason. It's possible that the changes now allow progress bar adjustment, or at least make it easier to implement.
If you're in a hurry to get this out, you can drop the progress bars, otherwise, I should be able to spend some time on this tonight and see what I can figure out.
Edit: I probably won’t get to it tonight: need rest. Should be done before Monday.
@crstauf no hurry, I'll have the UI showing some progress, it just won't be a progressive %, due to fluctuating totals as new URLs are discovered. Will show what I get there and maybe you can jazz it up for CLI. I should get this PR merged in today/tomorrow, then we'll have some testing/adjusting time before release.
@leonstafford It does not appear that the changes the infinite-crawl branch implements have made their way into the CLI generate command. Is that intentional? I'm going to do my best to get things updated to use the new method, but I may get stuck and need support.
@leonstafford CLI progress bar has been implemented in the infinite-crawl branch on my fork; how do you want me to submit the changes?
Re: "I'm going to do my best to get things updated to use the new method"
I think I got it figured out... please check https://github.com/crstauf/static-html-output-plugin/commit/889ee5dc8f7966c3e3641b19121155ee1d12d1c2 for the adjustments.
@crstauf amazing! Thanks for dealing with the WIP branch!
If you're still awake and can PR that to the infinite-crawl branch, I'll merge it in; otherwise I'll manually patch it in from your branch. Looks great!
@crstauf I'll then proceed today to work on the deployment side and can replicate your progress indicator there. I plan to change a few things to get deployment progress working in the UI, so I wouldn't worry about it in CLI until that's done.
@leonstafford I can submit a PR in ~2 hours. If you decide to do otherwise, lmk.
@leonstafford Actually, just figured it out with Working Copy.
Perfect, thanks!
closing this, separate issue/PR tracking CLI progress indicators #99
As referenced in #81 and #79
The issue is about detecting too much (e.g., admin view CSS files not used on the site) and detecting too little (e.g., WPML paginated URLs).
Ironically, normal full spidering was how this plugin used to function many years ago, more like SimplerStatic/SimplyStatic.
Not wanting to totally rearchitect this plugin (that's what WP2Static is for), a relatively simple adjustment should be made to keep crawling until all newly discovered internal URLs have been crawled.
Previous shortcomings in the HTML/CSS parsing failed to detect all assets, which is where the greedy detection was a useful workaround, but should no longer be needed.
This proposed change is not the ultimate in elegance/extensibility, but should provide a good improvement without requiring too much effort.
flow
Note: crawl progress will be harder to measure in % terms, so it should shift to detected vs. crawled counts.
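To make the note above concrete, here is a small Python sketch of why a percentage can move backwards while a detected vs. crawled count stays readable. The function names and the snapshot numbers are made up for illustration and are not taken from the plugin.

```python
def percent_complete(crawled, detected):
    """Naive percentage; drops whenever detection outpaces crawling."""
    return 100.0 * crawled / detected if detected else 0.0


def format_progress(crawled, detected):
    """Report raw counts instead of a percentage."""
    return f"Crawled {crawled} of {detected} detected URLs"


# Hypothetical (crawled, detected) snapshots taken during one crawl.
snapshots = [(8, 16), (16, 25), (17, 39)]

percents = [percent_complete(c, d) for c, d in snapshots]
for (crawled, detected), pct in zip(snapshots, percents):
    print(f"{format_progress(crawled, detected)} ({pct:.1f}%)")
```

In the last snapshot one more URL was crawled but fourteen more were detected, so the percentage falls even though work was done; the count-based message does not have that problem.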
The crawler knows to continue only because the crawl_queue is not empty.
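A minimal Python sketch of that loop, under stated assumptions: crawl_queue and crawl_log are modeled as in-memory structures rather than DB tables, the link graph in SITE_LINKS is invented, and fetch_links is a hypothetical callback standing in for fetching a page and parsing out its internal URLs.

```python
from collections import deque

# Hypothetical site: each path maps to the internal URLs found on that page.
SITE_LINKS = {
    "/": ["/feed/", "/sample-page/"],
    "/sample-page/": ["/sample-page/feed/", "/"],
    "/feed/": [],
    "/sample-page/feed/": [],
}


def crawl(initial_crawl_list, fetch_links):
    """Crawl until the queue is empty; return (crawl order, crawl log)."""
    crawl_queue = deque()
    crawl_log = {}  # path -> note describing where it was detected

    def detect(path, note):
        # A newly detected URL is written to both the log and the queue;
        # URLs already in the log are skipped so the loop terminates.
        if path not in crawl_log:
            crawl_log[path] = note
            crawl_queue.append(path)

    for path in initial_crawl_list:
        detect(path, "initial_crawl_list")

    crawled = []
    while crawl_queue:  # the only signal to keep going
        path = crawl_queue.popleft()
        crawled.append(path)
        for found in fetch_links(path):
            detect(found, f"discovered on: {path}")
    return crawled, crawl_log


crawled, crawl_log = crawl(["/"], lambda path: SITE_LINKS.get(path, []))
print(crawled)
print(crawl_log)
```

Recording every detected URL in the log before queueing it is what keeps a page that links back to an already seen URL from being crawled twice.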
Tasks
[x] don't detect plugin/theme assets to build initial crawl list
[x] new DB tables: crawl_queue, crawl_log (path, where detected, response status)
[x] detected URLs are written into both crawl_queue and crawl_log
[x] remove crawl_again task
[x] crawl task to use crawl_queue
[x] use fixed archive and zip name
[x] remove flaky detection of pagination URLs
[x] move URL lists away from flat TXT files into Database (pending deploy lists):
[ ] adjust progress indicators in UI and CLI (pending deploy / invalidation progress)
[x] prevent any query string URLs being detected (Pending /?wp_block=untitled-reusable-block Note: initial_crawl_list)
[x] check options to ignore/delete DeployCache
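For the query string task in the list above, one possible filter is sketched below in Python. The function names are hypothetical and this is not the plugin's implementation; it only shows the idea of dropping any detected URL that carries a query string, such as the /?wp_block=untitled-reusable-block example.

```python
from urllib.parse import urlsplit


def has_query_string(url):
    """True when the URL has a non-empty query component."""
    return bool(urlsplit(url).query)


def filter_detected(urls):
    """Keep only URLs without a query string, preserving order."""
    return [url for url in urls if not has_query_string(url)]


detected = [
    "/sample-page/",
    "/?wp_block=untitled-reusable-block",
    "/blog/?page=2",
    "/wp-content/themes/twentytwenty/style.css",
]
kept = filter_detected(detected)
print(kept)
```

Applying the filter at detection time, before anything is queued, would keep such URLs out of both the queue and the log.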