Azure / azure-sdk

This is the Azure SDK parent repository and mostly contains documentation around guidelines and policies as well as the releases for the various languages supported by the Azure SDK.
http://azure.github.io/azure-sdk
MIT License

Generation time makes Windows and Windows+WSL local site generation unusable #7609

Open vhvb1989 opened 5 months ago

vhvb1989 commented 5 months ago

Run bundle exec jekyll serve --profile to measure the site generation time.

Takes a little more than a minute in Codespaces (Linux Debian 11) -> fast file system

Filename                                          | Count |     Bytes |   Time
--------------------------------------------------+-------+-----------+-------
_includes/releases/pkgtable.md                    |    29 | 25048.10K | 12.161
_includes/refs.md                                 |   467 |   886.11K | 11.731
_includes/releases/pkgrow.md                      |    29 | 24754.64K | 11.287
js/tipuesearch_content.js                         |     1 |  9139.52K |  8.810
_layouts/default.html                             |   612 | 60625.11K |  8.066
_includes/releases/notes/common.md                |   288 |  7582.63K |  7.970
_includes/releases/links.md                       |    29 | 16692.55K |  7.263
_includes/releases/languages.md                   |     4 | 11698.72K |  5.956
_includes/sidebar.html                            |   612 |  8729.39K |  4.985
_includes/releases/go.md                          |     9 |  6912.25K |  3.071
_includes/releases/dotnet.md                      |     9 |  5201.75K |  2.762
_includes/releases/java.md                        |     9 |  5180.01K |  2.613
_includes/releases/js.md                          |     9 |  4120.93K |  2.124
_includes/releases/python.md                      |     9 |  3374.67K |  1.806
releases/latest/index.md                          |     1 |  2928.03K |  1.561
_includes/releases/pkgbadge.md                    |    29 | 10190.98K |  1.526
index.md                                          |     1 |  2928.03K |  1.515
releases/latest/mgmt/index.md                     |     1 |  2928.03K |  1.501
releases/latest/all/index.md                      |     1 |  2928.03K |  1.493
_includes/topnav.html                             |   612 |  3233.24K |  1.477
_includes/releases/notes/java.md                  |    37 |  2894.90K |  1.232
_includes/releases/notes/js.md                    |    37 |   864.61K |  1.016
releases/deprecated/index.md                      |     1 |  2170.07K |  0.981
_includes/releases/notes/go.md                    |    33 |  1052.50K |  0.969
_includes/releases/notes/python.md                |    37 |   730.46K |  0.967
_includes/releases/notes/dotnet.md                |    37 |  1111.17K |  0.955
_includes/releases/notes/cpp.md                   |    32 |   161.82K |  0.828
releases/latest/all/go.md                         |     1 |  1889.80K |  0.819
releases/latest/all/dotnet.md                     |     1 |  1477.23K |  0.812
_includes/releases/notes/ios.md                   |    30 |   194.23K |  0.775
releases/latest/all/java.md                       |     1 |  1325.09K |  0.740
_includes/head.html                               |   612 |  1130.79K |  0.685
releases/deprecated/go.md                         |     1 |  1252.68K |  0.570
releases/latest/all/js.md                         |     1 |   987.06K |  0.558
_includes/releases/specs.md                       |     4 |  1370.33K |  0.487
_layouts/post.html                                |   506 | 15041.43K |  0.480
_includes/releases/notes/android.md               |    17 |   126.99K |  0.449
_includes/releases/notes/package_display_names.md |   260 |   413.20K |  0.395
_includes/releases/replace.md                     |    29 |  1398.03K |  0.381
releases/latest/all/python.md                     |     1 |   658.75K |  0.358
releases/latest/all/specs.md                      |     1 |  1143.64K |  0.320
releases/deprecated/dotnet.md                     |     1 |   492.11K |  0.279
releases/latest/mgmt/java.md                      |     1 |   341.08K |  0.243
releases/latest/mgmt/js.md                        |     1 |   380.84K |  0.230
_includes/releases/roadmap.md                     |    23 |    16.79K |  0.230
releases/latest/mgmt/go.md                        |     1 |   458.02K |  0.223
releases/latest/mgmt/python.md                    |     1 |   373.19K |  0.217
_includes/releases/notes/release_highlights.md    |   260 |  4879.54K |  0.208
releases/latest/mgmt/dotnet.md                    |     1 |   347.73K |  0.187
_includes/footer.html                             |   612 |   267.75K |  0.168

                    done in 68.125 seconds.

But, using Windows+WSL, it takes a little more than 20 minutes to generate the site!!!
I didn't even try it in Windows only.
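To compare runs across machines, it may help to rank the profiler rows instead of eyeballing them. The following is only a sketch: it assumes the `--profile` output keeps the pipe-separated layout pasted above, and `parse_profile` and `slowest` are hypothetical helper names. The per-file times in the Codespaces run add up to more than the 68 s total, so they appear to overlap (nested includes), and ranking is probably safer than summing.

```python
def parse_profile(text):
    """Parse rows of the pasted --profile table into (name, count, kbytes, seconds)."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue  # separator line, blank line, or the "done in" footer
        name, count, size, secs = parts
        if not count.isdigit():
            continue  # header row
        rows.append((name, int(count), float(size.rstrip("K")), float(secs)))
    return rows


def slowest(rows, n=3):
    """Return the n rows with the largest reported time."""
    return sorted(rows, key=lambda r: r[3], reverse=True)[:n]


# First rows of the Codespaces run, copied from the table above.
SAMPLE = """\
Filename                                          | Count |     Bytes |   Time
--------------------------------------------------+-------+-----------+-------
_includes/releases/pkgtable.md                    |    29 | 25048.10K | 12.161
_includes/refs.md                                 |   467 |   886.11K | 11.731
_includes/releases/pkgrow.md                      |    29 | 24754.64K | 11.287
js/tipuesearch_content.js                         |     1 |  9139.52K |  8.810
"""

if __name__ == "__main__":
    for name, count, kb, secs in slowest(parse_profile(SAMPLE)):
        print(f"{secs:7.3f}s  {name}  ({count} renders, {kb:.0f}K)")
```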

The generation time is not related to CPU or memory, but to disk I/O

I can get the same time as Codespaces in Windows+WSL by cloning the repo inside the WSL file system, instead of mounting the Windows path into WSL for the repo.
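A rough way to check the disk I/O theory on a given machine is to time many small file writes and reads in each location. This is a sketch under assumptions, not a proper benchmark: `time_small_file_io` is a hypothetical helper, the two directories are example paths (a WSL home directory and a mounted Windows drive), and the OS page cache can hide part of the read cost.

```python
import os
import tempfile
import time


def time_small_file_io(directory, files=200, size=1024):
    """Write then read `files` small files under `directory`; return elapsed seconds."""
    payload = b"x" * size
    start = time.perf_counter()
    # The temporary directory is created inside `directory` and removed afterwards.
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        for i in range(files):
            with open(os.path.join(tmp, f"f{i}.md"), "wb") as fh:
                fh.write(payload)
        for i in range(files):
            with open(os.path.join(tmp, f"f{i}.md"), "rb") as fh:
                fh.read()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Example paths only (assumptions): adjust to where the repo is actually cloned.
    for directory in (os.path.expanduser("~"), "/mnt/c/Users"):
        if os.path.isdir(directory) and os.access(directory, os.W_OK):
            print(f"{directory}: {time_small_file_io(directory):.3f}s")
```

If the mounted Windows path is much slower here, that would be consistent with the generation time being bound by file access and not by CPU.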

Should we consider removing or filtering data older than 2 years from the site?
Is there any value in listing and generating data going back to 2019? @weshaggard

weshaggard commented 5 months ago

I run on Windows only and it hasn't traditionally taken 20 mins for me, but I've not yet tried it after the Ruby upgrade. I do suspect something in the Ruby version upgrades is likely the cause, as I was hitting similar issues in the past when trying to run on the later Ruby version.

This is definitely worth further investigation but I'm not sure when I'll get around to it.

vhvb1989 commented 5 months ago

@weshaggard I did try with Ruby version 2.7 as well and got the same time results.

It seems to be a known issue for Jekyll to become slower as you keep adding more posts. I found a few blogs about how folks would eventually migrate from Jekyll to Hugo or Zola b/c of the generation slowness.

It seems like you can also set --incremental for Jekyll, which will still take ~20 min the first time, but will be faster after that.

weshaggard commented 5 months ago

Fair. I do typically use --incremental but I don't remember it taking 20 mins initially. At some point I'll see what options we have. I'm definitely not opposed to pruning some of the older release stuff but we should consult with the PMs first. @ronniegeraghty do you know if there is any strong reason to keep all the release history?

ronniegeraghty commented 5 months ago

Everything in the releases/latest dir is used to give us the current state of our SDK inventory. It's used as the backend data for our inventory dashboard and I believe it's used for this page of our Release site. I believe the yyyy-mm directories are used for the monthly release notes pages on the Release site (example: May 2024 .NET Releases). So to my knowledge both types are needed. We haven't discussed whether the monthly release notes data needs to persist forever or can be cleaned up every so often. I'll look into this.

Were there other files/directories you noticed slowing down the process?

vhvb1989 commented 5 months ago

Would you consider exploring Zola, @weshaggard ?

It might take some time to migrate it all, but it would allow us not to remove old data (if we want to keep years of SDK data there). I'm happy to start a dev branch and see what it would look like to use Zola.

@ronniegeraghty , the top offenders are:

Filename                                          | Count |     Bytes |   Time
--------------------------------------------------+-------+-----------+-------
_includes/releases/pkgtable.md                    |    29 | 25048.10K |  ~4min
_includes/refs.md                                 |   467 |   886.11K |  ~4min
_includes/releases/pkgrow.md                      |    29 | 24754.64K |  ~4min
js/tipuesearch_content.js                         |     1 |  9139.52K |  ~2min
_layouts/default.html                             |   612 | 60625.11K |  ~2min
_includes/releases/notes/common.md                |   288 |  7582.63K |  ~2min
_includes/releases/links.md                       |    29 | 16692.55K |  ~2min

Those files iterate over all the web content. For example, _includes/releases/pkgtable.md creates a table by looking at all the release history. The more releases we keep adding, the bigger that table becomes, and the more time it takes to generate.

For a non-static webpage (say WordPress), generation is requested on demand and paged.

There are actually 2 things to consider:

For the first one, we can consider cleaning release notes from past years, but, for the second one, we can't really reduce the number of libraries we ship 📦 . For example:

Look at the JS and .NET numbers per SDK release:

JavaScript
There are 925 total Azure SDK library packages published to npm

.NET
There are 972 total Azure SDK library packages published to NuGet from the [azure-sdk account](https://www.nuget.org/profiles/azure-sdk).

Considering 10 languages/shippable libraries:

image

And thinking about ~500 libs on average (some languages like C have fewer), we would still be looking at around 5k libraries per release.

Jekyll has no option for doing parallel generation. It creates pages one by one, running loops of ~5k iterations with disk I/O.

The more libraries we add to the release (likely to keep happening), the more time it will take to generate the front page table: https://azure.github.io/azure-sdk/

So, another option to explore is migrating the entire site to a single-page application with a backend API (Azure SWA could be a cheap option), or at least updating the static HTML page to generate the HTML on the client side based on static JSON files (that's what we do on Awesome azd). The current approximate size of the main HTML of the azure-sdk root page is 2.89 MB, which, for an HTML (text) file, is HUGE!! (GitHub can't even display it.)
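As a sketch of the static JSON idea, the build step could split the package data into small pages that the client fetches and renders, instead of emitting one multi-megabyte HTML table. Everything here is an assumption for illustration: `to_client_payload`, the `page`/`total`/`items` keys, the page size, and the sample package names are hypothetical, not the format Awesome azd or this repo uses.

```python
import json


def to_client_payload(packages, page_size=100):
    """Split package records into JSON-serializable pages of at most page_size rows."""
    pages = []
    for start in range(0, len(packages), page_size):
        pages.append({
            "page": len(pages) + 1,      # 1-based page number
            "total": len(packages),      # lets the client show the overall count
            "items": packages[start:start + page_size],
        })
    return pages


if __name__ == "__main__":
    # Hypothetical records standing in for the real package inventory.
    packages = [{"name": f"azure-example-{i}", "version": "1.0.0"} for i in range(250)]
    for page in to_client_payload(packages):
        text = json.dumps(page)
        print(page["page"], len(page["items"]), "rows,", len(text), "bytes")
```

Each page could then be written to its own file and loaded on demand, which keeps the generated HTML small regardless of how many libraries are listed.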

ronniegeraghty commented 5 months ago

For the Package Table that shows on the Release Site, I believe this is using the _data/releases/latest files. Those files are not a record of every release but the latest data on each package. So, when a new version of a package is released, it doesn't create a new entry in the CSV, it just updates the version number. So, I don't think pkgtable.md's generation time depends on the number of releases we do, but on the number of libraries we've released. And even when a package has been deprecated, I believe we need to show its status on the site, and we can't just remove its entry.
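To make the update-in-place behavior concrete, here is a minimal sketch. The column names `Package` and `Version`, the sample rows, and the `record_release` helper are assumptions for illustration and may not match the real CSV schema.

```python
import csv
import io


def record_release(rows, package, version):
    """Update the package's row in place, or append a row if the package is new."""
    for row in rows:
        if row["Package"] == package:
            row["Version"] = version
            return rows
    rows.append({"Package": package, "Version": version})
    return rows


# Hypothetical CSV content; the real files have more columns.
SAMPLE_CSV = """Package,Version
azure-example-a,1.0.0
azure-example-b,2.1.0
"""

if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
    record_release(rows, "azure-example-a", "1.1.0")  # new version: row count unchanged
    record_release(rows, "azure-example-c", "1.0.0")  # new package: one more row
    print(len(rows), rows)
```

Under this model the table only grows when a new library ships, not when an existing one gets another release.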

For the Monthly Release notes, I could see us removing monthly release notes pages after a certain amount of time. (How long to wait before removing them is uncertain.)

weshaggard commented 5 months ago

> Would you consider exploring Zola, @weshaggard ?

I'm not sure it is worth the effort, and I would prefer to stick with the standard github recommendation as it will likely remain working longer term.

It seems very odd to me that the common shared md files aren't cached in some way as they shouldn't be reading them off disk for everything like they seem to be doing. I wonder if there is some other option to enable caching or something.

That said @vhvb1989 if you want to take this on as a pet project go for it but I want to try and keep it as static as possible. At the end of the day I would also be fine saying you have to work in codespaces/devcontainer if that is what it takes to make it efficient.
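On the caching question raised above, the idea can be sketched as plain memoization: load a shared include once and reuse it for every page that needs it. This is a conceptual sketch only, not how Jekyll works internally; `load_include`, `render_page`, and the in-memory `FAKE_INCLUDES` dictionary are hypothetical stand-ins, and whether Jekyll or a plugin can cache these particular includes would need to be checked separately.

```python
from functools import lru_cache

# Counts how many times the underlying "file" is actually read.
READS = {"count": 0}

# Hypothetical in-memory stand-in for files under _includes/.
FAKE_INCLUDES = {"refs.md": "[ref]: https://example.invalid\n"}


@lru_cache(maxsize=None)
def load_include(name):
    """Return the include source; the cache means each name is read only once."""
    READS["count"] += 1
    return FAKE_INCLUDES[name]


def render_page(title):
    """Render a trivial page that pulls in the shared include."""
    return f"# {title}\n{load_include('refs.md')}"


if __name__ == "__main__":
    for i in range(467):  # refs.md is rendered 467 times in the profile earlier in the thread
        render_page(f"page {i}")
    print("reads:", READS["count"])
```

This only pays off for includes whose output does not depend on the page, which would have to be verified for files like refs.md before relying on it.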