Open vhvb1989 opened 5 months ago
I run on Windows only and it hasn't traditionally taken 20 mins for me, but I've not yet tried it after the Ruby upgrade. I do suspect something in the Ruby version upgrades is likely the cause, as I was hitting similar issues in the past when trying to run on the later Ruby version.
This is definitely worth further investigation but I'm not sure when I'll get around to it.
@weshaggard I did try with Ruby version 2.7 as well and got the same time results.
It seems to be a known issue for Jekyll to become slower as you keep adding more posts. I found a few blogs about how folks would eventually migrate from Jekyll to Hugo or Zola b/c of the generation slowness.
It seems like you can also set --incremental for Jekyll, which will still take ~20 min the first time, but will be faster after that.
Fair. I do typically use --incremental but I don't remember it taking 20 mins initially. At some point I'll see what options we have. I'm definitely not opposed to pruning some of the older release stuff but we should consult with the PMs first. @ronniegeraghty do you know if there is any strong reason to keep all the release history?
Everything in the `releases/latest` dir is used to give us the current state of our SDK inventory. It's used as the backend data for our inventory dashboard and I believe it's used for this page of our Release site.
I believe the `yyyy-mm` directories are used for the monthly release notes pages on the Release site. Example: May 2024 .NET Releases
So to my knowledge both types are needed. We haven't discussed whether the Monthly Release notes data needs to persist forever or can be cleaned every so often. I'll look into this.
Were there other files/directories you noticed slowing down the process?
Would you consider exploring Zola, @weshaggard ?
It might take some time to migrate it all, but it would allow us not to remove old data (if we want to keep years of SDK data there). I'm happy to start a dev branch and see what it would look like to use Zola.
@ronniegeraghty , the top offenders are:
Filename                           | Count |     Bytes | Time
-----------------------------------+-------+-----------+------
_includes/releases/pkgtable.md     |    29 | 25048.10K | ~4min
_includes/refs.md                  |   467 |   886.11K | ~4min
_includes/releases/pkgrow.md       |    29 | 24754.64K | ~4min
js/tipuesearch_content.js          |     1 |  9139.52K | ~2min
_layouts/default.html              |   612 | 60625.11K | ~2min
_includes/releases/notes/common.md |   288 |  7582.63K | ~2min
_includes/releases/links.md        |    29 | 16692.55K | ~2min
Those files iterate over all the web content. For example, `_includes/releases/pkgtable.md` creates a table by looking at all the release history. The more releases we keep adding, the bigger that table becomes, and the more time it takes to generate.
For a non-static webpage (say WordPress), generation is requested on demand and paged.
There are actually 2 things to consider: the release notes history and the latest release.
For the first one, we can consider cleaning release notes from past years, but, for the second one, we can't really reduce the number of libraries we ship 📦 . For example:
Look at the JS and .NET numbers per SDK release:
JavaScript
There are 925 total Azure SDK library packages published to npm
.NET
There are 972 total Azure SDK library packages published to NuGet from the [azure-sdk account](https://www.nuget.org/profiles/azure-sdk).
Considering 10 languages with shippable libraries, and assuming ~500 libs on average (some languages like C have fewer), we would still be looking at around 5k libraries per release.
Jekyll has no option for doing parallel generation. It creates pages one at a time, going through roughly 5k iterations with disk I/O in each.
The more libraries we add to the release (likely to keep happening), the more time it will take to generate the front page table: https://azure.github.io/azure-sdk/
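The back-of-the-envelope estimate above can be written out as a tiny sketch. The inputs are the round numbers from this comment (10 languages, ~500 libraries on average), not measured data, and the function name is made up for illustration.

```python
# Round numbers taken from the comment above; they are assumptions, not measurements.
LANGUAGES = 10
AVG_LIBS_PER_LANGUAGE = 500


def estimated_libraries(languages: int, avg_libs: int) -> int:
    """Rough count of library rows a single release table has to render."""
    return languages * avg_libs


total = estimated_libraries(LANGUAGES, AVG_LIBS_PER_LANGUAGE)
print(f"~{total} libraries per release")
```

Since the pages are generated sequentially, this count is a lower bound on the number of per-library iterations for the front page table.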
So, another option to explore is migrating the entire site to a single-page application with a backend API (Azure SWA could be a cheap option), or at least updating the static HTML page to generate the HTML on the client side from static JSON files (that's what we do on Awesome azd). The current approximate size of the main HTML of the azure-sdk root page is 2.89 MB, which, for an HTML (text) file, is HUGE!! (GitHub can't even display it)
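To make the static-JSON idea a bit more concrete, here is a minimal sketch of the build-side half: turning package CSV data into a JSON document that a client-side script could fetch and render. The column names (`Package`, `Version`) and the sample row are hypothetical and do not reflect the real CSV schema in the repo.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text (with a header row) into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


# Hypothetical sample data; the real package CSV files have their own columns.
SAMPLE_CSV = '"Package","Version"\n"example-lib","1.0.0"\n'

print(csv_to_json(SAMPLE_CSV))
```

With something like this, the generated page would only need to ship a small HTML shell plus the JSON file, and the table would be built in the browser rather than at site generation time.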
For the Package Table that shows on the Release Site, I believe this is using the `_data/releases/latest` files. Those files are not a record of every release but the latest data on each package. So, when a new version of a package is released, it doesn't create a new entry in the CSV, it just updates the version number. So, I don't think `pkgtable.md`'s generation time depends on the number of releases we do, but on the number of libraries we've released. And even when a package has been deprecated, I believe we need to show its status on the site, and we can't just remove its entry.
For the Monthly Release notes, I could see us removing monthly release notes pages after a certain amount of time. (How long to wait before removing them is uncertain.)
> Would you consider exploring Zola, @weshaggard?
I'm not sure it is worth the effort, and I would prefer to stick with the standard github recommendation as it will likely remain working longer term.
It seems very odd to me that the common shared md files aren't cached in some way as they shouldn't be reading them off disk for everything like they seem to be doing. I wonder if there is some other option to enable caching or something.
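As an illustration of the kind of caching being asked about here (this is a generic sketch, not how Jekyll works internally), memoizing the read of a shared file means the disk is touched once no matter how many pages include it. The file name and the 467 iterations mirror the `_includes/refs.md` count from the profile table; the helper names are made up.

```python
import functools
import os
import tempfile

read_count = 0  # counts real disk reads


def read_include(path: str) -> str:
    """Read a shared include from disk, counting each real read."""
    global read_count
    read_count += 1
    with open(path, encoding="utf-8") as handle:
        return handle.read()


@functools.lru_cache(maxsize=None)
def read_include_cached(path: str) -> str:
    """Memoized wrapper: the file is read from disk only on first use."""
    return read_include(path)


with tempfile.TemporaryDirectory() as tmp:
    shared = os.path.join(tmp, "refs.md")
    with open(shared, "w", encoding="utf-8") as handle:
        handle.write("[example]: https://example.invalid\n")
    # Simulate 467 pages each pulling in the same include.
    for _ in range(467):
        read_include_cached(shared)

print(read_count)
```

Whether Jekyll exposes an equivalent option is exactly the open question in this comment; the sketch only shows why such a cache would cut down the repeated disk reads.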
That said @vhvb1989 if you want to take this on as a pet project go for it but I want to try and keep it as static as possible. At the end of the day I would also be fine saying you have to work in codespaces/devcontainer if that is what it takes to make it efficient.
Run `bundle exec jekyll serve --profile` to measure the site generation time.

It takes a little more than a minute in Codespaces (Linux Debian 11) -> fast file system
But, using Windows+WSL, it takes a little more than 20 minutes to generate the site!!!

I didn't even try it in Windows only.
The generation time is not related to CPU or memory, but to disk I/O
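One rough way to check the disk I/O claim on a given machine is to time many small file reads in the directory where the repo lives. The sketch below is an assumption-laden probe (the file count and names are arbitrary, and OS caching will affect the numbers), not a proper benchmark.

```python
import os
import tempfile
import time


def time_small_file_reads(directory: str, file_count: int = 200) -> float:
    """Write small probe files into `directory`, then time reading them all back."""
    paths = []
    for index in range(file_count):
        path = os.path.join(directory, f"probe_{index}.md")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("probe\n")
        paths.append(path)

    start = time.perf_counter()
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            handle.read()
    elapsed = time.perf_counter() - start

    # Clean up the probe files so the directory is left as it was found.
    for path in paths:
        os.remove(path)
    return elapsed


# Demo against a temporary directory; point it at the repo location to compare file systems.
with tempfile.TemporaryDirectory() as tmp:
    elapsed = time_small_file_reads(tmp)
print(f"{elapsed:.4f}s for 200 small reads")
```

Running the function once against a path on the mounted Windows drive and once against a path inside the WSL file system should show whether small-file reads are the bottleneck.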
I can get the same time as Codespaces in Windows+WSL by cloning the repo inside the WSL file system, instead of mounting the Windows path into WSL for the repo.

Should we consider removing or filtering data older than 2 years from the site?
Is there any value in listing and generating data going back to 2019? @weshaggard
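If filtering by age turns out to be acceptable, picking the candidates could look something like the sketch below. It assumes the monthly data lives in directories named `yyyy-mm` and uses a 2-year cutoff; the function name and the sample directory list are made up for illustration.

```python
from datetime import date


def months_older_than(dir_names, today, years=2):
    """Return the yyyy-mm directory names that fall before the cutoff month."""
    cutoff = (today.year - years, today.month)
    old = []
    for name in dir_names:
        try:
            year_str, month_str = name.split("-")
            key = (int(year_str), int(month_str))
        except ValueError:
            continue  # skip names that are not yyyy-mm, such as "latest"
        if key < cutoff:
            old.append(name)
    return old


# Hypothetical directory listing and reference date.
sample = ["2019-11", "2022-05", "2024-05", "latest"]
print(months_older_than(sample, date(2024, 6, 1)))
```

The function only reports candidates; whether anything is actually removed or just excluded from the build would still depend on the outcome of the discussion with the PMs.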