Sounds good. I think I can contribute some things that I already know about these steps.
This will primarily target the "Build" workflow.
Step | Approx. time spent | Discussion
---|---|---
Install system requirements | ~2min | Downloads all system dependencies (compilers for artifacts, TeX, fonts, etc.). I couldn't figure out how to cache it properly without breaking post-install steps.
Install Stack | <1s | -
Cache dependencies | ~2min on cache hit, 1s otherwise | The main issue here is how often it doesn't get a cache hit. Needs investigation into why/why not.
Clean previous run | ~1min | Cleans the previous run, but Cabal also updates its cache here -- this may need investigation into why it's not being cached too.
Update PATH | <1s | Adds the Stack exe folder to $PATH.
HLint | ~50s | Downloads HLint (fast) and runs it (slow). I don't think there's much we can do here unless we want to remove it from the "Build" workflow and just leave it in the "Test" one.
Install dependencies | ~5min | Installs Haskell package dependencies. Needs investigation into which ones we need and which we don't. However, I know that lens takes a large amount of time to compile (nowhere near 5min, though!).
Build | ~3min | We will need to analyze our build plan; we might be able to reduce it a bit by exporting more files in each package and by adding extra threads (though GH Actions runners only have 2 cores / 2 threads).
Stable | ~1min | We could merge this with the above, but it also does extra testing in "Build". Perhaps we can remove it from "Build" in favour of leaving it just in "Test"?
TeX | ~3min | We might be able to run these concurrently by scheduling them as background jobs (see the sketch after this table). It might be difficult to figure out which artifacts broke, however.
Code | ~1min | Compiling code; we might be able to do the same as "TeX". We might be able to remove it from "Build" entirely, leaving it only in "Test".
Doxygen | ~10s | -
Docs | ~5min | Takes a long time because it has to re-build the originally compiled code with Haddock docs (this reuses the original build, so it's fast; most time is spent in Haddock generation), and then has to recompile from a clean slate (which takes a long time) to generate the smaller version without all modules exposed.
Graphs | ~20s | -
Analysis | ~2.5min | @Ant13731 would probably be best placed to speak on this.
Build website | ~30s | The code is compiled alongside the "Build" step and just executed here, so it's pretty fast. However, we might be able to make it a bit faster by removing file copying and building in place (right now, we build artifacts in their own directories and then copy them over to a central "deploy/" folder for the "Deploy" step).
Deploy | ~20s | Pushes the website to the gh-pages branch (for GitHub Pages). Needs further investigation to see if we want to bring it down further.
POST Cache dependencies | ~1min | Alongside "Clean previous run", this can be investigated to see if we can strip it down further.
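As a rough illustration of the background-jobs idea mentioned in the TeX row above: two artifact builds could run concurrently inside a single step while still reporting which one failed. The `make tex-a`/`make tex-b` targets are placeholders, not real targets in the repo.

```yaml
# Sketch only: run two TeX artifact builds concurrently and report which one failed.
# The make targets are placeholders for whatever per-artifact targets we'd use.
- name: Build TeX artifacts concurrently
  run: |
    make tex-a & pid_a=$!
    make tex-b & pid_b=$!
    status=0
    wait "$pid_a" || { echo "tex-a failed"; status=1; }
    wait "$pid_b" || { echo "tex-b failed"; status=1; }
    exit "$status"
```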
We might want to move some tests to just the "Test" workflow; we would just need to decide which ones.
Also, for the "Test" workflow, we can probably trim it down by skipping the second "Haddock docs" build, since the first one is fast and builds the fully-exposed-modules variant (which would catch all bad Haddock comments anyway).
Related to #2642 as well, because we might need to rearrange some of our make targets to assist with this. I think there are a few things we can remove entirely (such as checking the Stack version; I believe Stack does that automatically anyway).
`analysis` takes a few minutes, mostly because of the size and number of dot graphs. The files (especially the `lang`, `code`, and `gool` packages) take a while to compile. The dot graphs are also generated using two different algorithms: the circular-based one is a little easier to read for larger packages, while the normal dot algorithm shows the hierarchy a little more nicely.
Same as above for the `graphs` target, but we could definitely reduce it a bit. Right now it generates the module dependency graphs as `svg`, `pdf`, and `png` files. The `png` is needed to display the module dependency graph inside the website (the `svg` graph was too big and didn't align properly). Right now, `pdf`s are being used for the links to each graph, but I think it would be nice to have the `svg` format also appear on the website. I'll have to see if the `svg` versions work; otherwise we can just remove them from the `graphs` target.
If you could actually do the investigation as to why some of the bits are slow, that would be good. At times, it might be worth posting bug reports in the appropriate place (like for Haddock). Going parallel isn't nearly as good a solution as actually cutting the time down. Things like Cabal updating its cache seem like a waste of time; we should find a way around that. That our build takes 3 mins? That seems fair.
Mostly complete information on the dependencies (just need to check out `containers` a bit more):

From Stackage:
- `nubSort`, which is effectively `nub . sort`. Likely removable -- can be replaced with a function we export via `drasil-utils`.
- `lens` relies on it, we might as well use it too.
- `splitOn :: Eq a => [a] -> [a] -> [[a]]`. Likely removable -- can be replaced with a function we export via `drasil-utils`.
- `mtl`: `lens` relies on it, we might as well use it too.

From Hackage:
- `encode` called on a String, likely removable.
- `unicode-names` is outdated, it has a dependency on this package.

Expected bundled with GHC (at least):
I thought there were pre-compiled versions of stuff on stackage? Yes, we need lens, but do we need to compile it every time?
I don't believe there are pre-compiled libraries (or exes) from Stackage. It seems it's always just source code that we build locally and cache in `./.stack-work/` (for local projects) and `~/.stack/` (for the dependencies from Stackage). The CI currently tries to cache dependencies and to load the cache from its previous run on a branch. When it gets a cache hit (a working cache load), we don't need to compile `lens` and the other deps. However, when the CI fails to get a cache hit (which often happens), we have to download all the dependencies and recompile them. The cache hit frequency is of concern here because, without a cache hit, the CI generally takes 10min longer than with one.
I'm not entirely certain why the cache hits fail as often as they do, but I suspect that one of the causes is failed CI runs. Whenever a CI run loads a cache, the cache appears to be deleted. If the run then succeeds, the cache is placed back into GitHub cache storage. However, if the run fails, the cache is never put back.
Does our code need to do anything explicit to do caching? What I mean is, to put the cache back, is there something in our scripts that does that? If so, maybe it can be moved to right after the successful build of the dependencies, rather than at the end?
That's interesting. I don't know if we can.
In this snippet, we set up the cache using `actions/cache@v2`, and it places a POST-hook at the end of the entire CI job to put the cache back.
https://github.com/JacquesCarette/Drasil/blob/8bbae7e1dfa9a00d757c740a6f5f9724c9f1cdc0/.github/workflows/Test.yaml#L40-L49
I didn't find any documentation regarding moving the cache hook up in priority. It would definitely be good if we could do that! I'll try to look around.
Another issue that slows down deployment is that in the CI, the `Clean previous run` step (which just runs `make clean`) will start updating the Cabal cache.
Example run where this occurs: https://github.com/JacquesCarette/Drasil/runs/3413942881
Maybe ideas cribbed from something like https://glebbahmutov.com/blog/parallel-cypress-tests-gh-action/ could help?
On cabal update: that seems like it could be a `stack clean` bug. It looks like the extra-deps cause Stack to download stuff (it shouldn't), even though the source https://github.com/commercialhaskell/stack/blob/master/src/Stack/Clean.hs makes it clear that the first thing it does next is ignore all the extra-deps!
After reading around the GitHub Actions docs, it appears that we can only upload artifacts and download them from within the same workflow. I guess that feature is only meant for sharing between jobs. So, it seems we're forced to use the GitHub cache mechanism. However, it does appear we can add backup restoration keys (e.g., instead of looking for just "X", we can list "X, Y, Z" as an ordered list of keys under which to look for caches).
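For reference, a minimal sketch of what such backup restoration keys look like with `actions/cache` (the paths and key format here are illustrative, not necessarily what our workflow should use). If the exact key misses, the runner falls back to the most recent cache whose key starts with one of the `restore-keys` prefixes:

```yaml
# Illustrative sketch only -- cache Stack's dependency directories with fallback keys.
- name: Cache Stack dependencies
  uses: actions/cache@v2
  with:
    path: |
      ~/.stack
      .stack-work
    key: ${{ runner.os }}-stack-${{ hashFiles('**/stack.yaml', '**/package.yaml') }}
    restore-keys: |
      ${{ runner.os }}-stack-
      ${{ runner.os }}-
```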
Regarding `stack clean`, does it need to have the list of extra-deps paths before it runs that code? The `stack clean` issue I brought up might've actually been caused by the cache miss again. I think that CI run would've had to update the Hackage index somewhere anyway so that it can download `multiplate` and the other extra-deps.
I think I've found one of the main culprits for the cache misses: caches aren't shared between workflows. Specifically, despite sharing the same keys, the caches in the `Build` and `Test` workflows aren't shared. Source.
Caches are only shared within a single workflow, across branches in its history. For example, if I have branch `X` based on `master`, then a workflow that runs on both `master` and `X` may share its caches built for `master` when running on branch `X`. Additionally, if we have a branch `Y` that branches off of `X`, it may use the caches built from `X`.
Since we currently have `Build`, which only runs on `master`, and `Test`, which only runs on non-`master` branches, when a non-master branch has its first CI run, it doesn't use the caches built by `Build` for `master`. However, if we branch off of said new branch, subsequent CI runs may use the caches built from its runs.
Potential solutions:
Resolution | How it Works | Pros | Cons
---|---|---|---
Share the same workflow file | Since we would only have one workflow file, the main cache would be carried on `master`, and all branches based on it would be able to access it for their CI runs. | - | Having differences between the steps in the "Build" and "Test" phases will be difficult/messy. The "Deploy" job/component might need to duplicate everything in the normal "Test" and "Build" workflow.
Run the Test workflow on the `master` branch 1-2x/week, to generate a cache for Test on `master` (see the scheduled-run sketch below this table) | We would build a cache on a weekly basis for the `master` branch on the Test workflow. Then, all branches based on `master` would be able to access an up-to-a-week-old cache (age doesn't matter here, I think). | Contributes to another ticket where we run tests on `master` as well (#2698). | Arguably a hacky way to have caching on `master` for the Test workflow.
Use a different workflow setup (e.g., `master` <- `dev` <- feature branches) | Build would remain the workflow for `master`, and Test would be the workflow that builds caches for `dev` and branches based on `dev`. | Potentially good for a release setup: active development happens on the `dev` branch, and whenever we want a tagged release of the software, we create a PR to `master` and issue a release. I don't think the changes to the workflow files would be much, if anything. Releases to `master` would happen infrequently; assuming at least 1x/week, a cache hit should always occur for the Build workflow on `master`. | Extra step in the workflow setup. If no PRs are merged into `dev` at least 1x/week, then the cache on `dev` will be evicted and the timings will revert.
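For the second option, the scheduled run itself would be a small addition to the Test workflow's triggers. A sketch (the cron time is arbitrary, and the `branches-ignore` part just mirrors the current "non-master only" behaviour described above):

```yaml
# Sketch: also run the Test workflow weekly on the default branch (master),
# so that a cache exists for branches based on master to fall back on.
on:
  push:
    branches-ignore:
      - master
  schedule:
    - cron: '0 6 * * 1'   # every Monday, 06:00 UTC (arbitrary choice)
```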
Despite the first solution being my preferred/"personally optimal" one, where we would just append extra steps to the "Test" workflow to make it a "Build" (essentially just releasing + pushing to GH Pages), it seems really difficult to do with the way GitHub Actions workflows work. So, I think we might need to rule that one out.
However, all three are still imperfect solutions. Whenever a CI run is in progress and we trigger another one over it, the cache consumed by the first run will be erased and there will be no cache for the second one. This is because we cancel the first run before starting the second (by choice; we don't actually need to, but it spares us build minutes), thereby cancelling the step that puts the cache back, which normally happens at the end of a CI run.
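For what it's worth, if we keep the cancel-the-older-run behaviour, GitHub's built-in `concurrency` setting is one way it is typically expressed. I'm not claiming this is how our workflows currently do the cancellation; it's just a sketch of the mechanism whose cancellation also skips the cache save:

```yaml
# Sketch: when a new run starts for the same branch, cancel the in-progress one.
# Note that the cancelled run never reaches its POST cache step, so its cache is lost.
concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true
```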
Hmm, it seems I don't fully understand our full setup well enough to have an informed opinion on this. I think we might need to have a quick meeting to discuss, as that's likely to be a more efficient way to fill in my knowledge gaps.
Regarding lens, which is a fairly large dependency that takes a lot of time to compile: there is a variant of it by the same authors called "microlens", which claims to be fully compatible. I'm not quite sure if it suits our needs, but I recall @JacquesCarette mentioning that we only use a few core pieces of lens. If that's the case, we might be able to use that library instead. The micro version also appears to have far fewer dependencies.
It most likely is compatible, given our fairly small use of features. Plus, we should have scheduled that meeting...
Yes, we should have that meeting. Is tomorrow at 11am/4pm alright?
Let's do 11am.
Sounds good.
With #2831 and #2833, we have many more cache hits in our logs and less needless Haddock building. The build times are faster because of these changes.
Now, taking a look at https://github.com/JacquesCarette/Drasil/runs/3571294713 , I think we can search through the logs for errors to see which areas could potentially be slowing us down. For example, there are LaTeX errors, 3 LaTeX compiles for each SRS (I wonder if we could get away with just 2?), Haddock linking errors (warnings), and the `Build` step currently runs `stack install` individually on each package (I think we could change that to just one big `stack build`; might that help?). I'd also like to add a bit more logging to the "Analyze Drasil Code" step so that we can tell what's going on from the PRs. Is the "Analyze Drasil Code" step required for normal PR tests? Is it a program that could error out/cause problems if we were to remove it from the standard tests for PRs?
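On the per-package `stack install` point, the consolidated step could be roughly the following. This is only a sketch: `stack install` also copies executables (it is `stack build --copy-bins`), so if later steps expect binaries in a known location, the flags would need to match whatever the workflow currently relies on.

```yaml
# Sketch: build the whole multi-package project in one invocation instead of
# running `stack install` once per package. Flags shown are illustrative.
- name: Build all packages
  run: stack build --copy-bins --local-bin-path ./bin
```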
It would be nice if we could figure out a way to skip building the TeX and software artifacts when we can tell that the related code didn't change in a PR/push.
Right now, the main thing that stands out to me is 'Analyze Drasil Code', which takes way longer than I expected. I wonder if it is inefficient.
Then indeed, the latex. A couple of days ago I discovered Ultrafast PDFLatex with precompiling which seems very promising!
> Right now, the main thing that stands out to me is 'Analyze Drasil Code', which takes way longer than I expected. I wonder if it is inefficient.

With #2838, I realized that the majority of the time is spent in converting dot files into SVGs. I see why we need the SVGs for the website, but do we need them for testing? Perhaps we can just generate the initial dot files (to make sure that our .hs files are still working) and skip the conversion into SVGs when we're only testing a PR's code?

> Then indeed, the latex. A couple of days ago I discovered Ultrafast PDFLatex with precompiling which seems very promising!

Interesting, this sounds good!
I agree, for testing we should be able to generate the dot files, but not the rest, since that's not really testing Drasil per se. Generating the SVGs can be done solely for deploy. Not doing something is the best way to make things faster!
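A sketch of that split, assuming a hypothetical `graphs-svg` make target for the conversion and that deploys only happen from `master`:

```yaml
# Sketch: always generate the dot files, but only convert them to SVG on master
# (i.e., for deploy runs). `graphs-svg` is a placeholder target name.
- name: Generate module dependency graphs (dot)
  run: make graphs

- name: Convert graphs to SVG (deploy only)
  if: github.ref == 'refs/heads/master'
  run: make graphs-svg
```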
I don't think the build times have been very problematic recently; cache hits are frequent. The main issue with our Actions seems to be GitHub network errors (lol).
Should we close this issue in favour of taking a look at it again when the build times become a bottleneck again, @JacquesCarette ?
Regarding the dependencies, we can investigate that in #2698 because we'll need to re-build them anyways for it.
So are the times still roughly on par with the table at the top of this issue? I'd be curious to know where we are at, before deciding to not worry right now.
Also, I think I'd kind of like to be able to monitor how long our builds take. I know I can go look, but I'd like something more push-oriented, i.e. an email gets sent if the build takes more than 10% longer than its current average.
Unrelated to the discussion above:
I don't think we're going to get around everything that ultimately needs to be done, but we might be able to make things faster on average just by re-organizing the steps of the workflow file. For example, if we know that we typically have HLint errors, then HLint should be the first thing run. If we know that compilation errors are also frequent, we can move compilation closer to the top. Similarly, if we know that we typically don't get to compiling TeX before the script errors out, then we can move the step that installs the TeX-related dependencies closer to the TeX steps.
Yes, making sure the build happens in 'optimal' order is a good idea. I think builds generally work, but hlint often fails. And good point about TeX.
We should remember @smiths' comment too. The TeX builds only need to run when the `stable/**.tex` files are changed.
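A hedged sketch of that check, comparing against the previous commit only (the right diff range depends on the triggering event, and the `make tex` target name is a placeholder):

```yaml
# Sketch: skip the TeX steps when no stable/**.tex files changed.
- uses: actions/checkout@v2
  with:
    fetch-depth: 2   # so HEAD^ exists for the diff below

- name: Check whether any stable/**.tex files changed
  id: texcheck
  run: |
    changed=$(git diff --name-only HEAD^ HEAD -- stable/ | grep -c '\.tex$' || true)
    if [ "$changed" -gt 0 ]; then
      echo "tex=true" >> "$GITHUB_OUTPUT"
    else
      echo "tex=false" >> "$GITHUB_OUTPUT"
    fi

- name: Build TeX artifacts
  if: steps.texcheck.outputs.tex == 'true'
  run: make tex   # placeholder target name
```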
And also remember that in that same issue, I pointed out precompilation for LaTeX headers, which might help too.
It might be worthwhile to revisit the table at the very top of this issue and make a new copy with current numbers.
I was looking at https://github.com/JacquesCarette/Drasil/pull/2760/checks?check_run_id=3323104120 "live", and some things really take a lot of time & I don't know why they do. For example, installing dependencies takes 5 minutes; the question would be, why are we asking for so many things that are not up-to-date in standard images?
Another is that it takes longer to build the docs than the code build itself, which seems quite odd. That warrants investigation too.