JacquesCarette / Drasil

Generate all the things (focusing on research software)
https://jacquescarette.github.io/Drasil
BSD 2-Clause "Simplified" License

Investigate what takes so long in full builds #2789

Open · JacquesCarette opened this issue 3 years ago

JacquesCarette commented 3 years ago

I was looking at https://github.com/JacquesCarette/Drasil/pull/2760/checks?check_run_id=3323104120 "live", and some things really take a lot of time & I don't know why they do. For example, installing dependencies takes 5 minutes; the question would be, why are we asking for so many things that are not up-to-date in standard images?

Another is that it takes longer to build the docs than the code build itself, which seems quite odd. That warrants investigation too.

balacij commented 3 years ago

Sounds good. I think I can contribute some things that I already know about these steps.

This will primarily target the "Build" workflow.

| Step | Approx. time spent | Discussion |
| --- | --- | --- |
| Install system requirements | ~2 min | Downloads all system dependencies (compilers for artifacts, TeX, fonts, etc.). I couldn't figure out how to cache it properly without breaking post-install steps. |
| Install Stack | <1 s | - |
| Cache dependencies | ~2 min on a cache hit, ~1 s otherwise | The main issue is how often it doesn't get a cache hit. Needs investigation into why/why not. |
| Clean previous run | ~1 min | Cleans the previous run, but Cabal also updates its cache here; this may need investigation into why that isn't being cached too. |
| Update PATH | <1 s | Adds the Stack exe folder to $PATH. |
| HLint | ~50 s | Downloads HLint (fast) and runs it (slow). I don't think there's much we can do here unless we remove it from the "Build" workflow and leave it only in "Test". |
| Install dependencies | ~5 min | Installs Haskell package dependencies. Needs investigation into which ones we actually need. I know that lens takes a long time to compile (nowhere near 5 min, though!). |
| Build | ~3 min | We will need to analyze our construction plan; we might be able to reduce it a bit by exporting more files in each package and by adding extra threads (though GH Actions runners only have 2c/2t). |
| Stable | ~1 min | We could merge this with the above, but it also does extra testing in "Build". Perhaps we can remove it from "Build" in favour of leaving it only in "Test"? |
| TeX | ~3 min | We might be able to run these concurrently by scheduling them as background jobs, though it might then be harder to figure out which artifacts broke. |
| Code | ~1 min | Compiles the generated code; might be able to do the same as "TeX". We might be able to remove it from "Build" entirely, leaving it only in "Test". |
| Doxygen | ~10 s | - |
| Docs | ~5 min | Takes a long time because it has to re-compile the originally compiled code with Haddock docs (it reuses the original build, so that part is fast; most of the time is spent in Haddock generation), and then has to recompile again, from a clean slate this time (which takes a long time), to generate the smaller version without all-exposed modules. |
| Graphs | ~20 s | - |
| Analysis | ~2.5 min | @Ant13731 would probably be best placed to speak on this. |
| Build website | ~30 s | The code is compiled alongside the "Build" step and just executed here, so it's pretty fast. However, we might be able to make it a bit faster by removing file copying and building in place (right now, we build artifacts in their own directories and then copy them over to a central "deploy/" folder for the "Deploy" step). |
| Deploy | ~20 s | Posts the website to the gh-pages branch (for GitHub Pages). Needs further investigation to see if we want to bring it down further. |
| POST Cache dependencies | ~1 min | Alongside "Clean previous run", this can be investigated to see if we can strip it down further. |

We might want to move some tests to just the "Test" workflow; we would just need to decide which ones.

Also, for the "Test" workflow, we can probably trim it down by ignoring the second "Haddock docs" build as the first one is fast and does the fully-exposed modules variant (which would catch all bad Haddock comments anyways).

balacij commented 3 years ago

Related to #2642 as well, since we might need to rearrange some of our make targets to assist with this. I think there are a few things we can remove entirely (such as checking the Stack version; I think Stack does that automatically anyway).

Ant13731 commented 3 years ago

Analysis takes a few minutes mostly because of the size and number of dot graphs. The files (especially the lang, code, and gool packages) take a while to compile. The dot graphs are also generated using two different algorithms: the circular-based one is a little easier to read for larger packages, while the normal dot algorithm shows the hierarchy a little more clearly.

The same goes for the graphs target, but we could definitely reduce it a bit. Right now it generates the module dependency graphs as SVG, PDF, and PNG files. The PNG is needed to display the module dependency graph inside the website (the SVG graph was too big and didn't align properly). Right now, PDFs are used for the links to each graph, but I think it would be nice to have the SVG format appear on the website as well. I'll have to see whether the SVG versions work; otherwise we can just remove them from the graphs target.
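To illustrate what the graphs target is spending its time on, here is a rough sketch of the renders involved (file names and the exact invocations are placeholders; the circular layout may come from circo/twopi or from the graphviz Haskell library rather than the CLI):

```yaml
- name: Render module dependency graphs
  run: |
    # Hierarchical layout: shows the package hierarchy more clearly.
    dot   -Tpdf drasil-lang.dot -o drasil-lang-hierarchy.pdf
    dot   -Tpng drasil-lang.dot -o drasil-lang-hierarchy.png
    # Circular layout: easier to read for the larger packages.
    circo -Tsvg drasil-lang.dot -o drasil-lang-circular.svg
```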

JacquesCarette commented 3 years ago

If you could actually investigate why some of the bits are slow, that would be good. At times, it might be worth posting bug reports in the appropriate place (for Haddock, say). Going parallel isn't nearly as good a solution as actually cutting the time down. Things like Cabal updating its cache seem like a waste of time; we should find a way around that. That our build takes 3 minutes? That seems fair.

balacij commented 3 years ago

Mostly complete information on the dependencies (just need to check out containers a bit more):

From Stackage:

From Hackage:

Expected bundled with GHC (at least):

JacquesCarette commented 3 years ago

I thought there were pre-compiled versions of stuff on stackage? Yes, we need lens, but do we need to compile it every time?

balacij commented 3 years ago

I don't believe there are pre-compiled libraries (or executables) from Stackage. It seems it's always just source code that we build locally and cache in ./.stack-work/ (for local packages) and ~/.stack/ (for the dependencies from Stackage). The CI currently tries to cache these directories and load the cache from its previous run on a branch. When it gets a cache hit (a successful cache load), we don't need to compile lens and the other dependencies. However, when the CI fails to get a cache hit (which happens often), we have to download all the dependencies and recompile them. The cache hit frequency is the concern here: without a cache hit, the CI generally takes about 10 minutes longer.

I'm not entirely certain why the cache hits fail as often as they do, but I suspect one of the causes is failed CI runs. Whenever a CI run loads a cache, the cache appears to be deleted. If the run then succeeds, the cache is placed back into GitHub's cache storage; if it fails, the cache is never put back.
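For context, the caching step in the workflow is roughly of this shape (a sketch only; the exact paths and key in our Test.yaml differ):

```yaml
# Cache Stack's global store and the project-local build directory, keyed on
# the files that determine the dependency set.
- name: Cache dependencies
  uses: actions/cache@v2
  with:
    path: |
      ~/.stack
      .stack-work
    key: ${{ runner.os }}-stack-${{ hashFiles('**/stack.yaml', '**/package.yaml') }}
```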

JacquesCarette commented 3 years ago

Does our code need to do anything explicit to do the caching? What I mean is: to put the cache back, is there something in our scripts that does that? If so, maybe it can be moved to right after the successful build of the dependencies, rather than the end?

balacij commented 3 years ago

That's interesting. I don't know if we can.

In this snippet, we set up the cache using actions/cache@v2, and it places a POST hook at the end of the entire CI job to put the cache back. https://github.com/JacquesCarette/Drasil/blob/8bbae7e1dfa9a00d757c740a6f5f9724c9f1cdc0/.github/workflows/Test.yaml#L40-L49

I didn't find any documentation about moving the cache hook up in priority. It would definitely be good if we could do that! I'll try to look around.

balacij commented 3 years ago

Another issue that slows down deployment is that, in the CI, the "Clean previous run" step (which just runs `make clean`) starts updating the Cabal cache.

Example run where this occurs: https://github.com/JacquesCarette/Drasil/runs/3413942881

(Screenshot from 2021-08-24 13-52-13, showing the Cabal cache update during the clean step.)

JacquesCarette commented 3 years ago

Maybe ideas cribbed from something like https://glebbahmutov.com/blog/parallel-cypress-tests-gh-action/ could help?

On cabal update: that seems like it could be a "stack clean" bug. It looks like the extra-deps cause Stack to download things (it shouldn't), even though the source https://github.com/commercialhaskell/stack/blob/master/src/Stack/Clean.hs makes it clear that the first thing it is going to do next is ignore all the extra-deps!

balacij commented 3 years ago

After reading around the GitHub Actions docs, it appears we can only upload artifacts and download them from within the same workflow run; I guess the feature is only meant for sharing them between jobs. So it seems we're forced to use the GitHub cache mechanism. However, it does appear that we can add backup restoration keys (e.g., instead of looking only for "X", we can list "X, Y, Z" in order, and it will fall back through them when looking for a cache).
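A minimal sketch of those backup restoration keys with actions/cache (the path and key structure here are placeholders, not our actual configuration):

```yaml
- name: Cache dependencies
  uses: actions/cache@v2
  with:
    path: ~/.stack
    key: ${{ runner.os }}-stack-${{ hashFiles('**/stack.yaml') }}
    # If the exact key misses, fall back to the newest cache whose key
    # matches one of these prefixes, in order.
    restore-keys: |
      ${{ runner.os }}-stack-
      ${{ runner.os }}-
```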

Regarding the stack clean: does it need the list of extra-deps paths before it runs that code? The stack clean issue I brought up might actually have been caused by the cache miss again. I think that CI run would have had to update the Hackage index somewhere anyway so that it could download multiplate and the other extra-deps.

balacij commented 3 years ago

I think I've found one of the main culprits for the cache misses: caches aren't shared between workflows. Specifically, despite sharing the same keys, the caches in the Build and the Test workflows aren't shared. Source.

Caches are only shared within a single workflow, across the branches in its history. For example, if I have a branch X based on master, then a workflow that runs on both master and X may, when running on X, use the caches it built on master. Additionally, if we have a branch Y that branches off of X, it may use the caches built from X.

Since we currently have Build, which only runs on master, and Test, which only runs on non-master branches, a non-master branch's first CI run can't use the caches that Build created on master. However, once that new branch has had a CI run, its caches can be used by its subsequent runs and by branches based off of it.

Potential solutions:

| Resolution | How it works | Pros | Cons |
| --- | --- | --- | --- |
| Share the same workflow file | Since we would only have one workflow file, the main cache would be carried on master, and all branches based on it would be able to access it for their CI runs. | - | Having differences between the steps in the "Build" and "Test" phases will be difficult/messy. The "Deploy" job/component might need to duplicate everything in the normal "Test" and "Build" workflow. |
| Run the Test workflow on the master branch 1-2x/week (to generate a cache for Test on master; see the sketch after this table) | We would build a cache on a weekly basis for the master branch on the Test workflow. Then all branches based on master would be able to access an up-to-a-week-old cache (age doesn't matter here, I think). | Contributes to another ticket where we run tests on master as well (#2698). | Arguably a hacky way to get caching on master for the Test workflow. |
| Use a different workflow setup (e.g., master <- dev <- feature) | Build would remain the workflow for master, and Test would be the workflow that builds caches for dev and for branches based on dev. | Potentially good for a release setup: active development happens on the dev branch, and whenever we want a tagged release of the software, we create a PR to master and issue a release. I don't think the changes to the workflow files would amount to much, if anything. | Releases to master would happen infrequently (only if they happened at least 1x/week would a cache hit always occur for the Build workflow on master). Extra step in the workflow setup. If no PRs are merged into dev at least 1x/week, the cache on dev will be evicted and the timings will revert. |
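For the second option, the scheduled run would just be an extra trigger on the Test workflow. A minimal sketch (the cron spec is only an example, not a decided schedule):

```yaml
# Added alongside the existing "on:" triggers of Test.yaml.
on:
  schedule:
    # 06:00 UTC on Mondays and Thursdays, i.e. roughly 2x/week.
    - cron: '0 6 * * 1,4'
```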

The first solution is my preferred ("personally optimal") one: we would just append extra steps to the "Test" workflow to turn it into "Build" (essentially just releasing and pushing to GH Pages). However, it seems really difficult to do with the way GitHub Actions workflows work, so I think we might need to rule that one out.

However, all three are still imperfect solutions. Whenever a CI run is in progress and we trigger another one over it, the cache consumed by the first run is erased and there is no cache for the second. This is because we cancel the first run before starting the second (by choice; we don't strictly need to, but it spares us build minutes), thereby also cancelling the re-upload of the cache, which normally happens at the end of a run.
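For reference, this cancel-the-previous-run behaviour corresponds to GitHub's built-in concurrency groups; a sketch of the general mechanism (our actual cancellation setup may be wired differently):

```yaml
# Cancel any in-progress run for the same workflow and branch before
# starting a new one; this is what discards the pending cache upload.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```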

JacquesCarette commented 3 years ago

Hmm, it seems I don't understand our full setup well enough to have an informed opinion on this. I think we might need a quick meeting to discuss, as that's likely to be a more efficient way to fill in my knowledge gaps.

balacij commented 3 years ago

Regarding lens, which is a fairly large dependency that takes a lot of time to compile: there is a lightweight variant called "microlens" that claims to be fully compatible with lens's core. I'm not quite sure whether it suits our needs, but I recall @JacquesCarette mentioning that we only use a few core pieces of lens. If that's the case, we might be able to use that library instead. The micro version also appears to have far fewer dependencies.

JacquesCarette commented 3 years ago

It most likely is compatible, given our fairly small use of features. Plus, we should have scheduled that meeting...

balacij commented 3 years ago

Yes, we should have that meeting. Is tomorrow at 11am/4pm alright?

JacquesCarette commented 3 years ago

Let's do 11am.

balacij commented 3 years ago

Sounds good.

balacij commented 3 years ago

With #2831 and #2833, we get many more cache hits in our logs and do less needless Haddock building. Build times are faster because of these changes.

Now, taking a look at https://github.com/JacquesCarette/Drasil/runs/3571294713 , I think we can search through the logs for errors to spot areas that could be slowing us down. For example, there are LaTeX errors, three LaTeX compiles per SRS (I wonder if we can get away with just two?), Haddock linking errors (warnings), and the Build step currently runs "stack install" individually on each package (I think we could change that to one big stack build, sketched below; might that help?). I'd also like to add a bit more logging to the "Analyze Drasil Code" step so that we can tell what's going on from the PRs. Is the "Analyze Drasil Code" step required for normal PR tests? Is it a program that could error out/cause problems if we removed it from the standard PR tests?
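For the single-build idea, something along these lines (a sketch; the bin path is a placeholder):

```yaml
# One build plan covering every local package, instead of running
# "stack install" once per package; --copy-bins with --local-bin-path
# mirrors what "stack install" does with the resulting executables.
- name: Build
  run: stack build --copy-bins --local-bin-path ./bin
```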

balacij commented 3 years ago

It would be nice if we could skip building the TeX and software artifacts when we can determine that the related code didn't change in a PR/push.

JacquesCarette commented 3 years ago

Right now, the main thing that stands out to me is 'Analyze Drasil Code', which takes way longer than I expected. I wonder if it is inefficient.

Then indeed, the latex. A couple of days ago I discovered Ultrafast PDFLatex with precompiling which seems very promising!

balacij commented 3 years ago

> Right now, the main thing that stands out to me is 'Analyze Drasil Code', which takes way longer than I expected. I wonder if it is inefficient.

With #2838, I realized that the majority of the time is spent converting dot files into SVGs. I see why we need the SVGs for the website, but do we need them for testing? Perhaps we could just generate the initial dot files (to make sure our .hs files are still working) and skip the SVG conversion when we're only testing a PR's code?
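A sketch of that split ("graphs" is the existing make target; the SVG target name and the branch condition are placeholders):

```yaml
# Always generate the dot files so the .hs code that produces them is
# still exercised on every run...
- name: Generate dot files
  run: make graphs
# ...but only render the SVGs when we're about to deploy (e.g., on master).
- name: Render SVGs for the website
  if: github.ref == 'refs/heads/master'
  run: make graph-svgs
```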

> Then indeed, the latex. A couple of days ago I discovered Ultrafast PDFLatex with precompiling which seems very promising!

Interesting, this sounds good!

JacquesCarette commented 3 years ago

I agree, for testing we should be able to generate the dot files, but not the rest, since that's not really testing Drasil per se. Generating the SVGs can be done solely for deploy. Not doing something is the best way to make things faster!

balacij commented 1 year ago

I don't think the build times have been very problematic recently; cache hits are frequent. The main issue with our Actions seems to be GitHub network errors (lol).

Should we close this issue in favour of taking a look at it again when the build times become a bottleneck again, @JacquesCarette ?

Regarding the dependencies, we can investigate that in #2698 because we'll need to re-build them anyways for it.

JacquesCarette commented 1 year ago

So are the times still roughly on par with the table at the top of this issue? I'd be curious to know where we are at, before deciding to not worry right now.

Also, I think I'd kind of like to be able to monitor how long our builds take. I know I can go look, but I think I'd like something more push-oriented, i.e. an email gets sent if the build takes more than 10% longer than its current average.

balacij commented 1 year ago

Unrelated to the discussion above:

I don't think we're going to get around everything that ultimately needs to be done, but we might be able to make things faster on average just by re-organizing the steps of the workflow file. For example, if we know that we typically have HLint errors, then HLint should be the first thing run. If we know that compilation errors are also frequent, we can move compilation closer to the top. Similarly, if we know that we don't typically get to compiling TeX before the script errors out, then we can move the steps that install the TeX-related dependencies closer to the TeX steps.

JacquesCarette commented 1 year ago

Yes, making sure the build happens in 'optimal' order is a good idea. I think builds generally work, but hlint often fails. And good point about TeX.

balacij commented 1 year ago

We should remember @smiths' comment too. The TeX builds only need to run when the stable/**.tex files are changed.
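A sketch of how that gating could look, using the third-party dorny/paths-filter action (the glob and the "make tex" target name are placeholders):

```yaml
- uses: dorny/paths-filter@v2
  id: changes
  with:
    filters: |
      tex:
        - 'stable/**/*.tex'
# Only build the TeX artifacts when a stable .tex file changed.
- name: TeX
  if: steps.changes.outputs.tex == 'true'
  run: make tex
```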

JacquesCarette commented 1 year ago

And also remember that in that same issue, I pointed out 'precompilation' for LaTeX headers, which might help too.

It might be worthwhile to revisit the table at the very top of this issue and make a new copy with current numbers.