What does this PR do?

GitHub Workflow:
- Uses caching by pulling from an image built on main and from an image based on the current PR. The latter helps when the PR bumps dependencies that diverge from main.
- These caching strategies are not yet relevant since we only use a single builder node. Once we scale to two nodes, they will be highly beneficial given the expensive build time.
- Since we want NeMo-Aligner to be up to date, I feed the current SHA into the build as a build arg.
Dockerfile:
- To optimize layer caching, I move build args closer to their usage.
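
A minimal sketch of the build-arg placement, not the actual Dockerfile. The base image, the `ALIGNER_COMMIT` arg name, and the install path are assumptions for illustration. Every `RUN` after an `ARG` declaration sees that arg as an environment variable, so a changed value invalidates those layers; declaring the arg late keeps the expensive layers above it cached.

```dockerfile
# Sketch only: base image, ARG name, and paths are assumptions.
FROM python:3.10-slim

# Expensive, rarely changing steps come first so their layers stay cached.
RUN apt-get update \
    && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*

# Declare the ARG immediately before the step that needs it.
# A new commit SHA only invalidates the layers from here down.
ARG ALIGNER_COMMIT=main
RUN git clone https://github.com/NVIDIA/NeMo-Aligner.git /opt/NeMo-Aligner \
    && cd /opt/NeMo-Aligner \
    && git checkout "${ALIGNER_COMMIT}"
```

Locally this could be built with `docker build --build-arg ALIGNER_COMMIT="$(git rev-parse HEAD)" .`; in the workflow the value would come from the commit SHA of the run.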

There's a lot to do or at least to experiment with to reduce build time and size. Some things I have in mind:

- Moving less expensive installs like mcore, nemo, and nemo-aligner to the bottom.
- Parallelizing with multi-stage builds: I think this is the most difficult, since we need to know which files to copy over, but it also offers the highest gain in size and build time.
- Easier, but with less impact: reducing the number of layers to reduce size (since we avoid duplicated compilers/build tools that each overlayfs directory otherwise needs to track).
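
To make the multi-stage idea concrete, here is a rough sketch under assumptions: the image tag, stage names, and the two placeholder packages are hypothetical and stand in for our real, much heavier builds. BuildKit can build stages that do not depend on each other in parallel, and the final stage copies only the built artifacts.

```dockerfile
# Sketch only: image tag, stage names, and packages are placeholders.
FROM python:3.10-slim AS base

# Independent stages: BuildKit can build these in parallel.
FROM base AS build-a
RUN pip wheel --no-cache-dir --wheel-dir /wheels numpy

FROM base AS build-b
RUN pip wheel --no-cache-dir --wheel-dir /wheels pyyaml

# Final stage: copy only the artifacts, so compilers and build tools
# installed in the build stages never reach the final image.
FROM base AS final
COPY --from=build-a /wheels /tmp/wheels-a
COPY --from=build-b /wheels /tmp/wheels-b
RUN pip install --no-cache-dir /tmp/wheels-a/*.whl /tmp/wheels-b/*.whl
```

The hard part mentioned above is still open in this sketch: for builds that do not produce a clean wheel, we would need to work out exactly which files each `COPY --from` has to bring over.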

However, I don't think these optimizations are trivial, so the next step is to focus on the test workflow before circling back to the build process.

Changelog

Please update the CHANGELOG.md under next version with high level changes in this PR.

# Add a code snippet demonstrating how to use this

Pre checks:

- [ ] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation? Make sure to also update the NeMo Framework User Guide which contains the tutorials