Why?
Cast shadow calculations are not "local" (pixel-by-pixel), so they need to load surrounding pixels too
Because of the 15 m panchromatic band in Landsat imagery, this means about 7 GB of in-memory data
But on both AWS and the NCI, the standard allocation is about 4 GB of memory per CPU, which means we essentially pay twice the compute cost
Strangely, the underlying code was carefully written to operate on smaller slices, but somehow something got lost in translation
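Very rough, illustrative arithmetic behind the ~7 GB figure (the pixel count and the number of arrays held at once are assumptions of mine, just to show the order of magnitude, not numbers from the actual run):

```python
# the 15 m panchromatic grid of a Landsat scene is on the order of
# 15,000 x 15,000 pixels (assumed here for illustration)
side = 15_000
gib_per_array = side * side * 8 / 2**30   # ~1.7 GiB per float64 array

# holding roughly four such full-resolution arrays at once already gets
# you into ~7 GB territory
print(gib_per_array, 4 * gib_per_array)
```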
Changes:
Emulate the outer for loop in the Fortran file cast_shadow_main.f90 in our Python code
[in Fortran for is called do because why not]
This lets us read only a horizontal slab and pass it on to cast_shadow_main.f90, which handles it correctly (see the sketch after this list)
[so the outer do loop there is now useless but 🤷🏽♂️]
Why not do it for the inner loop as well? Because, for unknown reasons, the zmax and zmin values are calculated per horizontal slab and not per inner block
Remove the Fortran code for non-UTM calculations, which was never fully implemented anyway
This simplifies the Fortran function interface somewhat and lets us avoid useless spheroidal calculations (although those had little performance impact anyway)
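For reference, here is a minimal sketch of what the slab loop looks like on the Python side. The names (`cast_shadow_by_slabs`, `cast_shadow_slab`, `pad`, `n_slabs`) and the call signature are made up for illustration and do not match the real wrapper or the f2py interface of cast_shadow_main.f90:

```python
import numpy as np

def cast_shadow_by_slabs(dem, n_slabs, pad, cast_shadow_slab):
    """Emulate the outer Fortran `do` loop in Python (illustrative only).

    `dem` is the full elevation array, `pad` is how many extra rows a slab
    needs above/below it (the calculation is not local, so shadows can be
    cast from outside the slab), and `cast_shadow_slab` stands in for the
    f2py-wrapped Fortran routine.
    """
    rows, cols = dem.shape
    slab_rows = -(-rows // n_slabs)  # ceiling division
    mask = np.zeros((rows, cols), dtype=np.uint8)

    for start in range(0, rows, slab_rows):
        stop = min(start + slab_rows, rows)

        # read the slab plus its padding rows; in the real code this would
        # be a windowed read from disk, not a slice of an in-memory array
        lo, hi = max(start - pad, 0), min(stop + pad, rows)
        slab = dem[lo:hi, :]

        # zmax/zmin are computed per horizontal slab (not per inner block),
        # which is why only the outer loop is emulated in Python
        zmax, zmin = slab.max(), slab.min()

        # the Fortran routine now only ever sees one slab at a time
        slab_mask = cast_shadow_slab(slab, zmax, zmin)

        # keep the rows that belong to this slab, dropping the padding
        mask[start:stop, :] = slab_mask[start - lo:stop - lo, :]

    return mask
```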
Now the maximum resident memory is about 2 GB! (peak RSS; see the measurement snippet at the end)
Of course, we have to roll it out to AWS to actually save some money.
This refactor does not change any outputs.
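If you want to check the resident-memory numbers yourself, `/usr/bin/time -v` reports the peak RSS of a run, or you can read it from inside the process. This is a generic measurement snippet, not part of the change:

```python
import resource

# peak resident set size of the current process; on Linux ru_maxrss is
# reported in kilobytes (on macOS it is in bytes)
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"max resident memory: {peak_kb / 2**20:.2f} GiB")
```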