NERC-CEH / dri_gridded_data


Add an end to end test #3

Open mattjbr123 opened 1 month ago

mattjbr123 commented 1 month ago

Is the output zarr dataset the same as the input netcdf dataset?

Exact method to do this is TBD.

mattjbr123 commented 1 month ago

Ignore the above commits and comments, I think they've been assigned to the wrong issue...

mattjbr123 commented 3 weeks ago

We want to compare the data in the input netcdf file(s) to the output zarr dataset to ensure they are the same.

TL;DR

If we compare the datasets fully/completely, one major issue is their size: potentially multi-TB. Would it be possible to use hashing or some other simple calculation to get around this? That probably still has the problem of being a computationally expensive operation and needing to read in all the data anyway. This must be an issue the EIDC team face and have solutions for; @phtrceh would you be able to advise? Maybe we do just compare the datasets in chunks/slices. It'll still take a while, but probably not be as computationally expensive as calculating a parameter/hash from the data. Given we'd probably still want to use Beam to parallelise this as much as possible, we could build it into the conversion pipeline itself somehow.
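Not something decided yet, but a minimal sketch of what the chunk/slice comparison could look like with xarray and dask. The paths, the `time` dimension name and the block size are all placeholders:

```python
# Chunk-wise comparison of the input NetCDF file(s) and the output Zarr store.
# Lazy opening means only one block is in memory at a time.
import numpy as np
import xarray as xr

nc_path = "input/*.nc"              # hypothetical input location
zarr_path = "output/dataset.zarr"   # hypothetical converted output

ds_nc = xr.open_mfdataset(nc_path, combine="by_coords", chunks={"time": 100})
ds_zarr = xr.open_zarr(zarr_path)

step = 100  # number of time steps compared per block
for var in ds_nc.data_vars:
    a, b = ds_nc[var], ds_zarr[var]
    assert a.shape == b.shape, f"shape mismatch for {var}"
    for start in range(0, a.sizes["time"], step):
        sl = slice(start, start + step)
        # assert_array_equal treats NaNs in matching positions as equal
        np.testing.assert_array_equal(a.isel(time=sl).values,
                                      b.isel(time=sl).values)
```

Each block comparison is independent, so the loop could in principle be turned into a Beam transform and folded into the conversion pipeline.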

Then there is the issue of where we want to run the test. If we want to run it via GitHub Actions/CI, we'd need to link to whatever HPC or HPC-like environment we are running the conversion on and run the test there, unless we get the pipeline to calculate a number that somehow represents each whole dataset, in which case the comparison is trivial and can run directly on a teeny tiny instance on GitHub.
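As an illustration of the "one number per dataset" idea, a rough sketch of reducing a dataset to a single digest that a lightweight CI job could compare. The dimension name, block size and function name are assumptions, and it only works if both sides hash the data in exactly the same order, dtype and encoding:

```python
# Stream each variable block by block and fold it into a single SHA-256 digest.
import hashlib
import xarray as xr

def dataset_digest(ds: xr.Dataset, dim: str = "time", step: int = 100) -> str:
    """Return one hex digest representing every data variable in ds."""
    h = hashlib.sha256()
    for var in sorted(ds.data_vars):          # fixed order so digests match
        for start in range(0, ds.sizes[dim], step):
            block = ds[var].isel({dim: slice(start, start + step)}).values
            h.update(block.tobytes())          # stable byte view of the block
    return h.hexdigest()

# On the HPC side: compute digests for both datasets after conversion.
# On GitHub: compare the two hex strings, which is trivial on a small runner.
```

Note this still has to read all the data once on the HPC side; the saving is only that the CI job itself becomes a trivial string comparison.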

Another issue is that we cannot store the data on GitHub, as again it is too big. A potential way around this would be to upload the original and converted datasets to an object store and read them from there. We'd need a way to safely store the credentials for the object store, but I feel like this is a problem that has already been solved elsewhere (e.g. the time series FDRI product?). Eventually we will not need to upload the converted data to object storage as a manual step, as it will be done as part of the Beam pipeline anyway, but that requires moving off the DirectRunner, which means creating a Flink or Spark instance for the Beam Flink or Spark runners to use.
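For the credentials question, one option (not confirmed anywhere in this repo) is to inject them as environment variables from GitHub Actions secrets and open both datasets straight from the object store. The bucket names, endpoint URL and variable names below are made up:

```python
# Open the converted Zarr store and an original NetCDF file from an
# S3-compatible object store, with credentials taken from the environment
# (e.g. populated from GitHub Actions secrets).
import os
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(
    key=os.environ["OBJECT_STORE_KEY"],
    secret=os.environ["OBJECT_STORE_SECRET"],
    client_kwargs={"endpoint_url": "https://object-store.example.ac.uk"},
)

# Zarr maps cleanly onto object storage via an fsspec mapper.
ds_zarr = xr.open_zarr(fs.get_mapper("converted-bucket/dataset.zarr"))

# NetCDF4/HDF5 files can be read through a file-like object with h5netcdf.
with fs.open("original-bucket/input_0001.nc") as f:
    ds_nc = xr.open_dataset(f, engine="h5netcdf")
```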

Lots of questions!!