TODO for getting en22 done.
en22 Production pipeline
en22 Orchestrator
This is a piece of software that handles:
Assignment of proper start & end dates (5-day buffer in both directions, i.e. 460 days downloaded in total; sketched below...)
run minicube creation for a given minicube index i
run cloud mask creation for all minicubes (gets the Sentinel-2 SCL, adds sen2flux, adds the Planetary Computer layers, i.e. the ALOS & COP30 DEMs...)
merge minicube + cloud mask, do the temporal post-processing
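A minimal sketch of the date assignment, assuming the sample itself spans ~450 days and the orchestrator adds a 5-day buffer on both sides; the function name and constants are illustrative, not the actual orchestrator code:

```python
import pandas as pd

# Illustrative date assignment: the sample spans ~450 days and we add a
# 5-day buffer in both directions, so 460 days get downloaded in total.
SAMPLE_DAYS = 450
BUFFER_DAYS = 5

def assign_dates(sample_start: pd.Timestamp):
    """Return the (download_start, download_end) window for one minicube."""
    download_start = sample_start - pd.Timedelta(days=BUFFER_DAYS)
    download_end = sample_start + pd.Timedelta(days=SAMPLE_DAYS + BUFFER_DAYS)
    return download_start, download_end

start, end = assign_dates(pd.Timestamp("2018-06-01"))
assert (end - start).days == 460  # 5 + 450 + 5
```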
en-minicuber
Handles the download of a single minicube (a rough sketch of such an interface is below)....
TODO: remove the create-minicubes-from-dataframe function?
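The interface could look roughly like this; `load_minicube` and its arguments are hypothetical stand-ins, not the actual en-minicuber API, and the body only allocates the container the real downloader would fill:

```python
import numpy as np
import pandas as pd
import xarray as xr

def load_minicube(lon, lat, start, end, size=128):
    """Hypothetical sketch: download all layers for a single minicube.
    The real code would query the data providers; here we only build
    the empty xarray container such a download would fill."""
    time = pd.date_range(start, end, freq="5D")  # ~Sentinel-2 revisit
    data = np.full((len(time), size, size), np.nan)
    return xr.Dataset(
        {"s2_B02": (("time", "y", "x"), data)},
        coords={"time": time},
        attrs={"lon": lon, "lat": lat},
    )

cube = load_minicube(3.1, 43.5, "2018-06-01", "2019-09-03")
```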
workflow
The orchestrator defines the locations & the train/test split of en22.
We run the cloud mask generation locally for all of them + save the resulting NetCDF files to our Nextcloud (or to AWS S3...).
Upload ERA5, Geomorphons & SoilGrids to AWS S3.
On AWS: run the minicube generation in parallel for many minicubes. Each generation pulls the AWS data + merges it with our cloud mask NetCDF + saves into the Radiant MLHub S3 bucket (see the merge sketch below).
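A sketch of that per-minicube merge step, assuming xarray + s3fs; the file names and the bucket are placeholders:

```python
import s3fs
import xarray as xr

# Placeholder paths: one cube generated on AWS, one locally made cloud mask.
cube = xr.open_dataset("minicube_0042.nc")
cloudmask = xr.open_dataset("cloudmask_0042.nc")

# Merge the two datasets along their shared coordinates.
merged = xr.merge([cube, cloudmask])
merged.to_netcdf("minicube_0042_merged.nc")

# Upload the result; the bucket name stands in for the Radiant MLHub target.
fs = s3fs.S3FileSystem()
fs.put("minicube_0042_merged.nc", "s3://mlhub-bucket/en22/minicube_0042.nc")
```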
Running on AWS - Steps
[ ] Set up login credentials for EC2
[ ] Create an AMI (machine image) with Anaconda & the Python packages installed & save it for quick loading
[ ] Get one instance running
[ ] Figure out how to access S3 storage from there (see the boto3 sketch after this list)
[ ] Run generate_en22 for one sample + save it to S3 storage
[ ] Measure how long it takes + the compute costs + the data transfer costs
[ ] Figure out how to scale this up // launch many processes at once
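For the S3-access item, a minimal boto3 check might look like this, assuming the EC2 instance profile (IAM role) already grants S3 permissions; the bucket and keys are placeholders:

```python
import boto3

s3 = boto3.client("s3")  # credentials come from the EC2 instance profile
BUCKET = "en22-staging"  # placeholder bucket name

# Upload one generated sample and list the bucket contents back.
s3.upload_file("minicube_0000.nc", BUCKET, "minicubes/minicube_0000.nc")
for obj in s3.list_objects_v2(Bucket=BUCKET).get("Contents", []):
    print(obj["Key"], obj["Size"])
```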
AWS Cost Estimate
Can run 1 train minicube on a t3.medium, but close to the RAM limit (3.6 of 4 GB used) -> so to be safe: t3.large (8 GB)
Takes ~10-15 min/minicube on us-west-2
or ~8-10 min/minicube on af-south-1
-> ~1750 compute hours per 10k train minicubes
-> 0.0542$/h for a t3.medium (4 GB)... RAM-wise not super safe, but should be ok...
-> ~100$ per 10k train minicubes
1 minicube = ca. 20 MB of disk space
-> 10k minicubes = ca. 200 GB
-> outgoing data transfer at 0.02$/GB
-> 4$ per 10k train minicubes
1 test minicube costs about as much as 3.5 train minicubes
Overall (checked in the sketch below):
-> ~100$ per 10k train minicubes
-> ~200$ for 5k test minicubes
-> 1000$-1500$ in total??
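A back-of-the-envelope check of those numbers, with the rates as quoted above and ~10.5 min per cube as a rough midpoint of the timings:

```python
minutes_per_cube = 10.5   # midpoint-ish of the timings above
n_train, n_test = 10_000, 5_000

compute_hours = n_train * minutes_per_cube / 60     # ~1750 h
compute_cost = compute_hours * 0.0542               # ~95$ (t3.medium rate)
transfer_cost = n_train * 20 / 1000 * 0.02          # 20 MB/cube -> ~4$
train_cost = compute_cost + transfer_cost           # ~100$ per 10k train

test_cost = n_test * 3.5 / n_train * train_cost     # ~175$ for 5k test
print(f"{compute_hours:.0f} h, train ~{train_cost:.0f}$, test ~{test_cost:.0f}$")
```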
TODO
[x] Sentinel-2 SCL as backup cloud mask...
[x] Check whether the cloud mask (Sentinel-2 SCL + sen2flux) can be downloaded separately without bugs...
[x] Revise the minicube locations: look at Jeran's viz and decide if we like it or not. Also look at a Leaflet map similar to https://cloudsen12.github.io/map/ with the minicube locations. Remove directly neighboring locations to have less spatial replication...
[x] Split into train set & test set locations: first select the test set locations, then put a buffer around them, then select the train set locations
[x] Sample the train set starting week randomly
[ ] Implement proper timing for train (or any) samples: first Sentinel-2 obs at t=5, last date at t=450 (-> i.e. we don't really need the s2_avail mask)
[ ] Consider using a global lock for calls to rate-restricted APIs (GEE, PC...) (see the lock sketch after this list)
[x] Confirm with Fabian whether we can get "his" potential evaporation...
[x] Confirm with David on the further sen2flux strategy
[ ] Benchmark the minicube generation code that should run on AWS: 1) runtime, 2) data transfer, 3) memory requirements
[ ] Estimate the amount of S3 space needed: 1) ERA5, Geomorphons, SoilGrids, 2) the locally generated cloud mask NetCDFs, 3) the final dataset
[ ] Learn what has to be done so that the code runs on AWS against the different buckets, DEAfrica // Element84 (+ Planetary Computer + GEE)
[ ] Learn how to save minicubes into the S3 bucket
[ ] Learn how to move the ERA5 / Geomorphons / SoilGrids data into AWS for temporary usage (?)
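A sketch of the global-lock idea from the list above, using one multiprocessing lock shared by all workers so calls to the rate-restricted endpoint never overlap; `fake_api_call` and the interval are stand-ins for the real GEE/PC request:

```python
import multiprocessing as mp
import time

MIN_INTERVAL = 1.0  # seconds between calls; placeholder rate limit

def fake_api_call(query):
    """Stand-in for the real rate-restricted request (e.g. a GEE/PC query)."""
    return f"response for {query}"

def worker(lock, query):
    with lock:  # only one process talks to the API at a time
        result = fake_api_call(query)
        time.sleep(MIN_INTERVAL)  # crude spacing between requests
    print(result)

if __name__ == "__main__":
    lock = mp.Lock()
    procs = [mp.Process(target=worker, args=(lock, f"minicube_{i}"))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```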