ECMWFCode4Earth / ml_drought

Machine learning to better predict and understand drought. Moving github.com/ml-clim
https://ml-clim.github.io/drought-prediction/

Review - Preprocessors #93

Open jwagemann opened 5 years ago

jwagemann commented 5 years ago
tommylees112 commented 5 years ago

Thanks for your questions!

The first thing to say is that all of these parameters are flexible: the pipeline lets you specify each of them as you require. We have made some initial choices for our current experiments, but these are only one possible set of parameter choices for the pipeline.

What resolution and extent did you use for the Unified Data Format?

We used an extent for Kenya defined by this bounding box:

    Region(name='kenya', lonmin=33.501, lonmax=42.283,
                  latmin=-5.202, latmax=6.002)

The resolution we are currently using is ~5 km, but we are in the process of changing this for our own experiments.
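
As an illustration of how a bounding box like this can be applied, here is a minimal sketch in plain Python. The `Region` definition below is a stand-in that assumes a namedtuple with the fields shown above, and `in_region` and `subset` are hypothetical helpers, not functions from the pipeline.

```python
from collections import namedtuple

# Stand-in for the pipeline's Region container (assumed to be a namedtuple);
# the field names follow the bounding box quoted above.
Region = namedtuple("Region", ["name", "lonmin", "lonmax", "latmin", "latmax"])

kenya = Region(name="kenya", lonmin=33.501, lonmax=42.283,
               latmin=-5.202, latmax=6.002)


def in_region(lat, lon, region):
    """Return True if the point lies inside the bounding box (edges included)."""
    return (region.latmin <= lat <= region.latmax
            and region.lonmin <= lon <= region.lonmax)


def subset(points, region):
    """Keep only the (lat, lon) points that fall inside the region."""
    return [(lat, lon) for lat, lon in points if in_region(lat, lon, region)]


points = [(-1.29, 36.82),   # Nairobi
          (9.03, 38.74),    # Addis Ababa, north of the box
          (-4.04, 39.67)]   # Mombasa
inside = subset(points, kenya)
```

With gridded data the same comparison would be applied to the latitude and longitude coordinate arrays rather than to individual points.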

The preprocessors use a reference grid. Have you considered using EPSG codes instead?

The reference grid is an existing .nc file that has the lat/lon resolution to which the user wants to map all other data. We have not considered using EPSG codes, but we would be interested in looking at this if you have a Python implementation of remapping NetCDF files using EPSG codes. Just to clarify, we are not transforming data between different projections; we are putting all data onto the same resolution.

What remapping method is used?

This can be flexibly specified by the user from the following (see here for explanations):

{'bilinear', 'conservative', 'nearest_s2d', 'nearest_d2s', 'patch'}

We used nearest_s2d.
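
To make the idea of a nearest source-to-destination remap onto a reference grid concrete, here is a rough sketch in plain Python. It is not the pipeline's implementation and uses no regridding library: `nearest_index` and `regrid_nearest` are hypothetical helpers, the grids are assumed to be regular lat/lon grids, and distance is measured in degrees rather than along the sphere.

```python
def nearest_index(coords, value):
    """Index of the entry in coords closest to value."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - value))


def regrid_nearest(src_lats, src_lons, src_values, ref_lats, ref_lons):
    """Map src_values (rows = lats, cols = lons) onto the reference grid.

    Each reference cell takes the value of the closest source cell, which is
    the idea behind a nearest source-to-destination remap on a regular grid.
    """
    rows = [nearest_index(src_lats, lat) for lat in ref_lats]
    cols = [nearest_index(src_lons, lon) for lon in ref_lons]
    return [[src_values[r][c] for c in cols] for r in rows]


# Coarse 2x2 source field mapped onto a finer 3x3 reference grid.
src_lats, src_lons = [0.0, 1.0], [35.0, 36.0]
src_values = [[1.0, 2.0],
              [3.0, 4.0]]
ref_lats, ref_lons = [0.0, 0.4, 1.0], [35.0, 35.6, 36.0]

remapped = regrid_nearest(src_lats, src_lons, src_values, ref_lats, ref_lons)
```

Nearest-neighbour remapping copies values rather than interpolating them, so the output contains only values that already exist in the source field.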

CDS longitude range is [0, 360], while many other data providers use the range [-180, +180]. Is the preprocessor automatically rotating the layers, if needed?

Yes, it is automatic.
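
For readers unfamiliar with this rotation, here is a minimal sketch of the usual conversion from [0, 360) to [-180, 180) in plain Python. The function names are hypothetical and this is not the preprocessor's code; it only shows the arithmetic and the re-sorting that keeps longitudes ascending.

```python
def wrap_longitude(lon):
    """Convert a longitude from [0, 360) to [-180, 180)."""
    return ((lon + 180.0) % 360.0) - 180.0


def rotate_to_signed(lons, values):
    """Re-express longitudes in [-180, 180) and re-sort so they stay ascending."""
    pairs = sorted(zip((wrap_longitude(lon) for lon in lons), values))
    new_lons = [lon for lon, _ in pairs]
    new_values = [val for _, val in pairs]
    return new_lons, new_values


lons = [0.0, 90.0, 180.0, 270.0]
values = ["a", "b", "c", "d"]
new_lons, new_values = rotate_to_signed(lons, values)
```

Note that a longitude of exactly 180 maps to -180 under this convention.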

How do you define how to do spatial aggregations for different variables? For instance for temperature you might want to use the mean, for precipitation you might want to use the sum (if you are converting to a coarser resolution). We can only see a MeanAggregator.

We are using the mean as a first implementation of the pipeline. It would be a quick fix to change this if required. However, it is worth noting that all values are normalized to have mean 0 and std 1 before they are input to the machine learning models. Because a sum over a fixed number of cells is just the mean times a constant, whether the data is aggregated with a sum or a mean doesn't make a difference to what the models see.
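
The normalization argument can be checked with a small sketch, assuming every aggregate covers the same number of fine-resolution cells. The data values and the `zscore` helper are made up for illustration.

```python
from statistics import mean, pstdev


def zscore(series):
    """Normalize a series to mean 0 and standard deviation 1."""
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma for x in series]


# Three time steps, each aggregated over the same four fine-resolution cells.
cells_per_step = [[1.0, 2.0, 3.0, 4.0],
                  [2.0, 2.0, 2.0, 2.0],
                  [5.0, 7.0, 6.0, 6.0]]

by_mean = [mean(cells) for cells in cells_per_step]
by_sum = [sum(cells) for cells in cells_per_step]

# The two aggregates differ by a constant factor of 4, which the
# normalization removes.
z_mean = zscore(by_mean)
z_sum = zscore(by_sum)
```

If the number of contributing cells varied between aggregates (for example because of missing data), the sum and the mean would no longer differ by a constant factor and the two normalized series could diverge.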