EcoExtreML / Emulator


area across 0/360 degree longitude #23

Open QianqianHan96 opened 10 months ago

QianqianHan96 commented 10 months ago

Hi Francesco, @fnattino

I saw that you keep [0,360] as the longitude standard during preprocessing and inference. I tried to do the same for the other variables (SSM, CO2, IGBP, hc, Vcmax) when preprocessing them. The only problematic one is SSM, because it is not global: it only covers Europe. I tried two ways to resample it.

1) Convert era5land from [0,360] to [-180,180], clip era5land to the Europe boundary, resample SSM, and then convert the resampled SSM from [-180,180] back to [0,360]. There is a weird line in the figure (cells 62 and 65). https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/4-SSM180.ipynb

2) Keep era5land as [0,360] and convert SSM from [-180,180] to [0,360]. Then I need to clip era5land with the Europe boundary, but the clipped result consists of two parts across the 0/360 degree longitude line, so I concatenate the two parts into one. When I plot it, it looks weird (cell 127). Based on this weird clipped era5land, the resampled SSM seems okay (cell 103), but when I export it to TIFF and open it in ArcGIS, it is still weird (longitude spans 0-230 degrees). https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/4-SSM360.ipynb

The resampled SSM from the two ways displays differently. But when I export the results to GeoTIFF and open them in ArcGIS, they are both wrong (like the third figure below). Do you have any experience with this? [image]

Then I tried a third way, the same as the first one but without clipping era5land:

3) Convert era5land from [0,360] to [-180,180], do not clip era5land, resample SSM, and then convert the resampled SSM from [-180,180] back to [0,360]. The result displays correctly, also in ArcGIS. https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/4-SSM180correct.ipynb

4) I also tried not clipping era5land in the second way, but the result is the same as in the figure above (third column).
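For reference, the [0,360] ↔ [-180,180] conversions used in these approaches can be sketched in xarray as follows (a minimal toy example, not the project's actual preprocessing code; the coordinate values are made up):

```python
import numpy as np
import xarray as xr

# toy global field on a [0, 360) longitude grid at 1-degree resolution
da = xr.DataArray(
    np.arange(180 * 360, dtype=float).reshape(180, 360),
    coords={"lat": np.arange(-90, 90), "lon": np.arange(0, 360)},
    dims=("lat", "lon"),
)

# [0, 360) -> [-180, 180): shift the coordinate, then sort it so it is monotonic
da_180 = da.assign_coords(lon=(da.lon + 180) % 360 - 180).sortby("lon")

# [-180, 180) -> [0, 360): the inverse shift
da_360 = da_180.assign_coords(lon=da_180.lon % 360).sortby("lon")
```

The `sortby` step is what makes the converted array plot correctly for global data; for a non-global extent, the sorted coordinate can still contain a gap, which is the problem discussed below.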

Another reason I am worried about this: when we run inference at a larger scale and need to select an area, should we avoid the 0/360 degree longitude line? I also saw that you select [0,125] degrees in the inference notebook.

fnattino commented 10 months ago

Hi @QianqianHan96,

My reasoning behind keeping the ERA5 longitude in the [0;360] range was that ERA5 was by far the largest dataset, so I thought it made more sense to adapt the other datasets to it. When I selected the range [0;125] degrees in the inference notebook, I just wanted to test whether I could handle that spatial extent; I did not really care about the specific region selected.

Concerning the problems you observe at points 1., 2. and 4. above, I think they are all related to the same issue: the longitude coordinates of your final DataArray are not continuous (they have a "gap"). When you bring a dataset that covers only Europe from the [-180;180] range to the [0;360] range, you end up with a longitude coordinate array that looks something like: [0, 1, 2, ..., 48, 49, 50, 300, 301, 302, ..., 358, 359]. The data per se is not wrong, but when plotting it, the value at x=50 is stretched over the first half of the missing range and the value at x=300 over the remaining part. Saving an array with such coordinates to a GeoTIFF is also likely to give rise to problems (which is why the file shows wrong data when you open it again).
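The gap can be made concrete with a tiny numpy-only illustration (the European extent below is made up):

```python
import numpy as np

# Europe-only longitudes, here roughly [-30, 50] in the [-180, 180) range
lon = np.arange(-30, 51)

# bring them to [0, 360) and sort: the coordinate is no longer continuous,
# it reads [0, 1, ..., 50, 330, 331, ..., 359]
lon_360 = np.sort(lon % 360)

# the largest step between neighbouring longitudes is the gap: 330 - 50 = 280
gap = int(np.diff(lon_360).max())
```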

A possible solution could be to fill the gap in the x coordinates with NaNs; you should be able to do this with reindex or reindex_like. This should lead to a DataArray with a "continuous" longitude coordinate (all steps from 0 to 360) and lots of NaNs for the longitudes at which you had no data.
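A minimal sketch of the reindex idea (toy sizes and coordinate values, not the real SSM grid):

```python
import numpy as np
import xarray as xr

# Europe-only data brought to [0, 360): the longitude coordinate jumps
# from 50 straight to 300
lon_gapped = np.concatenate([np.arange(0, 51), np.arange(300, 360)])
da = xr.DataArray(
    np.ones((180, lon_gapped.size)),
    coords={"lat": np.arange(-90, 90), "lon": lon_gapped},
    dims=("lat", "lon"),
)

# reindex onto a continuous 0..359 axis; the missing longitudes become NaN
da_full = da.reindex(lon=np.arange(0, 360))
```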

QianqianHan96 commented 10 months ago

Hi Francesco, indeed, it saves a lot of time if we make the other datasets adapt to ERA5Land. In the preprocessing part, converting era5land from [0,360] to [-180,180] also takes time.

Thanks a lot for your detailed explanation, it helps a lot. I understand now: when I convert a non-global map from the [-180;180] range to the [0;360] range, I need to make sure the converted result has continuous longitudes. If it is global, it is of course continuous.

QianqianHan96 commented 10 months ago

Hi Francesco, I was just wondering: is it possible to select non-continuous longitude ranges in one go, e.g. [0,67] and [330,360]? I also mentioned this problem to Sarah during today's group meeting. @fnattino @SarahAlidoost

fnattino commented 10 months ago

Hi @QianqianHan96, I am not sure I understand what you mean. Do you mean selecting the longitude ranges [0;67] and [330;360] without dropping the coordinates in between?

Does the following do what you want?

```python
import dask.array as da
import numpy as np
import xarray as xr

# data array with longitude in the [0;360] range
tmp = xr.DataArray(
    data=da.random.random((180, 360), chunks=(45, 45)),
    coords={
        'lon': np.arange(360),
        'lat': np.arange(-90, 90),
    },
    dims=('lat', 'lon'),
)

# select longitude ranges [0;67] and [330;360]; the remaining pixels are set to NaN
tmp_crop = tmp.where((tmp.lon <= 67) | (tmp.lon >= 330))

# plot
tmp_crop.plot.imshow()
```

[image]

QianqianHan96 commented 10 months ago

Hi Francesco, this is what I meant. But I just checked: in this way, the unselected pixels become NaN, and .nbytes/2**30 is the same as for the original global image. Then the computation is equivalent to running on global data instead of just the Europe area?

If I use .sel, I get non-continuous longitudes between 67 and 330, and then I see the weird plot again (cell 127 in https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/4-SSM360.ipynb). But this way, the data size is decreased.

The data size is shown below (small test on CRIB); one timestep of an era5land image was tested. [image]

fnattino commented 10 months ago

> But I just checked: in this way, the unselected pixels become NaN, and .nbytes/2**30 is the same as for the original global image. Then the computation is equivalent to running on global data instead of just the Europe area?

Correct, it's quite wasteful. You could check whether the result makes sense when you run the resampling of SSM using the "non-continuous" coordinates (thus leaving a gap between 67 and 330). If the result makes sense, you can then fill in the gap in the longitude values using .reindex or .reindex_like at the very end, just before saving the data (I think the weird plots you got are only a visualization issue; the underlying data might actually be correct). Otherwise, for Europe, you really have to move to a [-180;180] longitude range.
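This workflow, selecting the two bands, processing the small gapped array, and only reindexing at the very end, could look roughly like this (a sketch with a toy array; the actual resampling step is omitted):

```python
import numpy as np
import xarray as xr

# toy global field on a [0, 360) grid
da = xr.DataArray(
    np.ones((180, 360)),
    coords={"lat": np.arange(-90, 90), "lon": np.arange(0, 360)},
    dims=("lat", "lon"),
)

# keep only the two European longitude bands: a small array whose
# longitude coordinate has a gap between 67 and 330
europe = xr.concat(
    [da.sel(lon=slice(0, 67)), da.sel(lon=slice(330, 359))], dim="lon"
)

# ... run the resampling / processing on the small, gapped array ...

# just before saving, make the longitude axis continuous again;
# the gap is filled with NaNs
europe_full = europe.reindex(lon=np.arange(0, 360))
```

Unlike the `where` approach, the intermediate `europe` array is much smaller than the global one, so the computation only touches the European pixels.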

QianqianHan96 commented 10 months ago

Thanks for your answer. Another thing I have to be careful about is the size of the output data, because of the storage on Snellius, and later I have to publish the data. Filling the gap when saving would make the output data bigger. Then I only have two options: 1) convert [0,360] to [-180,180]; I am not sure whether the computation will then exceed the limits, I will try. Otherwise, 2) run Europe in two parts, first [0,67], then [330,360], and save them into two files.

SarahAlidoost commented 7 months ago

> Thanks for your answer. Another thing I have to be careful about is the size of the output data, because of the storage on Snellius, and later I have to publish the data. Filling the gap when saving would make the output data bigger. Then I only have two options: 1) convert [0,360] to [-180,180]; I am not sure whether the computation will then exceed the limits, I will try. Otherwise, 2) run Europe in two parts, first [0,67], then [330,360], and save them into two files.

In branch fix_issue_23, I created a notebook that shows how the era5land and SSM global data are cut via the Europe shapefile. The output data are xarray datasets that can be saved as NetCDF files and read by ArcGIS. However, I have not tested it with ArcGIS. Please let me know whether the implementation in the notebook fixes this issue.

QianqianHan96 commented 7 months ago

> Thanks for your answer. Another thing I have to be careful about is the size of the output data, because of the storage on Snellius, and later I have to publish the data. Filling the gap when saving would make the output data bigger. Then I only have two options: 1) convert [0,360] to [-180,180]; I am not sure whether the computation will then exceed the limits, I will try. Otherwise, 2) run Europe in two parts, first [0,67], then [330,360], and save them into two files.

> In branch fix_issue_23, I created a notebook that shows how the era5land and SSM global data are cut via the Europe shapefile. The output data are xarray datasets that can be saved as NetCDF files and read by ArcGIS. However, I have not tested it with ArcGIS. Please let me know whether the implementation in the notebook fixes this issue.

Hi Sarah,

Thanks for your help. Indeed, if we convert [0,360] to [-180,180], Europe will have continuous longitudes. The problem is that converting the data from [0,360] to [-180,180] takes time. In the preprocessing part, we save every dataset into zarr with the [0,360] range, because ERA5 was by far the largest dataset, so we chose to adapt the other datasets to it. So if we convert [0,360] to [-180,180] for every dataset in the inference part, that will also take time.

Is there another way to select the whole of Europe in the [0,360] range? I am curious how the ERA5Land people do this.

In any case, only Europe and Africa lie across the 0° longitude line, so it is not a big problem.

SarahAlidoost commented 7 months ago

> Hi Sarah,
>
> Thanks for your help. Indeed, if we convert [0,360] to [-180,180], Europe will have continuous longitudes. The problem is that converting the data from [0,360] to [-180,180] takes time.

Please note that the implementation in the notebook 1-ERA5-land.ipynb saves only one year of era5land data to a zarr file; using a fat node and 16 cores on Snellius, it still took about 1 hour.

However, see the notebook Cut_xarray_dataset.ipynb: I updated it using dask, on my local computer with only 5 cores. I used two files of era5_land data in this example. The notebook converts the longitude values, cuts the data, and saves it in zarr format. It took 38 s for the two files. As you can see, this is a much faster approach than the one in notebook 1-ERA5-land.ipynb. Also, the zarr files are much smaller.

Two comments about xr.open_mfdataset:

  • the coordinates can be converted before opening the dataset with xr.open_mfdataset; see the preprocess function in the notebook
  • there is no need to use glob for paths: xr.open_mfdataset accepts glob strings such as "path/to/my/files/*.nc"
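A minimal sketch of this pattern (the function name is illustrative; the real notebook's preprocess function may differ, and the glob path is a placeholder):

```python
import numpy as np
import xarray as xr

def to_minus180_180(ds):
    """Shift longitudes from [0, 360) to [-180, 180) and re-sort them."""
    return ds.assign_coords(lon=(ds.lon + 180) % 360 - 180).sortby("lon")

# with open_mfdataset the function runs on each file before concatenation, e.g.:
# ds = xr.open_mfdataset("path/to/my/files/*.nc", preprocess=to_minus180_180)

# quick check on a single in-memory dataset instead of files on disk
ds = xr.Dataset(
    {"t2m": ("lon", np.arange(360.0))}, coords={"lon": np.arange(0, 360)}
)
ds180 = to_minus180_180(ds)
```

Converting per file inside preprocess keeps each conversion small and lazy, instead of shifting the whole concatenated dataset afterwards.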

QianqianHan96 commented 7 months ago

> Hi Sarah, Thanks for your help. Indeed, if we convert [0,360] to [-180,180], Europe will have continuous longitudes. The problem is that converting the data from [0,360] to [-180,180] takes time.
>
> Please note that the implementation in the notebook 1-ERA5-land.ipynb saves only one year of era5land data to a zarr file; using a fat node and 16 cores on Snellius, it still took about 1 hour.
>
> However, see the notebook Cut_xarray_dataset.ipynb: I updated it using dask, on my local computer with only 5 cores. I used two files of era5_land data in this example. The notebook converts the longitude values, cuts the data, and saves it in zarr format. It took 38 s for the two files. As you can see, this is a much faster approach than the one in notebook 1-ERA5-land.ipynb. Also, the zarr files are much smaller.
>
> Two comments about xr.open_mfdataset:
>
>   • the coordinates can be converted before opening the dataset with xr.open_mfdataset; see the preprocess function in the notebook
>   • there is no need to use glob for paths: xr.open_mfdataset accepts glob strings such as "path/to/my/files/*.nc"

Thanks for your help, Sarah. Actually, https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/1-ERA5-land.ipynb used 64 cores, but it saves one year of era5land data with 8 variables, which means 96 data files. It also saved the global area (around 15 times larger than Europe). We want to run the global area in the end, so that is 96/2 × 15 = 720 times heavier than 2 files for Europe. Then 38 seconds × 720 ≈ 7.6 hours. But you used 5 cores and I used 64 cores, so maybe it will take 7.6 hours / (64/5) ≈ 0.6 hours if it scales linearly, or maybe longer.

But I am not sure whether it is better to rerun the preprocessing script for all the input data for 10 years, since I kept [0,360] for all the input data (ERA5Land, LAI, SSM, CO2, landcover, canopyheight, vcmax) during preprocessing, or to just split Europe into two parts during inference.