ErikKusch / KrigR

An R Package for downloading, preprocessing, and statistical downscaling of the European Centre for Medium-range Weather Forecasts ReAnalysis 5 (ERA5) family provided by the European Centre for Medium‐Range Weather Forecasts (ECMWF).
MIT License

download_ERA() creating many temporary files #62

Closed stephstewart02 closed 4 months ago

stephstewart02 commented 4 months ago

Hi @ErikKusch, and thanks for your work developing this. Your package was recommended for use in tandem with another package to calculate growing degree days using ERA5-Land data. This requires hourly global data. As I understand it, this is the native temporal resolution of the ERA5 data, so I expected that minimal processing would be necessary. However, when I run my code, hundreds of GB of data are saved to the temp folder as files with .gri and .grd extensions, even when I pull just 2 days of global data (I stopped the call after many hours and 350 GB of temporary files). I am fairly new to working with geospatial data, so any guidance would be appreciated. The code I am running is as follows:

QS_Raw <- download_ERA(
  Variable = "2m_temperature",
  DataSet = "era5-land",
  DateStart = "1995-01-02",
  DateStop = "1995-01-03",
  TResolution = "hour",
  TStep = 1,
  Dir = Dir.Data,
  FileName = "QS_Raw_testv2",
  API_User = API_User,
  API_Key = API_Key
)

ErikKusch commented 4 months ago

Hiya,

sorry to hear you are experiencing this issue. I have no idea what is happening there, I am afraid. Could you post the console output of the download_ERA() function up until the creation of the many temp files starts?

Cheers, E

stephstewart02 commented 4 months ago

Hi Erik, I'm pretty sure all the temp files are created when the aggregation starts. These are the lines I see in the console while all the temp files are being saved out:

`download_ERA() is starting. Depending on your specifications, this can take a significant time.
User 295984 for cds service added successfully in keychain
Staging 1 download(s).
0001_QS_Raw_testv2.nc download queried
Requesting data to the cds service with username 295984

Checking for known data issues.
Loading downloaded data for masking and aggregation.
Aggregating to temporal resolution of choice`

I have tried this both on Windows and on Linux, and have the same issue. I had to cut it off after creating almost a TB of temporary files as I was worried it would use up all the storage space on a shared server. For me the temp files are all saved in "/temp/RtmpSfK59C/raster/".
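In case it helps anyone reproducing this, here is a minimal sketch for checking how much space those temporary files take up in the current R session. It assumes the files sit in the "raster" subfolder of the session temp directory, as in the path above; `temp_usage_gb()` is a hypothetical helper, not part of KrigR or raster.

```r
# Minimal sketch: report how much disk space the raster temp files use.
# Assumption: they live in file.path(tempdir(), "raster"), as in the path above.
tmp_dir <- file.path(tempdir(), "raster")

# Hypothetical helper: total size of .gri/.grd files in a folder, in GB.
# Returns 0 if the folder does not exist or contains no matching files.
temp_usage_gb <- function(dir) {
  files <- list.files(dir, pattern = "\\.(gri|grd)$", full.names = TRUE)
  sum(file.size(files)) / 1024^3
}

cat("raster temp files:", round(temp_usage_gb(tmp_dir), 2), "GB\n")

# Optional clean-up, only attempted if the raster package is installed.
# h = 0 removes temp files of any age, so Raster* objects in the session
# that still point at those files will no longer be readable afterwards.
if (requireNamespace("raster", quietly = TRUE)) {
  raster::removeTmpFiles(h = 0)
}
```

Running the size check from a second R session would not work as written, since each session has its own temp directory; in that case the folder path would need to be passed in explicitly.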

ErikKusch commented 4 months ago

Alright! I believe I found the culprit. It is the conversion of raster objects to SpatRasters for the terra output, which I implemented some time ago (https://github.com/ErikKusch/KrigR/pull/45). I am already working towards a new deployment of KrigR that gets rid of this step, and I expect this release to happen in the next two to three months.

Obviously, this does not solve your issue right now. So, here is what I suggest as a workaround:

  1. Stop the execution of download_ERA() after all raw data has been downloaded
  2. Load the data into R. In your case that would be: QS_Raw <- stack("/home/sf/l1sas04/Data/0001_QS_Raw_testv2.nc")

This will load the raw data downloaded from the CDS as a raster stack object. Would this be a viable solution for now?
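A slightly fuller sketch of the loading step, in case it is useful. `load_raw_era()` is a hypothetical helper, not a KrigR function, and the file name below is a placeholder that needs to point at wherever the raw download ended up. Reading .nc files through raster also assumes the ncdf4 package is installed.

```r
# Sketch of the workaround: read the raw CDS download directly as a RasterStack.
# load_raw_era() is a hypothetical helper, not part of KrigR.
load_raw_era <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  # raster::stack() turns each time step in the netCDF file into one layer
  # (the ncdf4 package must be installed for .nc files)
  raster::stack(path)
}

# Placeholder path; point this at the file download_ERA() wrote into Dir
nc_file <- "0001_QS_Raw_testv2.nc"
if (file.exists(nc_file)) {
  QS_Raw <- load_raw_era(nc_file)
  print(raster::nlayers(QS_Raw))  # number of hourly time steps that were loaded
}
```

Checking the layer count against the number of hours requested is a quick way to confirm the download was complete before stopping the function.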

I am afraid that getting rid of the conversion step without bricking other essential functionality would be more hassle than it is worth at this point, given that a new release addressing this issue is coming up anyway.

stephstewart02 commented 4 months ago

Thanks for your speedy detective work! That works fine as a stopgap measure, but is there any additional formatting or manipulating that is done with the KrigR package? I ask because I have loaded in raw ERA5-Land data downloaded from the CDS API via Python using the stack function, but it isn't in a format that is compatible with a function I am trying to run in the other package I alluded to in my first message. The README for that package specifically says that the KrigR function gets the ERA5-Land data into an appropriate format, so I am trying to figure out how to get the raw data into the right format since I can't use the download_ERA function.

ErikKusch commented 4 months ago

The current version of KrigR only ensures that time components are saved properly to the time-slot in netCDF files. That being said, you aren't doing any temporal aggregation so this should not affect the files KrigR would produce if we didn't need to have the stopgap in place.

However, I think we can actually avoid the stopgap altogether now with the latest version of KrigR on the development branch. I just finished a first re-deploy of the download and temporal aggregation functionality there. Kriging is disabled on the development branch, but it sounds like you won't need it. You can install the development version like so:

devtools::install_github("https://github.com/ErikKusch/KrigR", ref = "Development")

Note that the new download function is called CDownloadS() there.

On to the formatting specifics for stagg - the documentation seems to specify a RasterBrick. KrigR produces SpatRasters. You can transform SpatRaster objects into RasterBrick objects like so:

raster::brick(SPATRASTEROBJECT)
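
As a small worked sketch of that conversion: the toy SpatRaster below is made up purely for illustration and stands in for whatever KrigR returns. It assumes reasonably recent versions of both terra and raster are installed, since the SpatRaster method for `brick()` comes with newer raster releases.

```r
# Toy example: build a small 3-layer SpatRaster and convert it to a RasterBrick.
# The grid size and random values are made up; real KrigR output would go here.
spat <- terra::rast(nrows = 10, ncols = 20, nlyrs = 3)
terra::values(spat) <- runif(terra::ncell(spat) * terra::nlyr(spat))

# Conversion step suggested above
brick_obj <- raster::brick(spat)

# Quick checks that the result is what downstream functions expect
print(class(brick_obj))
print(raster::nlayers(brick_obj))
```

If the layer names or time information matter for the downstream function, it is worth comparing `names(spat)` and `names(brick_obj)` after the conversion.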

ErikKusch commented 4 months ago

Please let me know if this resolves your issue :-)

ErikKusch commented 4 months ago

Given the thumbs up on the proposed solution, I assume this has resolved your issue and I am closing this issue. Feel free to ping me again here to reopen it if you run into issues.