UMEP-dev / UMEP-processing


Remove temporary directory for SVFs #56

Closed jlegewie closed 6 months ago

jlegewie commented 6 months ago

Sorry, but my previous PR requires another change: deleting the temporary folder. Otherwise, the user ends up with dozens or hundreds of temporary folders. Previously, the files were just overwritten because SOLWEIG always used the same locations (which was problematic when multiple instances were running at the same time).

biglimp commented 6 months ago

Thank you for this correction. I will merge. A question: what system/technique are you using to parallelize the model? It would probably be interesting for others as well.

jlegewie commented 6 months ago

I am using the HPC system at our university, which uses SLURM as a workload manager. I parallelize the model using SLURM job arrays. With some simplification, the basic steps are:

  1. Get comfortable submitting jobs via SLURM and learn about job arrays.
  2. Write a script in Python, R or your preferred language that invokes qgis_process run "umep:Outdoor Thermal Comfort: SOLWEIG" with appropriate settings.
  3. Modify your script to read the environment variable SLURM_ARRAY_TASK_ID. For example, in R you can use task_id <- Sys.getenv("SLURM_ARRAY_TASK_ID") %>% as.integer() and in Python task_id = int(os.getenv('SLURM_ARRAY_TASK_ID')). task_id will take on a different value for each instance of the job (e.g. 1-200 if you submit a job array with --array 1-200).
  4. Use the task_id variable to change the SOLWEIG settings appropriately. For example, my overall raster is separated into over 200 tiles; task_id determines which tile is used for the input DSM, CDSM, DEM, wall height, wall aspect and SVFs, as well as the output folder. A rough sketch of such a driver script is shown below.

I used the same approach to parallelize the computation of the wall height, wall aspect and SVFs for each tile.
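As a rough illustration, a per-task driver script along these lines might look as follows; the tile/folder layout and most of the qgis_process parameter names here are assumptions made for the example, not the exact settings used:

    # Per-task driver: pick a tile based on the SLURM array index and run SOLWEIG on it.
    import os
    import subprocess

    # Each array task gets its own index, e.g. 1-200 for --array 1-200.
    task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])

    # Hypothetical layout: one input folder per tile, one output folder per task.
    tile_dir = f"/data/tiles/tile-{task_id}"
    out_dir = f"/data/results/tile-{task_id}"
    os.makedirs(out_dir, exist_ok=True)

    # Parameter names below are illustrative, not the authoritative SOLWEIG parameter list.
    cmd = [
        "qgis_process", "run", "umep:Outdoor Thermal Comfort: SOLWEIG", "--",
        f"INPUT_DSM={tile_dir}/dsm.tif",
        f"INPUT_CDSM={tile_dir}/cdsm.tif",
        f"INPUT_DEM={tile_dir}/dem.tif",
        f"INPUT_SVF={tile_dir}/svfs.zip",
        f"INPUT_HEIGHT={tile_dir}/wall_height.tif",
        f"INPUT_ASPECT={tile_dir}/wall_aspect.tif",
        f"OUTPUT_DIR={out_dir}",
    ]
    subprocess.run(cmd, check=True)

Submitted with something like sbatch --array=1-200, each array task then processes its own tile independently.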

biglimp commented 6 months ago

Thanks. At some point, it would be very interesting/useful if you could share your script for other users to see. More questions:

  1. Do you parallelize over space or time, i.e. do you compute long time series or extensive areas?
  2. So these changes made in SOLWEIG are not needed for SVF and/or Wall height and aspect?

luise-wei commented 6 months ago

Hi, I was also working with SOLWEIG on an HPC with SLURM, and from my experience I have another 5 cents to add on this topic.

Depending on how you set up the SLURM script, you might be able to create job- and/or task-specific directories and then run SOLWEIG inside those newly created directories. I found this:

    jobDir=Job_$SLURM_ARRAY_TASK_ID
    mkdir $jobDir
    cd $jobDir

here: https://docs.hpc.cam.ac.uk/hpc/user-guide/batch.html#array-jobs

It should also be possible to do that on the /scratch file system, right? I have learnt that it is good practice to create directories on scratch like /scratch/<username>/<job_id>, so I would follow that and go with the pattern /scratch/<username>/<job_id>/<task_id>. After the job finishes (as the final step of the script), you should also be able to copy the content (i.e. the SOLWEIG results) from the /scratch system back to the persistent file system (further details can be found e.g. here: https://hpc-unibe-ch.github.io/file-system/scratch.html#example-including-data-movement).

So: copy data to scratch > create task-specific dir > run SOLWEIG > copy files back to local.

This approach targets a higher directory level, so instead of having <result-folder>/<OUTPUT_DIR>/temp_12345678 it would rather be something like <result-folder>/<JOB_ID>/<TASK_ID>/<OUTPUT_DIR>/temp. This also avoids the situation where multiple instances write to the exact same directory.
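A rough Python sketch of that copy-to-scratch workflow (all paths are hypothetical; only the standard SLURM environment variables are assumed to be set):

    # Task-specific scratch workflow: copy inputs in, run, copy results back, clean up.
    import os
    import shutil

    user = os.environ["USER"]
    job_id = os.environ["SLURM_ARRAY_JOB_ID"]
    task_id = os.environ["SLURM_ARRAY_TASK_ID"]

    # Task-specific working directory on the scratch file system.
    work_dir = f"/scratch/{user}/{job_id}/{task_id}"
    os.makedirs(f"{work_dir}/output", exist_ok=True)

    # 1. Copy this task's input data to scratch (source path is hypothetical).
    shutil.copytree(f"/home/{user}/solweig_inputs/tile-{task_id}", f"{work_dir}/inputs")

    # 2. Run SOLWEIG here with OUTPUT_DIR pointing at f"{work_dir}/output",
    #    e.g. via a qgis_process call like the one sketched earlier in this thread.

    # 3. Copy the results back to persistent storage and remove the scratch directory.
    result_dir = f"/home/{user}/solweig_results/{job_id}/{task_id}"
    shutil.copytree(f"{work_dir}/output", result_dir)
    shutil.rmtree(work_dir)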

Thinking about that, I was wondering whether your OUTPUT_DIR folder was also affected by overwriting. Did you set task-specific output directories, like _ ?

jlegewie commented 6 months ago

@biglimp: I parallelize over both space and time: 134 tiles by (initially) 150 days, so SOLWEIG has to run 20,100 times, and each run takes about 20 min. I will post some code later. However, a big part is setting up the HPC system with the correct software etc. For example, installing QGIS was a pain...

I didn't have the same problem with SVF and/or Wall height and aspect; there I can easily change the output folder depending on task_id. The problem with SOLWEIG was that it extracts svfs.zip to the same internal location, so if multiple instances are running at the same time they overwrite each other's extracted files. SVF and Wall height and aspect do not rely on svfs.zip.

@luise-wei Good to hear that I am not alone! :) I just change OUTPUT_DIR as a function of task_id. In my case, OUTPUT_DIR is partly based on the tile and date, so the folder structure looks like .../tile-1/2018-07-20/. Your approach works similarly. However, this does not solve the issue addressed by this pull request, which is unrelated to OUTPUT_DIR: SOLWEIG extracts svfs.zip to an internal, temporary folder.
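To illustrate the pattern this pull request aims at (each SOLWEIG instance extracting svfs.zip into its own temporary folder and removing it afterwards), a minimal sketch could look like this; it only illustrates the idea and is not the actual SOLWEIG code:

    import shutil
    import tempfile
    import zipfile

    svf_zip = "/data/tiles/tile-1/svfs.zip"  # hypothetical input path

    # Unique per-process folder, so concurrent instances never share a location.
    temp_dir = tempfile.mkdtemp(prefix="svf_")
    try:
        with zipfile.ZipFile(svf_zip) as zf:
            zf.extractall(temp_dir)
        # ... read the extracted SVF rasters from temp_dir here ...
    finally:
        shutil.rmtree(temp_dir)  # remove the temporary folder when done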

luise-wei commented 6 months ago

@jlegewie Yes, I do see the issue with a single /temp, so I like the suggestion of adding random numbers to the folder name 👍

I'm surprised by the runtime of SOLWEIG you achieve on the HPC. May I ask what tile size, resolution and weather-data frequency you use? From what I tested recently (not on an HPC, but on a workstation), even with a tile size of 1500x1500 at 2 m resolution and 24 hourly data points I only reached about 10-12 minutes for a SOLWEIG run, which was close to double the runtime of a 1000x1000 tile at the same 2 m resolution (around 4 minutes). What I conclude from that is that one has to find a good balance between tile size and overall runtime (e.g. too many small tiles are probably not sensible either). In case runtime becomes an issue and you haven't investigated this further, it may make sense to re-think the tiling setup.

On the QGIS pain: for my work I ended up changing some code to get rid of QGIS entirely as a dependency for the SOLWEIG, SVF and wall aspect/height calculations, because I wasn't able to set up a working GDAL/QGIS/Python combination with what was available on the HPC I was working on.

jlegewie commented 6 months ago

@luise-wei : Getting rid of the QGIS dependency sounds great.

Tile size is 1773 x 1771 at 2 m resolution, which includes a 400 m buffer zone that overlaps with adjacent tiles.

My guess is that the difference in runtime comes from the INPUT_ANISO option (see this article). Without the anisotropic model for sky diffuse radiation, the runtime is more like 5-10 min and memory usage is lower as well.

The storage requirements are also significant and I made some changes to address that. Happy to share if that is an issue for you.

luise-wei commented 6 months ago

Ah, interesting! Thanks for the hint. I haven't used the INPUT_ANISO option yet, so I wasn't aware. Storage is currently not an issue, but if I switch to INPUT_ANISO, I'll reach out to you!