OSGeo / grass

GRASS GIS - free and open-source geospatial processing engine
https://grass.osgeo.org

[Bug] t.rast.aggregate fails when GUI Settings → Number of threads for parallel computing set to more than 1 with active processing mask #4708

Open Falconus opened 1 week ago

Falconus commented 1 week ago

Description

When running the t.rast.aggregate tool with an active processing mask, the tool fails if the number of threads for parallel computing is set to >1 in the GUI settings (Settings → Preferences). When that setting is reset to 1 and saved, the tool works as expected with no errors; it also works with no errors when the mask is removed.

t.rast.aggregate --overwrite input=uas_dsm@assignment5b output=uas_dsm_aggr basename=uas_dsm_aggr suffix=time granularity=1 months nprocs=1
WARNING: Parallel processing disabled due to active MASK.
Traceback (most recent call last):
  File "/usr/local/grass84/scripts/t.rast.aggregate", line
245, in <module>
    main()
  File "/usr/local/grass84/scripts/t.rast.aggregate", line
195, in main
    output_list = tgis.aggregate_by_topology(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/grass84/etc/python/grass/temporal/aggrega
tion.py", line 383, in aggregate_by_topology
    process_queue.put(mod)
  File "/usr/local/grass84/etc/python/grass/pygrass/modules/
interface/module.py", line 253, in put
    self.wait()
  File "/usr/local/grass84/etc/python/grass/pygrass/modules/
interface/module.py", line 311, in wait
    proc.wait(),
    ^^^^^^^^^^^
  File "/usr/local/grass84/etc/python/grass/pygrass/modules/
interface/module.py", line 859, in wait
    raise CalledModuleError(
grass.exceptions.CalledModuleError: Module run `r.series fil
e=/media/christopher/Data/GIS_584/GRASS/Lake_Wheeler_NCspm/a
ssignment5b/.tmp/christopher-desktop/771233.0 method=average
nprocs=2 memory=122880
output=uas_dsm_aggr_2015_09_01T00_00_00 --o --q` ended with
an error.
The subprocess ended with a non-zero return code: -11. See
errors above the traceback or in the error output.
(Fri Nov 15 23:55:38 2024) Command ended with non-zero return code 1 (2 sec)    

To reproduce

  1. Set a processing mask
  2. From the Settings drop-down menu, select the "Preferences" menu item.
  3. In the Tools tab of the "GUI Settings" window, set the number of threads for parallel computing to a number greater than 1 (I tried 2 and 16, with the same results)
  4. Run the t.rast.aggregate tool. I used the following parameters: t.rast.aggregate --overwrite input=uas_dsm@assignment5b output=uas_dsm_aggr basename=uas_dsm_aggr suffix=time granularity=1 months nprocs=1. The nprocs parameter had no effect, regardless of whether it was set to 1 or 16.
  5. Observe tool failure

Expected behavior

The tool should not fail due to an active mask when the GUI default for the number of threads is set to >1.

System description


System Info                                                                     
GRASS version: 8.4.1dev                                                         
Code revision: cd76f8b7d1                                                       
Build date: 2024-11-15                                                          
Build platform: x86_64-pc-linux-gnu                                             
GDAL: 3.9.3                                                                     
PROJ: 9.6.0                                                                     
GEOS: 3.12.2                                                                    
SQLite: 3.45.1                                                                  
Python: 3.12.3                                                                  
wxPython: 4.2.2                                                                 
Platform: Linux-6.8.0-48-generic-x86_64-with-glibc2.39                          
python3 -c "import sys, wx; print(sys.version); print(wx.version())"
3.12.3 (main, Sep 11 2024, 14:17:37) [GCC 13.2.0]
4.2.2 gtk3 (phoenix) wxWidgets 3.2.6

Workaround

Set the default nprocs to 1 or remove the mask for the t.rast.aggregate step.
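
For scripted sessions, the same reset can be done without opening the GUI; a minimal sketch, assuming the GUI preference is stored in the NPROCS gisenv variable (which modules with an nprocs option read as their default):

import grass.script as gs

# Sketch: clear the session-wide thread default so that inner module calls
# (e.g. the r.series spawned by t.rast.aggregate) fall back to nprocs=1
gs.run_command("g.gisenv", set="NPROCS=1")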

veroandreo commented 4 days ago

Thanks for your report @Falconus. If you only use commands, i.e., set the nprocs parameter in t.rast.aggregate instead of via the GUI, does it also fail? Would you mind creating a command-line reproducible example with the North Carolina dataset?

Falconus commented 4 days ago

Using the time series dataset from the Data page:

# Note: set default processors to >1 to test
t.info input=LST_Day_monthly@modis_lst

# Set region to maximum extent of LST_Day_monthly
g.region n=760180.124115 s=-415819.875885 e=1550934.464115 w=-448265.535885 -pa

# Probably not relevant, but I did this anyway
t.rast.series -n input=LST_Day_monthly@modis_lst method=count output=intersection

# Set region to subset
g.region n=550997 s=156914 e=626823 w=-56871

# Export region to polygon
v.in.region output=mask_area

# Reset region back to max extent
g.region n=760180.124115 s=-415819.875885 e=1550934.464115 w=-448265.535885 -pa

# Set mask from polygon
r.mask vector=mask_area@modis_lst

# The following two tasks fail. Note the granularity is set to 2 months, since 1 month doesn't do anything (nothing to aggregate)
t.rast.aggregate input=LST_Day_monthly output=LST_aggr basename=LST_ suffix=time granularity="2 months" method=average
t.rast.aggregate input=LST_Day_monthly output=LST_aggr basename=LST_ suffix=time granularity="2 months" method=average nprocs=1 --overwrite

# Remove mask
r.mask -r

# The following task succeeds without the mask
t.rast.aggregate input=LST_Day_monthly output=LST_aggr basename=LST_ suffix=time granularity="2 months" method=average --overwrite

petrasovaa commented 3 days ago

I think this is the same as #4297. There may be other tools impacted as well.

ninsbl commented 3 days ago

Yes, I guess e.g. t.rast.series is affected in the same way. But virtually every Python module that uses OpenMP-parallelized modules under the hood would be affected.

Ideally, this is fixed in the Python modules, I guess. We could probably create a library function that:

  1. checks if a mask is present and, if so, deactivates OpenMP parallelism (if the module is parallelized that way); or, if no mask is present,
  2. passes an "nprocs" parameter down to the parallelized modules; or
  3. for temporal modules, checks how many parallel module calls will be executed and then a. runs N modules in parallel if N module calls >= nprocs, or b. distributes nprocs across inner processes (each single module call with nprocs > 1) and outer processes (N parallel module calls > 1) if N parallel module calls < nprocs

Not sure if case b could be written to find an optimal balance between inner and outer processes; a rough sketch of such a function follows below.
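
A minimal sketch of such a helper (hypothetical function, not existing GRASS API), assuming grass.script's gisenv() and find_file() are used to detect an active MASK:

import grass.script as gs

def split_nprocs(n_module_calls, nprocs):
    """Return (outer, inner): how many module calls to run concurrently
    and how many OpenMP threads each call may use. Hypothetical helper;
    assumes n_module_calls >= 1."""
    mapset = gs.gisenv()["MAPSET"]
    if gs.find_file(name="MASK", element="cell", mapset=mapset)["name"]:
        # Case 1: active MASK, so keep OpenMP-parallelized modules
        # single-threaded until #4297 is fixed
        return min(nprocs, n_module_calls), 1
    if n_module_calls >= nprocs:
        # Case a: enough module calls to spend all processors on outer
        # parallelism; each module call stays single-threaded
        return nprocs, 1
    # Case b: fewer module calls than processors, so hand the spare
    # processors to the inner (OpenMP) level; not necessarily optimal
    return n_module_calls, max(1, nprocs // n_module_calls)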

We use a very simplistic approach for something like this here: https://github.com/NVE/actinia_modules_nve/blob/762f55bac991c1b5424e87d04340d435800c0b0c/src/temporal/t.pytorch.predict/t.pytorch.predict.py#L695

petrasovaa commented 1 day ago

I need to think this through more, but it seems to me there are two separate issues: one needs to be fixed in the C tools (#4297), and the other is how to deal with the nprocs parameter in the Python temporal tools (and there is also the environment variable).
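
On the Python side, the guard could look roughly like this (hypothetical wrapper, assuming pygrass's Module and grass.script's find_file()): while a raster MASK is active, force the inner OpenMP call to a single thread regardless of the GUI/gisenv default:

import grass.script as gs
from grass.pygrass.modules import Module

def run_mask_safe(module_name, **kwargs):
    # Hypothetical wrapper: with an active MASK, override any nprocs
    # default (e.g. from the NPROCS gisenv set in the GUI) with 1 to
    # avoid the OpenMP crash in the C tools (#4297)
    mapset = gs.gisenv()["MAPSET"]
    if gs.find_file(name="MASK", element="cell", mapset=mapset)["name"]:
        kwargs["nprocs"] = 1
    Module(module_name, **kwargs)

# Example (names illustrative only):
# run_mask_safe("r.series", input="map1,map2", method="average",
#               output="LST_aggr_1", overwrite=True)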