davidfrantz / force

Framework for Operational Radiometric Correction for Environmental monitoring
GNU General Public License v3.0

libgomp: Thread creation failed #14

Open davidfrantz opened 4 years ago

davidfrantz commented 4 years ago

Reported by @jakimowb via email.

The Level 2 ImproPhe submodule in force-higher-level occasionally throws this error:

________________________________________
Progress:                         26.00%
Time for I/C/O:           007%/092%/001%
ETA:             00y 00m 02d 21h 28m 59s
________________________________________
                   input compute  output
Processing unit:      27      26      25
Tile X-ID:            49      49      49
Tile Y-ID:            27      27      27
Chunk ID:             27      26      25
Threads:               8      22       4
Time (sec):          224    2747      33

libgomp: Thread creation failed: Resource temporarily unavailable
double free or corruption (!prev)
[1]    1148 abort (core dumped)  force-higher-level level2imp.prm.workaround/level2imp.prm

davidfrantz commented 4 years ago

There is still a general threading issue in force-higher-level.

It mostly surfaces when using the Level 2 ImproPhe submodule.

I guess it is related to the nested parallelism with OpenMP, wherein 3 teams are used to stream the data. The first team reads data for processing unit pu+1, the second team computes data in pu, and the third team outputs data from pu-1. The teams work simultaneously. Each team can have multiple sub-threads to do its work in parallel.
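To make the layout concrete, here is a minimal sketch of such a three-team streaming loop. It is not FORCE's actual code: `stream_units`, `N_UNITS`, `CHUNK` and the buffer are hypothetical, the "read", "compute" and "write" bodies are placeholders, and the three arguments only stand in for the per-team thread counts. It assumes a compiler with OpenMP support (e.g. GCC with `-fopenmp`).

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define N_UNITS 4     /* hypothetical number of processing units */
#define CHUNK   1000  /* hypothetical number of values per unit */

/* Streams all units through three concurrent stages (read pu+1,
   compute pu, write pu-1) and returns the sum of all written values. */
double stream_units(int nthread_read, int nthread_compute, int nthread_write)
{
    static double buf[N_UNITS][CHUNK];
    double checksum = 0.0;

#ifdef _OPENMP
    omp_set_max_active_levels(2);  /* permit the inner parallel regions */
#endif

    /* One extra pass at each end fills and drains the pipeline. */
    for (int pu = -1; pu <= N_UNITS; pu++) {
        #pragma omp parallel sections num_threads(3)
        {
            #pragma omp section
            {   /* team 1: placeholder "read" of unit pu+1 */
                int r = pu + 1;
                if (r >= 0 && r < N_UNITS) {
                    #pragma omp parallel for num_threads(nthread_read)
                    for (int i = 0; i < CHUNK; i++)
                        buf[r][i] = (double)(r + 1);
                }
            }
            #pragma omp section
            {   /* team 2: placeholder "compute" on unit pu */
                int c = pu;
                if (c >= 0 && c < N_UNITS) {
                    #pragma omp parallel for num_threads(nthread_compute)
                    for (int i = 0; i < CHUNK; i++)
                        buf[c][i] *= 2.0;
                }
            }
            #pragma omp section
            {   /* team 3: placeholder "write" of unit pu-1 */
                int w = pu - 1;
                if (w >= 0 && w < N_UNITS) {
                    double sum = 0.0;
                    #pragma omp parallel for num_threads(nthread_write) reduction(+:sum)
                    for (int i = 0; i < CHUNK; i++)
                        sum += buf[w][i];
                    checksum += sum;  /* only this section touches checksum */
                }
            }
        }
    }
    return checksum;
}
```

In each pass the three sections touch three different units, so they do not race on the buffer. Every pass also opens up to three inner parallel regions, which is the place where an OpenMP runtime has to either reuse or create threads.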

When doing the work sequentially, i.e. teams work sequentially, this issue does not appear.

I suspect that threads are not re-used and new ones are created instead, and that at some point the maximum number of threads allowed on the system is reached. But this is only a suspicion.
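One way to test this suspicion would be to log the live thread count next to the per-user limit once per processing unit. The sketch below is an assumption about how that could be done, not something that exists in FORCE: both function names are hypothetical, `RLIMIT_NPROC` is only one of several limits that can make thread creation fail with "Resource temporarily unavailable", and the thread count is read from `/proc/self/status`, so it only works on Linux.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Soft per-user limit on processes/threads (RLIMIT_NPROC),
   or -1 if it is unlimited or cannot be queried. */
long soft_nproc_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NPROC, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY)
        return -1;
    return (long)rl.rlim_cur;
}

/* Number of threads currently alive in this process, taken from the
   "Threads:" line of /proc/self/status (Linux only), or -1 on failure. */
long current_thread_count(void)
{
    FILE *fp = fopen("/proc/self/status", "r");
    char line[256];
    long n = -1;

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        if (sscanf(line, "Threads: %ld", &n) == 1)
            break;
    }
    fclose(fp);
    return n;
}
```

If the count climbs steadily from one processing unit to the next instead of staying near the sum of the three team sizes, that would support the idea that threads are created rather than re-used.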

Related to this: the memory footprint of the process keeps growing, which it doesn't when processing sequentially. I wasn't able to track down the problem. Memchecking with valgrind didn't show any memory leak.
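To put numbers on the growth, the peak resident set size could be sampled after each processing unit. This is only a sketch with a hypothetical helper name, `rss_growth_since`. It relies on POSIX `getrusage`, whose `ru_maxrss` field is reported in kilobytes on Linux (other systems use different units), and because it is a peak value it can only show growth, never shrinkage.

```c
#include <sys/resource.h>

/* Returns how much the peak resident set size has grown since the
   previous call and stores the new peak in *last. Units follow
   ru_maxrss (kilobytes on Linux). Returns -1 if getrusage fails. */
long rss_growth_since(long *last)
{
    struct rusage ru;
    long growth;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    growth = ru.ru_maxrss - *last;
    *last = ru.ru_maxrss;
    return growth;
}
```

Initialising the tracked value to 0 and calling the helper once per processing unit gives a per-unit growth figure that can be compared between the streamed and the sequential runs.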

jakimowb commented 4 years ago

So how do I process the *.prm file sequentially? Do I need to change, e.g.,

NTHREAD_READ = 8
NTHREAD_COMPUTE = 22
NTHREAD_WRITE = 4

to

NTHREAD_READ = 1
NTHREAD_COMPUTE = 1
NTHREAD_WRITE = 1

or should I just avoid running force-higher-level with parallel, e.g.

`ls *.prm | parallel -j8 force-higher-level {}`

Please note that the error mentioned above occurred running force-higher-level with a single prm file.