dtcenter / METplus

Python scripting infrastructure for MET tools.
https://metplus.readthedocs.io
Apache License 2.0

Performance issue impacting WCOSS2 users #2363

Closed: dkokron closed this issue 12 months ago

dkokron commented 12 months ago

I am working with John Wagner to optimize his workflow on the WCOSS2 systems. The workflow makes heavy use of the stat_analysis utility: it can run ~60 separate PBS jobs, each running ~70 separate instances of stat_analysis. With MET_TMP_DIR set to a directory within the user's Lustre area, I found that most of the stat_analysis processes were stuck in "uninterruptible sleep", with only a couple of processes on each compute node making progress at any given time.
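For context, each job sets the temporary directory and fans out the stat_analysis runs roughly like this. This is an illustrative sketch, not the actual production script: the PBS directives, directories, and config names (`CONFIG_DIR`, `STAT_DIR`, the Lustre path) are placeholders.

```bash
#!/bin/bash
#PBS -N stat_analysis_batch
#PBS -l walltime=01:00:00

# Illustrative: MET temporary files placed in the user's Lustre area
export MET_TMP_DIR=/lfs/h2/users/$USER/met_tmp
mkdir -p "$MET_TMP_DIR"

# ~70 concurrent stat_analysis instances per job
# (CONFIG_DIR and STAT_DIR are placeholders)
for cfg in "$CONFIG_DIR"/*.conf; do
    stat_analysis -lookin "$STAT_DIR" -config "$cfg" &
done
wait
```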

Here are some typical stack traces captured while investigating (gdb -p ...): stat_analysis_traceback.txt
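For reference, a batch invocation along these lines (the process selection and output file names are illustrative) collects a backtrace from each stuck process without interactive attaching:

```bash
# Dump a full backtrace from every running stat_analysis process on the node
for pid in $(pgrep -u "$USER" stat_analysis); do
    gdb -p "$pid" -batch -ex 'thread apply all bt' > "bt_${pid}.txt" 2>&1
done
```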

I used the strace utility to profile a single instance of stat_analysis and discovered the following pattern of file system activity: stat_analysis_strace.txt
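Something like the following invocation reproduces that profile, restricting the trace to path- and descriptor-based syscalls (the stat_analysis arguments here are illustrative placeholders):

```bash
# Trace file-related syscalls (open/stat/read/write/unlink/...) with timestamps
strace -f -tt -e trace=file,desc \
       -o stat_analysis_strace.txt \
       stat_analysis -lookin "$STAT_DIR" -config my_config.conf
```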

In summary, the strace data reveals a repeating pattern of:

  1. Attempt to access two files (met_config_48870_0 and met_config_48870_1).
  2. Create the one that doesn't exist (usually met_config_48870_1).
  3. Read some data (a threshold) from the one that did exist, then close it.
  4. Write that data to the other file, then close it.
  5. Reopen the new file, read back the data just written, then close it.
  6. Finally, delete both files (unlink).

This pattern is repeated 8128 times for the 'po' element type.
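To see why this churn is so expensive on Lustre, a rough micro-benchmark can mimic the per-iteration sequence from the strace and be timed on each file system. This is only a sketch with hypothetical file names; it approximates the syscall pattern, not MET's actual code:

```bash
#!/bin/bash
# Mimic the access/create/read/write/reopen/unlink pattern seen in the strace
TMP=${MET_TMP_DIR:-/tmp}
time for i in $(seq 1 8128); do
    a="$TMP/met_config_${i}_0"
    b="$TMP/met_config_${i}_1"
    echo ">0.5" > "$a"              # seed the "existing" file
    [ -e "$a" ] && [ ! -e "$b" ]    # access both files; one is missing
    cat "$a" > "$b"                 # read the existing file, write the other
    cat "$b" > /dev/null            # reopen the new file and read it back
    rm -f "$a" "$b"                 # unlink both
done
```

Running this once with TMP on Lustre and once on /tmp makes the metadata-operation cost difference directly visible.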

Is this the expected behavior?

Setting MET_TMP_DIR=/tmp (a node-local RAM file system) allowed the entire workflow to complete in ~12 minutes, while the usual MET_TMP_DIR setting results in runtimes of 4-5 hours. With 60 × 70 = 4200 of these processes attempting to run at the same time, the Lustre file system becomes bogged down for all users.
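The change amounts to a single line in each job script before any stat_analysis invocation (a minimal sketch):

```bash
# Redirect MET temporary files to the node-local RAM file system
export MET_TMP_DIR=/tmp
```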

Is setting MET_TMP_DIR=/tmp the recommended solution to this issue?