Open falkamelung opened 3 years ago
January 5
Hi Falk,
1) Yes, you need to change the stripe count before you create/copy the file. That is how the Lustre file system works.
2) If the files are small (around 0.3 GB) but the same files are reused many times, you are better off just making multiple copies ("striping" does not help much for small files).
For the input file(s), you can just make some copies for each run (or maybe one extra copy per node). One useful trick is to copy the input data file into /tmp before you start the heavy computation (one copy per node). Files kept in /tmp do not cause extra IO load on the shared filesystem and also help performance. Please feel free to let me know if you have any related questions or concerns.
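A minimal sketch of this per-node copy inside a SLURM job script (the input path is a placeholder; either sbcast or a one-task-per-node cp works):

INPUT=/scratch/xxxxx/project/inputfile      # placeholder path to the shared input
sbcast $INPUT /tmp/inputfile                # broadcast one copy to /tmp on every node of the job
# alternatively: srun --ntasks-per-node=1 cp $INPUT /tmp/inputfile
# ... the heavy computation then reads /tmp/inputfile instead of the $SCRATCH copy ...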
Best wishes, Si
Hi Falk,
Your Stampede2 account was blocked again.
It is very likely that you are running parallel programs that read the same files at the same time. The fix should be easy: please "stripe" the data files (mainly the input files) using the "lfs setstripe" command:
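A typical invocation (the stripe count of 8 and the directory path here are only example values; striping has to be set before the files are created or copied):

lfs setstripe -c 8 /scratch/xxxxx/project/inputs    # new files in this directory are spread across 8 OSTs (example count)
lfs getstripe /scratch/xxxxx/project/inputs         # verify the stripe settings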
That should resolve this problem. Please make the proper changes, and I am asking the Stampede2 admins to re-enable your Stampede2 account.
BTW, our Frontera administrators have enabled your Frontera account. Please follow the rules/restrictions we discussed earlier.
Thank you and best wishes, Si Liu
January 11 2021
Hi Falk,
We spent some time going through the system logs to track what raised the problem. Here is a list of jobs that probably caused the heavy IO load on the system. These jobs caused OSS problems (not MDS this time).
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
6719312 skx-normal run_09_f tg851601 PENDING 0:00 6:09:00 1 (Priority)
6739304 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739305 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739306 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739307 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739309 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739310 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739311 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739321 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6746003 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6746004 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6746005 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6746006 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6746007 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6746008 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6746009 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6719311 skx-normal run_09_f tg851601 RUNNING 1:51:53 6:09:00 1 c492-114
6719310 skx-normal run_09_f tg851601 RUNNING 1:52:54 6:09:00 1 c492-104
6719309 skx-normal run_09_f tg851601 RUNNING 2:02:34 6:09:00 1 c497-032
6745872 skx-normal run_06_o tg851601 RUNNING 0:00 1:44:00 1 c506-133
6745871 skx-normal run_06_o tg851601 RUNNING 1:53 1:44:00 1 c506-072
6745869 skx-normal run_06_o tg851601 RUNNING 21:24 1:44:00 1 c506-082
6745870 skx-normal run_06_o tg851601 RUNNING 21:24 1:44:00 1 c506-093
6746002 skx-normal run_13_g tg851601 RUNNING 31:33 1:02:00 2 c478-132,c485-01
What we can see is that some of your launcher jobs, like this one:
/scratch/05861/tg851601/NorthAnatoliaSenAT14/run_files/run_13_generate_burst_igram_0.job
were running 90 parallel file-processing jobs and were driving a high load on our OSS servers. My feeling is that these jobs are frequently reading the same input files. Please take a look at these jobs (run 13) again.
If you have any other questions or concerns, please let me know.
Best wishes, Si Liu
January 20 2021
Hi Falk,
There are two typical IO workload issues.
1) Heavy MDS workload. This is normally caused by high-frequency IO requests (like thousands of file open/close/stat operations in a short period of time). You should use python_cacher and OOOPS to help.
2) Heavy OSS workload. This is normally caused by the same file being loaded by many processes at the same time (like an input file loaded by the binaries running in each task). Make a copy of the file under /tmp on each compute node; that way you have one local copy shared by 48 (or fewer) tasks, instead of one global file shared by all tasks.
I am writing a program for users to distribute/collect files to/from /tmp. Let me send it to you after lunch.
Best wishes, Si Liu
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Hi Falk,
Here is the program to collect/distribute files from/to /tmp. Hopefully, this will help.
A) To set it up:
Please load the latest intel/19 compilers for best performance.
export CDTOOL=/scratch1/01255/siliu/collect_distribute #Frontera
export PATH=${PATH}:${CDTOOL}/bin
B) Distribute your files to /tmp by:
distribute.bash ${CDTOOL}/datafiles/inputfile
distribute.bash ${CDTOOL}/datafiles/inputdir
distribute.bash ${CDTOOL}/datafiles/outputfile
distribute.bash ${CDTOOL}/datafiles/outputdir
C) Collect your file from /tmp by:
collect.bash /tmp/outputdir ${CDTOOL}/datafiles/new_output_collected
collect.bash /tmp/outputfile ${CDTOOL}/datafiles/new_output_collected
This version should work for either file or directory.
1) If the number of files is not large, you can distribute the whole directory directly.
2) We are also thinking about a distribute+untar option; let me try to implement it this week.
This is a new program, so please use it carefully (make sure the files are actually under /tmp after the distribution step). If you notice any further issues, please feel free to let me know.
You may have some issues with "srun tar -xvf /tmp/test.tar" if you have multiple tasks per node. It may run more than once per node.
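One way to make sure the extraction runs only once per node is to launch it with a single task per node, for example:

srun --ntasks-per-node=1 tar -xvf /tmp/test.tar -C /tmp    # one tar per node, unpacking into /tmp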
Best wishes, Si Liu TACC HPC
Jan 21
--- Please make sure that you have the setup we discussed before (ooops, python_cacher, striped large files, etc.).
--- Meanwhile, please copy/distribute the required files to /tmp on each node in advance. This should help you avoid unnecessary OSS load. You can do it either in your "sbcast" way or with the tool I sent you earlier.
I will let you know when our system administrator has an update.
BTW, please try to reply to ticket 65018 for Stampede2 issues and ticket 65829 for Frontera issues in the future (This ticket is the Stampede2 one). It will be easier to track and follow.
My collect-distribute program is built with intel19 for best performance. If you need to run it now, please load intel19 before running the "distribute" program. I may hard-code the intel19 module usage "by force" later (maybe next week).
You should always have "module load ooops" and "module load python_cacher" in your SLURM job script.
What is the current limit you are using for your Stampede2 work? I remember 5 nodes and 10 nodes. What do you plan for Frontera?
Please note: your IO work has caused file system issues multiple times, for various reasons, on both Stampede2 and Frontera. Our system administrators are really worried about the IO work here. Please use the tools we provide and be very careful about the IO work. If you are not sure whether some work is going to be fine, please feel free to work with us before you run it in production.
Initial queue access to Stampede2 and Frontera was blocked because of too many simultaneous wget processes and because we did not use python_cacher. More recent blocks were because of too-heavy MDS and OSS workloads from some of the computing steps (we have a total of 11 different run steps).
The following has been implemented:
- The wget process has been limited to 5 simultaneous downloads.
- python_cacher and ooops are used.
- The same file on /scratch is accessed by many tasks at once. To avoid this, this file is now copied into local /tmp for access (we manage the load as a function of launcher tasks and not nodes because the number of simultaneously running tasks varies because of varying memory requirements).
- To limit the load on /scratch, we have created a job submission script that limits the number of simultaneously running launcher tasks. It submits a new job only if the number of tasks will remain below a pre-set limit; if there are too many active tasks it waits and tries to submit again a few minutes later (a rough sketch of such a throttle follows this list). We currently have limits of 3000 total tasks (any step) and 1500 tasks for each step (we run several workflows simultaneously). If we can find out which run steps create the heaviest loads, I could put stricter limits on those steps (e.g. 500 tasks or 10 nodes) and relax the limits for steps that are of no concern (e.g. 5000 tasks).
- Striping: we have not implemented this, as our files don't go over 0.3 GB.
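A rough sketch of such a throttle (the script name and limit are placeholders, and the task count here is approximated from squeue's requested-CPU column; the real script counts launcher tasks):

#!/bin/bash
# submit_throttled.bash <slurm_job_file>   (hypothetical helper)
TOTAL_LIMIT=3000                        # maximum active tasks across all steps (placeholder)
JOBFILE=$1
while true; do
    # approximate the number of active tasks by summing the requested CPUs of our queued/running jobs
    active=$(squeue -u "$USER" -h -o "%C" | awk '{s+=$1} END {print s+0}')
    if [ "$active" -lt "$TOTAL_LIMIT" ]; then
        sbatch "$JOBFILE"
        break
    fi
    sleep 300                           # too many active tasks: wait 5 minutes and try again
done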
Example job file:
#! /bin/bash
#SBATCH -J run_04_fullBurst_geo2rdr_0
#SBATCH -A TG-EAR200012
#SBATCH --mail-user=famelung@rsmas.miami.edu
#SBATCH --mail-type=fail
#SBATCH -N 1
#SBATCH -n 48
#SBATCH -o /scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0_%J.o
#SBATCH -e /scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0_%J.e
#SBATCH -p skx-dev
#SBATCH -t 0:06:00
module load launcher
export OMP_NUM_THREADS=4
export PATH=/work/05861/tg851601/stampede2/test/code/rsmas_insar/sources/isce2/contrib/stack/topsStack:$PATH
export LAUNCHER_WORKDIR=/scratch/05861/tg851601/unittestGalapagosSenDT128
export LAUNCHER_PPN=6
export LAUNCHER_JOB_FILE=/scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0
module load python_cacher
export PYTHON_IO_CACHE_CWD=0
module load ooops
### copy infiles to local /tmp and adjust xml files ###
export CDTOOL=/scratch/01255/siliu/collect_distribute
module load intel/19.1.1
export PATH=${PATH}:${CDTOOL}/bin
distribute.bash /scratch/05861/tg851601/unittestGalapagosSenDT128/reference
distribute.bash /scratch/05861/tg851601/unittestGalapagosSenDT128/geom_reference
files="/tmp/*reference/*.xml /tmp/*reference/*/*.xml"
old=/scratch/05861/tg851601/unittestGalapagosSenDT128
srun sed -i "s|$old|/tmp|g" $files
$LAUNCHER_DIR/paramrun
Feb 1
Hi Falk,
1) For your $WORK usage: the problem is that some of your jobs were causing problems by making too many requests to /work with Python. Our system admins have not reported improper usage of the login nodes, but let us keep such jobs running on compute nodes anyway.
Please create a job script for this work with ooops and python_cacher. Point me to the job script and maybe give me a description of what this job will do there.
I am not sure how hard it is to run this program under our account (instead of yours). If it is not too hard, we'd like to run it on our side first.
2) As for your job script "which runs every 30 seconds an sacct process and every 5 minutes ~20 scontrol processes":
This workload should be OK, but if you run such slurm commands every second (or you run them with too many tasks), there could be another problem.
Best wishes, Si
Jan 22
Hi Falk,
FYI, our Frontera/Stampede2 administrators told me that your account is active on both systems at this time.
Please follow the rules/restrictions we discussed earlier.
For the OOOPS issue:
Best wishes, Si Liu TACC HPC
Feb 4
Hi Falk,
1) I had a conversation with our admin team. Your IO jobs have been a big concern to them for a while. They have spent a lot of time and effort fixing the problems caused by your earlier jobs.
Let us try to run your jobs only one at a time to make sure everything is fine.
2) The latest problem is about the high workload on $WORK from Python. I think you should have loaded the python_cacher module.
Please put the OOOPS and python_cacher modules in your job script and run your jobs ONLY on compute nodes. We will monitor these jobs closely for you and see why they still raise a high workload on $WORK. We can start with the one you ran on login nodes earlier.
3) I also checked some running report of the jobs you mentioned earlier (those 3-node launcher jobs).
For these jobs, we can find the system counts here:
http://stats.stampede2.tacc.utexas.edu/machine/job/7184630/mdc/
http://stats.stampede2.tacc.utexas.edu/machine/job/7186458/mdc/
http://stats.stampede2.tacc.utexas.edu/machine/job/7186459/mdc/
There is a significantly high MDC wait (shown in the bottom plots of each report) within the first few minutes of each job: about 50,000+ waits * 3 nodes for each job.
I guess multiple runs in your 3-node launcher jobs are competing for the same input files on $SCRATCH. You do need to distribute the required input files at the beginning for this kind of IO pattern.
4) There are too many tickets for your IO issues. Please just reply to 65018 or 65829 and I am going to close others.
67762 #65018 Stampede2 queue access still suspended internal_wait
67529 Fwd: TACC Consulting #65829 frontera IO work new
67514 TACC Consulting #65829 frontera IO work open
65829 Frontera IO work (tg851601) open
65018 Stampede2 IO work (tg851601) open
Best wishes, Si Liu
KokoxiliBigChunk36SenAT41/run_files/run_08_generate_burst_igram_6_7184630.e
KokoxiliBigChunk39SenAT41/run_files/run_08_generate_burst_igram_7_7186458.e
KokoxiliBigChunk39SenAT41/run_files/run_08_generate_burst_igram_8_7186459.e
Feb 5
Hi Falk,
From my record, the last time our admins blocked you on Stampede2 was because some of your jobs raised high IO requests from Python modules.
1) If you have OOOPS and python_cacher loaded, it should not happen (theoretically). Maybe you had some cases running without python_cacher, so let us retry it to confirm.
2) If you have distributed the input files to /tmp, you should be fine. I do not think our admins blocked your launcher jobs for the number of tasks this time. But we will run and monitor those launcher jobs with you too.
3) If possible, you should consider using /tmp as much as possible. Though OOOPS can protect the system, it may make the program run much longer, as you have already seen. Using /tmp more could help you improve the performance.
Our system admins worry more about the stability of the filesystem. Using OOOPS and python_cacher in your jobs is good enough for them. But for you, the overall cost and performance should also be a concern.
This week, our system administrators are busy with the Texascale runs and the coming maintenance. Please prepare the job script and we will run it with you together next week.
Best wishes, Si Liu
Feb 10:
Hi Falk,
It is not about the test programs.
It looks like you submitted so many runs again. Here is what I can see now.
7265480.bat+ batch tg-ear200+ 96 COMPLETED 0:0
7265480.0 sed tg-ear200+ 96 FAILED 2:0
7265481 run_11_un+ skx-normal tg-ear200+ 192 CANCELLED+ 0:0
7265481.bat+ batch tg-ear200+ 96 CANCELLED 0:15
7265481.0 sed tg-ear200+ 96 FAILED 2:0
7265482 run_11_un+ skx-normal tg-ear200+ 192 CANCELLED+ 0:0
7265482.bat+ batch tg-ear200+ 96 CANCELLED 0:15
7265482.0 sed tg-ear200+ 96 FAILED 2:0
7265483 run_11_un+ skx-normal tg-ear200+ 192 CANCELLED+ 0:0
7265483.bat+ batch tg-ear200+ 96 CANCELLED 0:15
7265483.0 sed tg-ear200+ 96 FAILED 2:0
7265484 run_11_un+ skx-normal tg-ear200+ 192 CANCELLED+ 0:0
7265484.bat+ batch tg-ear200+ 96 CANCELLED 0:15
7265484.0 sed tg-ear200+ 96 FAILED 2:0
7265485 run_11_un+ skx-normal tg-ear200+ 192 CANCELLED+ 0:0
7265485.bat+ batch tg-ear200+ 96 CANCELLED 0:15
7265485.0 sed tg-ear200+ 96 FAILED 2:0
7265532 run_11_un+ skx-normal tg-ear200+ 96 CANCELLED+ 0:0
You submitted so many of them and we did not get any notification about them at all. Our system admins had to kill all of the jobs and work on the filesystem. Your work here put a significantly high workload on $WORK, and it affects not only the Stampede2 and Frontera systems but also many other systems with $WORK mounted.
I persuaded our admins to reactivate your account last time, and they agreed to monitor your jobs one by one to see how you may proceed. We had explicitly asked to test and monitor the jobs one by one, starting with the small tests.
I am not sure what I can do at this time...
Feb 12
Background: one typical workflow consists of ~200 launcher tasks for steps 1-7 and ~1000 launcher tasks for steps 8-11.
1. Questions to address for uninterrupted runs:
2. Actions to evaluate to run without self-imposed job limits:
March 30
Hi Falk,
I just ran several short jobs from your launcher job file. Are those jobs similar?
From what I observed, all IO are on $SCRATCH. There is no IO on $WORK.
1) Your job has intensive IO on $SCRATCH: 11.6K MDS requests in 40 seconds for only one instance. We do need to optimize this part, otherwise you cannot run many instances of your jobs. I will provide you with an example script in a couple of days.
2) Please remove srun from "srun sed" in your script.
-lei
Hi Falk,
I only ran step 8.
1) module load python_cacher
export PYTHON_IO_TargetDir="/scratch/07187/tg864867/codefalk"
2) "cd /tmp" or "cd /dev/shm" before running python scripts.
With these changes, IO requests on $SCRATCH could be down to 20%.
In case python_cacher does not work when you run many instances of python at the same time, you can still copy your python scripts to /tmp.
If you want, you can try using /tmp for output. You can tar the outputs and move the tar file to $SCRATCH at the end. If you do not have many (e.g., hundreds of) output files, it is not necessary to do this. For input files, it may be worth putting them in /tmp if you will use them again and again.
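A sketch of how these suggestions could be combined in a job script (the processing command, output directory, and tar name are placeholders; PYTHON_IO_TargetDir is taken from the note above):

module load python_cacher
export PYTHON_IO_TargetDir="/scratch/07187/tg864867/codefalk"   # cache target directory from the note above

cd /tmp                         # or /dev/shm: run the python work from node-local storage
python process_burst.py        # placeholder for the actual processing command

# optionally collect many small outputs into one tar file and move it back to $SCRATCH once
tar -cf /tmp/outputs.tar -C /tmp outputs
mv /tmp/outputs.tar $SCRATCH/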
-lei
Date: December 4, 2020 at 6:49:54 PM EST Reply-To: help@consult.tacc.utexas.edu
Hi Falk,
That job (6927882) only lasted for 37 seconds. There is not a lot of information we can get from that job.
What is more important, we spent some time measuring the filesystem load. Based on all of the following runs, we suggest you restrict the number and size of the simultaneous run_07 jobs a little bit.
6918285@s2 s2-scratch 85.196 183.509 10385.717 tg851601 run_07_pai 2
6918278@s2 s2-scratch 81.413 172.133 11778.453 tg851601 run_07_pai 2
6918283@s2 s2-scratch 78.643 160.213 11472.952 tg851601 run_07_pai 2
6918284@s2 s2-scratch 68.734 141.068 9988.845 tg851601 run_07_pai 2
6918282@s2 s2-scratch 59.377 138.107 7155.432 tg851601 run_07_pai 2
6918280@s2 s2-scratch 57.908 113.568 8638.874 tg851601 run_07_pai 2
6918286@s2 s2-scratch 50.044 95.118 8542.089 tg851601 run_07_pai 2
6925936@s2 s2-scratch 3.466 20.599 14.457 tg851601 run_06_ove 1
6925937@s2 s2-scratch 3.121 27.089 14.532 tg851601 run_06_ove 1
6925935@s2 s2-scratch 3.071 24.544 13.811 tg851601 run_06_ove 1
6925934@s2 s2-scratch 3.029 24.564 13.254 tg851601 run_06_ove 1
6925893@s2 s2-scratch 0.040 0.007 0.129 tg851601 run_05_ove 1
6925891@s2 s2-scratch 0.038 0.007 0.079 tg851601 run_05_ove 1
6918074@s2 s2-scratch 0.029 22.768 78.487 tg851601 run_07_pai 2
Assuming you are using all cores on each Stampede2 node here, please restrict the number of nodes working on jobs like run_07 to 10 nodes in total (like 5 two-node runs at any time). That should keep the filesystem in a good/stable state.
If you have any other jobs (like run_07) with a similar IO load, please keep them within the 10-node limit (10 nodes for all of your IO-intensive work). Then your workflow should be fine.
Please run at this workload level on Stampede2 with the other settings we talked about before (python_cacher is still necessary; use striping for large files). We will keep monitoring your Stampede2 runs in the following weeks. If you notice anything or have any potentially dangerous job, please feel free to contact us.
Frontera will be under maintenance and running Texascale jobs in the following week. I will ask our system administrators to reactivate your account after the Texascale week.
Best wishes, Si Liu TACC HPC
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The problems on Stampede2/Frontera are "excessive OSS load" and/or "overload of the MDS from the launcher jobs". This means too many IO requests in one or more processing steps. Currently it is unclear whether the problem is caused by read or write requests and in which step the problem occurs. Once identified, it will be easy to modify the workflow so that only a limited number of jobs of the offending processing step run simultaneously.
A potential problem is that in several processing steps the same file is accessed by each launcher task. If 10 run_07 jobs are running simultaneously, 480 tasks (10 jobs x 48 tasks) access the same file (the reference image).
On Frontera the workflow was killed during (likely) run_07 and run_13 making these the likely culprits.
I suspect the workflows were killed on Frontera because more jobs were running simultaneously than on Stampede2.
With TACC's help we need to monitor jobs from all processing steps to identify which step causes the problem.
The following steps have been cleared: run_05 run_06
Cleared? (Si needs to confirm) run_13 run_14 run_15
run_09 (partially cleared - there was no IO. I am not sure whether Si monitored the entire job as they take long)
Problems: run_07
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Nov 30
Hi Si plus team,
Following my yesterday's email, could you please kindly monitor the jobs below? Some of them have not started yet. The first one is the one I suspect is the offending job. Please let me know once you are done. After that I will send you the remaining jobs for monitoring. In total there are 16 different run steps. With these, we will have done 9 of them. Thank you! Falk
I am looking at job 6918066 right now. Roughly, this job is doing <1k WR_MB and <2k RD_MB per second, which is OK.
The number of MDS IO requests occasionally peaks above 50k (that could be a bad thing if you run multiple similar jobs there).
FILESYSTEM MDS/T LOAD1 LOAD5 LOAD15 TASKS OSS/T LOAD1 LOAD5 LOAD15 TASKS NIDS
s2-home 2/2 0.21 0.26 0.29 1281 4/4 0.34 0.56 0.53 2167 5983
s2-scratch 4/4 13.69 10.50 8.90 1436 65/66 45.07 57.85 61.05 2258 6046
JOB FS WR_MB/S RD_MB/S REQS/S OWNER NAME HOSTS
6918066@s2 s2-scratch 629.657 1137.472 52181.687 tg851601 run_07_pai 2
6918066@s2 s2-home 0.000 0.000 0.258 tg851601 run_07_pai 2
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Hi Falk,
Our system administrator told me that they had to block your Frontera access again last night (due to excessive OSS load with MANY jobs running at one time), while we are still working on your Stampede2 workflow...
Could you please tell us what you are working with on Frontera? Is this a different project? If so, please let us know more details about your work.
Could you please also point us to your job script and input/output files?
Best wishes, Si Liu TACC HPC
Hi Falk,
I checked the notes and logs from our system administrators and consultants and here is a quick summary.
1) The issue with login-node downloads (so many wget processes): we have agreed that you will keep the number of simultaneous wget processes below 5 (a small sketch of one way to enforce this follows item 4 below). This should not be a concern anymore.
2) Your programs/scripts work with many Python libraries. Please make sure you have "ml python_cacher" in your job script. That will help relieve the pressure of repeatedly loading those Python libs. This should not be a concern as long as you have the python_cacher module loaded.
3) As for the excessive OSS load: My first guess is that you are working with large files without striping.
Once you run a lot of programs simultaneously with the same file, or once your file size is huge, the same OSS has to handle all of the requests (without striping), and that can trouble the filesystem.
Some of our early notes indicate that you were striping your large files (in mid-July). Did you forget to do it on Frontera?
4) Another thing we need to worry about is the overload of the MDS from the launcher jobs (once you have many jobs raising IO requests in a short period of time, it may be a problem). That is why I also suggested limiting the number of jobs until we believe they are safe.
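One simple way to cap the number of simultaneous wget downloads (item 1 above) is to drive them through xargs with a bounded process count; the URL list file is a placeholder:

xargs -n 1 -P 5 wget -q < urls.txt    # at most 5 wget processes at a time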
Right now, could you take a look at the jobs that were canceled last time and see what exactly they are and what they were doing? If you believe it is run_07, I can take a deep look and maybe try it myself.
Let us try it gently on Stampede2 now. I do not think our Frontera admins will re-enable your account before we make the proper changes there.
Best wishes, Si Liu