Open falkamelung opened 3 years ago
January 5
Hi Falk,
1) Yes, you need to change the stripe count before you create/copy the file. That is how the Lustre file system works.
2) If the files are small (around 0.3 GB) but the same files are reused many times, you are better off just making multiple copies ("striping" does not help much for small files).
For the input file(s), you can just make some copies for each run (or maybe one extra copy per node). One useful trick is to copy the input data file into /tmp before you start the heavy computation (one copy per node). Files kept in /tmp do not cause extra IO load on the shared filesystem and also help performance. Please feel free to let me know if you have any related questions or concerns.
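A minimal sketch of this per-node copy inside a SLURM job script (the input path is a placeholder; either sbcast or a one-task-per-node cp works):

INPUT=/scratch/xxxxx/project/inputfile      # placeholder path to the shared input
sbcast $INPUT /tmp/inputfile                # broadcast one copy to /tmp on every node of the job
# alternatively: srun --ntasks-per-node=1 cp $INPUT /tmp/inputfile
# ... the heavy computation then reads /tmp/inputfile instead of the $SCRATCH copy ...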
Best wishes, Si
Hi Falk,
Your Stampede2 account was blocked again.
It is very likely that you are running parallel programs that read the same files at the same time. The fix should be easy: please "stripe" the data files (mainly the input files) using the "lfs setstripe" command:
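A typical invocation (the stripe count of 8 and the directory path here are only example values; striping has to be set before the files are created or copied):

lfs setstripe -c 8 /scratch/xxxxx/project/inputs    # new files in this directory are spread across 8 OSTs (example count)
lfs getstripe /scratch/xxxxx/project/inputs         # verify the stripe settings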
That should resolve this problem. Please make the proper changes, and I am asking the Stampede2 admins to re-enable your Stampede2 account.
BTW, our Frontera administrators have enabled your Frontera account. Please follow the rules/restrictions we discussed earlier.
Thank you and best wishes, Si Liu
January 11 2021
Hi Falk,
We spent some time going through the system logs to track what raised the problem. Here is a list of jobs that probably caused the heavy IO load on the system. These jobs caused OSS problems (not MDS this time).
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
6719312 skx-normal run_09_f tg851601 PENDING 0:00 6:09:00 1 (Priority)
6739304 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739305 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739306 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739307 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739309 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739310 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739311 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6739321 skx-normal run_09_f tg851601 PENDING 0:00 5:09:00 1 (Priority)
6746003 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6746004 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6746005 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6746006 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6746007 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6746008 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6746009 skx-normal run_13_g tg851601 PENDING 0:00 1:02:00 2 (Priority)
6719311 skx-normal run_09_f tg851601 RUNNING 1:51:53 6:09:00 1 c492-114
6719310 skx-normal run_09_f tg851601 RUNNING 1:52:54 6:09:00 1 c492-104
6719309 skx-normal run_09_f tg851601 RUNNING 2:02:34 6:09:00 1 c497-032
6745872 skx-normal run_06_o tg851601 RUNNING 0:00 1:44:00 1 c506-133
6745871 skx-normal run_06_o tg851601 RUNNING 1:53 1:44:00 1 c506-072
6745869 skx-normal run_06_o tg851601 RUNNING 21:24 1:44:00 1 c506-082
6745870 skx-normal run_06_o tg851601 RUNNING 21:24 1:44:00 1 c506-093
6746002 skx-normal run_13_g tg851601 RUNNING 31:33 1:02:00 2 c478-132,c485-01
What we can see is that some of your launcher jobs, like this one:
/scratch/05861/tg851601/NorthAnatoliaSenAT14/run_files/run_13_generate_burst_igram_0.job
were running 90 parallel file-processing jobs and were driving a high load on our OSS servers. My feeling is that these jobs are frequently reading the same input files. Please take a look at these jobs (run 13) again.
If you have any other questions or concerns, please let me know.
Best wishes, Si Liu
January 20 2021
Hi Falk,
There are two typical IO workload issues.
1) Heavy MDS workload. This is normally caused by high-frequency IO requests (like thousands of file open/close/stat operations in a short period of time). You should use python_cacher and OOOPS to help.
2) Heavy OSS workload. This is normally caused by the same file being loaded by many processes at the same time (like an input file loaded by the binaries running in each task). Make a copy of the file under /tmp on each compute node; that way you have one local copy shared by 48 (or fewer) tasks, instead of one global file shared by all tasks.
I am writing a program for users to distribute/collect files to/from /tmp. Let me send it to you after lunch.
Best wishes, Si Liu
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Hi Falk,
Here is the program to collect/distribute files from/to /tmp. Hopefully, this will help.
A) To set it up:
Please load the latest intel/19 compilers for best performance.
export CDTOOL=/scratch1/01255/siliu/collect_distribute #Frontera
export PATH=${PATH}:${CDTOOL}/bin
B) Distribute your files to /tmp by:
distribute.bash ${CDTOOL}/datafiles/inputfile
distribute.bash ${CDTOOL}/datafiles/inputdir
distribute.bash ${CDTOOL}/datafiles/outputfile
distribute.bash ${CDTOOL}/datafiles/outputdir
C) Collect your file from /tmp by:
collect.bash /tmp/outputdir ${CDTOOL}/datafiles/new_output_collected
collect.bash /tmp/outputfile ${CDTOOL}/datafiles/new_output_collected
This version should work for either file or directory.
1) If the number of files is not large, you can distribute the whole directory directly.
2) We are also thinking about a distribute+untar option; let me try to implement it this week.
This is a new program, so please use it carefully (make sure the files are actually under /tmp after the distribution step). If you notice any further issues, please feel free to let me know.
You may have some issues with "srun tar -xvf /tmp/test.tar" if you have multiple tasks per node. It may run more than once per node.
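One way to make sure the extraction runs only once per node is to launch it with a single task per node, for example:

srun --ntasks-per-node=1 tar -xvf /tmp/test.tar -C /tmp    # one tar per node, unpacking into /tmp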
Best wishes, Si Liu TACC HPC
Jan 21
--- Please make sure that you have the setup we discussed before (ooops, python_cacher, striped large files, etc.).
--- Meanwhile, please copy/distribute the required files to /tmp on each node in advance. This should help you avoid unnecessary OSS load. You can do it either in your "sbcast" way or with the tool I sent you earlier.
I will let you know when our system administrator has an update.
BTW, please try to reply to ticket 65018 for Stampede2 issues and ticket 65829 for Frontera issues in the future (This ticket is the Stampede2 one). It will be easier to track and follow.
My collect-distribute program is built with intel19 for best performance. If you need to run it now, please load intel19 before running the "distribute" program. I may hard-code the intel19 module usage "by force" later (maybe next week).
You should always have "module load ooops" and "module load python_cacher" in your SLURM job script.
What is the current limit you are using for your Stampede2 work? I remember 5 nodes and 10 nodes. What do you plan for Frontera?
Please note: your IO work has caused file system issues multiple times, for various reasons, on both Stampede2 and Frontera. Our system administrators are really worried about the IO work here. Please use the tools we provide and be very careful about the IO work. If you are not sure whether some work is going to be fine, please feel free to work with us before you run it in production.
Initial queue access to Stampede2 and Frontera was blocked because of too many simultaneous wget processes and because we did not use python_cacher. More recent blocks were because of too-heavy MDS and OSS workloads from some of the computing steps (we have a total of 11 different run steps).
The following has been implemented:
- The wget process has been limited to 5 simultaneous downloads.
- python_cacher and ooops are used.
- The same file on /scratch is accessed by many tasks at once. To avoid this, this file is now copied into local /tmp for access (we manage the load as a function of launcher tasks and not nodes because the number of simultaneously running tasks varies because of varying memory requirements).
- To limit the load on /scratch, we have created a job submission script that limits the number of simultaneously running launcher tasks. It submits a new job only if the number of tasks will remain below a pre-set limit; if there are too many active tasks it waits and tries to submit again a few minutes later (a rough sketch of such a throttle follows this list). We currently have limits of 3000 total tasks (any step) and 1500 tasks for each step (we run several workflows simultaneously). If we can find out which run steps create the heaviest loads, I could put stricter limits on those steps (e.g. 500 tasks or 10 nodes) and relax the limits for steps that are of no concern (e.g. 5000 tasks).
- Striping: we have not implemented this, as our files don't go over 0.3 GB.
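A rough sketch of such a throttle (the script name and limit are placeholders, and the task count here is approximated from squeue's requested-CPU column; the real script counts launcher tasks):

#!/bin/bash
# submit_throttled.bash <slurm_job_file>   (hypothetical helper)
TOTAL_LIMIT=3000                        # maximum active tasks across all steps (placeholder)
JOBFILE=$1
while true; do
    # approximate the number of active tasks by summing the requested CPUs of our queued/running jobs
    active=$(squeue -u "$USER" -h -o "%C" | awk '{s+=$1} END {print s+0}')
    if [ "$active" -lt "$TOTAL_LIMIT" ]; then
        sbatch "$JOBFILE"
        break
    fi
    sleep 300                           # too many active tasks: wait 5 minutes and try again
done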
Example job file:
#! /bin/bash
#SBATCH -J run_04_fullBurst_geo2rdr_0
#SBATCH -A TG-EAR200012
#SBATCH --mail-user=famelung@rsmas.miami.edu
#SBATCH --mail-type=fail
#SBATCH -N 1
#SBATCH -n 48
#SBATCH -o /scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0_%J.o
#SBATCH -e /scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0_%J.e
#SBATCH -p skx-dev
#SBATCH -t 0:06:00
module load launcher
export OMP_NUM_THREADS=4
export PATH=/work/05861/tg851601/stampede2/test/code/rsmas_insar/sources/isce2/contrib/stack/topsStack:$PATH
export LAUNCHER_WORKDIR=/scratch/05861/tg851601/unittestGalapagosSenDT128
export LAUNCHER_PPN=6
export LAUNCHER_JOB_FILE=/scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0
module load python_cacher
export PYTHON_IO_CACHE_CWD=0
module load ooops
### copy infiles to local /tmp and adjust xml files ###
export CDTOOL=/scratch/01255/siliu/collect_distribute
module load intel/19.1.1
export PATH=${PATH}:${CDTOOL}/bin
distribute.bash /scratch/05861/tg851601/unittestGalapagosSenDT128/reference
distribute.bash /scratch/05861/tg851601/unittestGalapagosSenDT128/geom_reference
files="/tmp/*reference/*.xml /tmp/*reference/*/*.xml"
old=/scratch/05861/tg851601/unittestGalapagosSenDT128
srun sed -i "s|$old|/tmp|g" $files
$LAUNCHER_DIR/paramrun
Feb 1
Hi Falk,
1) For your $WORK usage: the problem is that some of your jobs were causing problems by making too many requests to /work with Python. Our system admins have not reported improper usage of the login nodes, but let us keep such jobs running on compute nodes anyway.
Please create a job script for this work with ooops and python_cacher. Point me to the job script and maybe give me a description of what this job will do there.
I am not sure how hard it is to run this program under our account (instead of yours). If it is not too hard, we'd like to run it on our side first.
2) As for your job script "which runs every 30 seconds an sacct process and every 5 minutes ~20 scontrol processes":
This workload should be OK, but if you run such slurm commands every second (or you run them with too many tasks), there could be another problem.
Best wishes, Si
Jan 22
Hi Falk,
FYI, our Frontera/Stampede2 administrators told me that your account is active on both systems at this time.
Please follow the rules/restrictions we discussed earlier.
For the OOOPS issue:
Best wishes, Si Liu TACC HPC
Feb 4
Hi Falk,
1) I had a conversation with our admin team. Your IO jobs have been a big concern to them for a while. They have spent a lot of time and effort fixing the problems caused by your earlier jobs.
Let us try to run your jobs only one at a time to make sure everything is fine.
2) The latest problem is about the high workload on $WORK from Python. I think you should have loaded the python_cacher module.
Please put the OOOPS and python_cacher modules in your job script and run your jobs ONLY on compute nodes. We will monitor these jobs closely for you and see why they still raise a high workload on $WORK. We can start with the one you ran on login nodes earlier.
3) I also checked some running report of the jobs you mentioned earlier (those 3-node launcher jobs).
For these jobs, we can find the system counts here:
http://stats.stampede2.tacc.utexas.edu/machine/job/7184630/mdc/
http://stats.stampede2.tacc.utexas.edu/machine/job/7186458/mdc/
http://stats.stampede2.tacc.utexas.edu/machine/job/7186459/mdc/
There is a significantly high MDC wait (shown in the bottom plots of each report) within the first few minutes of each job: about 50,000+ waits * 3 nodes for each job.
I guess multiple runs in your 3-node launcher jobs are competing for the same input files on $SCRATCH. You do need to distribute the required input files at the beginning for this kind of IO pattern.
4) There are too many tickets for your IO issues. Please just reply to 65018 or 65829 and I am going to close others.
67762 #65018 Stampede2 queue access still suspended internal_wait
67529 Fwd: TACC Consulting #65829 frontera IO work new
67514 TACC Consulting #65829 frontera IO work open
65829 Frontera IO work (tg851601) open
65018 Stampede2 IO work (tg851601) open
Best wishes, Si Liu
KokoxiliBigChunk36SenAT41/run_files/run_08_generate_burst_igram_6_7184630.e
KokoxiliBigChunk39SenAT41/run_files/run_08_generate_burst_igram_7_7186458.e
KokoxiliBigChunk39SenAT41/run_files/run_08_generate_burst_igram_8_7186459.e
Feb 5
Hi Falk,
From my record, the last time our admins blocked you on Stampede2 was because some of your jobs raised high IO requests from Python modules.
1) If you have OOOPS and python_cacher loaded, it should not happen (theoretically). Maybe you had some cases running without python_cacher, so let us retry it to confirm.
2) If you have distributed the input files to /tmp, you should be fine. I do not think our admins blocked your launcher jobs for the number of tasks this time. But we will run and monitor those launcher jobs with you too.
3) If possible, you should consider using /tmp as much as possible. Though OOOPS can protect the system, it may make the program run much longer, as you have already seen. Using /tmp more could help you improve the performance.
Our system admins worry more about the stability of the filesystem. Using OOOPS and python_cacher in your jobs is good enough for them. But for you, the overall cost and performance should also be a concern.
This week, our system administrators are busy with the Texascale runs and the coming maintenance. Please prepare the job script and we will run it with you together next week.
Best wishes, Si Liu
Feb 10:
Hi Falk,
It is not about the test programs.
It looks like you submitted so many runs again. Here is what I can see now.
7265480.bat+ batch tg-ear200+ 96 COMPLETED 0:0
7265480.0 sed tg-ear200+ 96 FAILED 2:0
7265481 run_11_un+ skx-normal tg-ear200+ 192 CANCELLED+ 0:0
7265481.bat+ batch tg-ear200+ 96 CANCELLED 0:15
7265481.0 sed tg-ear200+ 96 FAILED 2:0
7265482 run_11_un+ skx-normal tg-ear200+ 192 CANCELLED+ 0:0
7265482.bat+ batch tg-ear200+ 96 CANCELLED 0:15
7265482.0 sed tg-ear200+ 96 FAILED 2:0
7265483 run_11_un+ skx-normal tg-ear200+ 192 CANCELLED+ 0:0
7265483.bat+ batch tg-ear200+ 96 CANCELLED 0:15
7265483.0 sed tg-ear200+ 96 FAILED 2:0
7265484 run_11_un+ skx-normal tg-ear200+ 192 CANCELLED+ 0:0
7265484.bat+ batch tg-ear200+ 96 CANCELLED 0:15
7265484.0 sed tg-ear200+ 96 FAILED 2:0
7265485 run_11_un+ skx-normal tg-ear200+ 192 CANCELLED+ 0:0
7265485.bat+ batch tg-ear200+ 96 CANCELLED 0:15
7265485.0 sed tg-ear200+ 96 FAILED 2:0
7265532 run_11_un+ skx-normal tg-ear200+ 96 CANCELLED+ 0:0
You submitted so many of them and we did not get any notification about them at all. Our system admins had to kill all of the jobs and work on the filesystem. Your work here put a significantly high workload on $WORK, and it affects not only the Stampede2 and Frontera systems but also many other systems with $WORK mounted.
I persuaded our admins to reactivate your account last time, and they agreed to monitor your jobs one by one to see how you may proceed. We had explicitly asked to test and monitor the jobs one by one, starting with the small tests.
I am not sure what I can do at this time...
Feb 12
Background: one typical workflow consists of ~200 launcher tasks for steps 1-7 and ~1000 launcher tasks for steps 8-11.
1. Questions to address for uninterrupted runs:
2. Actions to evaluate to run without self-imposed job limits:
March 30
Hi Falk,
I just ran several short jobs from your launcher job file. Are those jobs similar?
From what I observed, all IO are on $SCRATCH. There is no IO on $WORK.
1) Your job has intensive IO on $SCRATCH: 11.6K MDS requests in 40 seconds for only one instance. We do need to optimize this part, otherwise you cannot run many instances of your jobs. I will provide you with an example script in a couple of days.
2) Please remove srun from "srun sed" in your script.
-lei
Hi Falk,
I only ran step 8.
1) module load python_cacher
export PYTHON_IO_TargetDir="/scratch/07187/tg864867/codefalk"
2) "cd /tmp" or "cd /dev/shm" before running python scripts.
With these changes, IO requests on $SCRATCH could be down to 20%.
In case python_cacher does not work when you run many instances of python at the same time, you can still copy your python scripts to /tmp.
If you want, you can try using /tmp for output. You can tar the outputs and move the tar file to $SCRATCH at the end. If you do not have many (e.g., hundreds of) output files, it is not necessary to do this. For input files, it may be worth putting them in /tmp if you will use them again and again.
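A sketch of how these suggestions could be combined in a job script (the processing command, output directory, and tar name are placeholders; PYTHON_IO_TargetDir is taken from the note above):

module load python_cacher
export PYTHON_IO_TargetDir="/scratch/07187/tg864867/codefalk"   # cache target directory from the note above

cd /tmp                         # or /dev/shm: run the python work from node-local storage
python process_burst.py        # placeholder for the actual processing command

# optionally collect many small outputs into one tar file and move it back to $SCRATCH once
tar -cf /tmp/outputs.tar -C /tmp outputs
mv /tmp/outputs.tar $SCRATCH/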
-lei
Date: December 4, 2020 at 6:49:54 PM EST Reply-To: help@consult.tacc.utexas.edu
Hi Falk,
That job (6927882) only lasted for 37 seconds. There is not a lot of information we can get from that job.
What is more important, we spent some time measuring the filesystem load. Based on all of the following runs, we suggest you restrict the number and size of the simultaneous run_07 jobs a little bit.
6918285@s2 s2-scratch 85.196 183.509 10385.717 tg851601 run_07_pai 2
6918278@s2 s2-scratch 81.413 172.133 11778.453 tg851601 run_07_pai 2
6918283@s2 s2-scratch 78.643 160.213 11472.952 tg851601 run_07_pai 2
6918284@s2 s2-scratch 68.734 141.068 9988.845 tg851601 run_07_pai 2
6918282@s2 s2-scratch 59.377 138.107 7155.432 tg851601 run_07_pai 2
6918280@s2 s2-scratch 57.908 113.568 8638.874 tg851601 run_07_pai 2
6918286@s2 s2-scratch 50.044 95.118 8542.089 tg851601 run_07_pai 2
6925936@s2 s2-scratch 3.466 20.599 14.457 tg851601 run_06_ove 1
6925937@s2 s2-scratch 3.121 27.089 14.532 tg851601 run_06_ove 1
6925935@s2 s2-scratch 3.071 24.544 13.811 tg851601 run_06_ove 1
6925934@s2 s2-scratch 3.029 24.564 13.254 tg851601 run_06_ove 1
6925893@s2 s2-scratch 0.040 0.007 0.129 tg851601 run_05_ove 1
6925891@s2 s2-scratch 0.038 0.007 0.079 tg851601 run_05_ove 1
6918074@s2 s2-scratch 0.029 22.768 78.487 tg851601 run_07_pai 2
Assuming you are using all cores on each Stampede2 node here, please restrict the number of nodes working on jobs like run_07 to 10 nodes in total (like 5 two-node runs at any time). That should keep the filesystem in a good/stable state.
If you have any other jobs (like run_07) with a similar IO load, please keep them within the 10-node limit (10 nodes for all of your IO-intensive work). Then your workflow should be fine.
Please run at this workload level on Stampede2 with the other settings we talked about before (python_cacher is still necessary; use striping for large files). We will keep monitoring your Stampede2 runs in the following weeks. If you notice anything or have any potentially dangerous job, please feel free to contact us.
Frontera will be under maintenance and running Texascale jobs in the following week. I will ask our system administrators to reactivate your account after the Texascale week.
Best wishes, Si Liu TACC HPC
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The problems on Stampede2/Frontera are "excessive OSS load" and/or "overload of the MDS from the launcher jobs". This means too many IO requests in one or more processing steps. Currently it is unclear whether the problem is caused by read or write requests and in which step the problem occurs. Once identified, it will be easy to modify the workflow so that only a limited number of jobs of the offending processing step run simultaneously.
A potential problem is that in several processing steps the same file is accessed by each launcher task. If 10 run_07 jobs are running simultaneously, 480 tasks (10 jobs x 48 tasks) access the same file (the reference image).
On Frontera the workflow was killed during (likely) run_07 and run_13 making these the likely culprits.
I suspect the workflows were killed on Frontera because more jobs were running simultaneously than on Stampede2.
With TACC's help we need to monitor jobs from all processing steps to identify which step causes the problem.
The following steps have been cleared: run_05 run_06
Cleared? (Si needs to confirm) run_13 run_14 run_15
run_09 (partially cleared - there was no IO. I am not sure whether Si monitored the entire job as they take long)
Problems: run_07
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Nov 30
Hi Si plus team,
Following my yesterday's email, could you please kindly monitor the jobs below? Some of them have not started yet. The first one is the one I suspect is the offending job. Please let me know once you are done. After that I will send you the remaining jobs for monitoring. In total there are 16 different run steps. With these, we will have done 9 of them. Thank you! Falk
I am looking at job 6918066 right now. Roughly, this job is doing <1k WR_MB and <2k RD_MB per second, which is OK.
The number of MDS IO requests occasionally peaks above 50k (that could be a bad thing if you run multiple similar jobs there).
FILESYSTEM MDS/T LOAD1 LOAD5 LOAD15 TASKS OSS/T LOAD1 LOAD5 LOAD15 TASKS NIDS
s2-home 2/2 0.21 0.26 0.29 1281 4/4 0.34 0.56 0.53 2167 5983
s2-scratch 4/4 13.69 10.50 8.90 1436 65/66 45.07 57.85 61.05 2258 6046
JOB FS WR_MB/S RD_MB/S REQS/S OWNER NAME HOSTS
6918066@s2 s2-scratch 629.657 1137.472 52181.687 tg851601 run_07_pai 2
6918066@s2 s2-home 0.000 0.000 0.258 tg851601 run_07_pai 2
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Hi Falk,
Our system administrator told me that they had to block your Frontera access again last night (due to excessive OSS load with MANY jobs running at one time), while we are still working on your Stampede2 workflow...
Could you please tell us what you are working with on Frontera? Is this a different project? If so, please let us know more details about your work.
Could you please also point us to your job script and input/output files?
Best wishes, Si Liu TACC HPC
Hi Falk,
I checked the notes and logs from our system administrators and consultants and here is a quick summary.
1) The issue with login-node downloads (so many wget processes): we have agreed that you will keep the number of simultaneous wget processes below 5 (a small sketch of one way to enforce this follows item 4 below). This should not be a concern anymore.
2) Your programs/scripts work with many Python libraries. Please make sure you have "ml python_cacher" in your job script. That will help relieve the pressure of repeatedly loading those Python libs. This should not be a concern as long as you have the python_cacher module loaded.
3) As for the excessive OSS load: My first guess is that you are working with large files without striping.
Once you run a lot of programs simultaneously with the same file, or once your file size is huge, the same OSS has to handle all of the requests (without striping), and that can trouble the filesystem.
Some of our early notes indicate that you were striping your large files (in mid-July). Did you forget to do it on Frontera?
4) Another thing we need to worry about is the overload of the MDS from the launcher jobs (once you have many jobs raising IO requests in a short period of time, it may be a problem). That is why I also suggested limiting the number of jobs until we believe they are safe.
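One simple way to cap the number of simultaneous wget downloads (item 1 above) is to drive them through xargs with a bounded process count; the URL list file is a placeholder:

xargs -n 1 -P 5 wget -q < urls.txt    # at most 5 wget processes at a time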
Right now, could you take a look at the jobs that were canceled last time and see what exactly they are and what they were doing? If you believe it is run_07, I can take a deep look and maybe try it myself.
Let us try it gently on Stampede2 now. I do not think our Frontera admins will re-enable your account before we make the proper changes there.
Best wishes, Si Liu