Open rgejman opened 9 years ago
Another one. Could this be related to #182?
```
PBS Job Id: 2589518.mskcc-fe1.local
Job Name:   STDIN
Exec host:  gpu-1-12/23
An error has occurred processing your job, see below.
request to copy stageout files failed on node 'gpu-1-12' for job 2589518.mskcc-fe1.local

Unable to copy file /var/spool/torque/spool/2589518.mskcc-fe1.local.OU to /cbio/cslab/projects/weinhold/hlapred/lusc/STDIN.o2589518, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589518.mskcc-fe1.local.OU': No such file or directory
*** end error output

Unable to copy file /var/spool/torque/spool/2589518.mskcc-fe1.local.ER to /cbio/cslab/projects/weinhold/hlapred/lusc/STDIN.e2589518, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589518.mskcc-fe1.local.ER': No such file or directory
*** end error output
```
Not us right now; we are not running migrations. Looking for other causes where I can.
gpu-1-12 seems extremely loaded down... trying to determine why. It appears in three of today's five such messages; gpu-1-4 is the other node.
3 more in the past 20 minutes, all on gpu-1-12.
You have the highest percentage of these in the logs. Anything interesting about the qsubs that I can report to Adaptive? I am looking at the section of pbs_mom that generates this.
I suspect it's migration-load related, judging from the relative times. I've backed it off for the morning and will update #182 as the day goes on if that changes.
Not really. The calls are all of the form:

```shell
echo "script.sh args1 args2 args3 file1 > fileout.txt" | qsub -V -d . -l nodes=1:ppn=1,mem=1gb,pmem=1gb,walltime=12:00:00
```
Draining gpu-1-12 for inspection, just in case it's node-related. I noted the recent ones from it. But that will be a while.
the option `-d .` might be problematic. Try `-d `​`pwd`​``. Does it make a difference?
I guess you meant ``-d `pwd` `` but your backticks were replaced by markdown
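For what it's worth, the practical difference between the two flags: command substitution is expanded by the submitting shell, so ``-d `pwd` `` hands qsub an absolute path, while `-d .` leaves a literal relative path for pbs_mom to resolve later on the execution host. A minimal illustration, no qsub required:

```shell
#!/bin/sh
# Command substitution is expanded by the submitting shell, so qsub
# would receive an absolute path; a literal "." would be left for
# pbs_mom to interpret later on the execution host.
cd /
printf '%s\n' "-d ."        # qsub sees the literal path "."
printf '%s\n' "-d $(pwd)"   # qsub sees "-d /"
```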
I can try this in a future batch of runs. However, I do not think this is the problem for two reasons:

the `-d .` argument... but probably not. This argues to me that the problem lies with disk access, not with the parameters to the job.
Actually, if you don't mind: in about 4-5 hours there will be a period where the migration is at a step with no disk I/O, similar to when we see these. If you could NOT make any changes on your side of the fence during that window, it would be helpful.
I believe (but haven't had time to dig heavily into the code surrounding this error) it's a form of overload manifestation. But I'm not clear why it would "give up" rather than simply wait. So I will look closer, but I am very involved in the migration right now, as a complex step is coming up.
No problem. The jobs are continuously submitted to the server by a master script so as to not overload the scheduler... if that's still even a problem.
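A throttled master submitter of the kind described might look roughly like this. This is a sketch, not the actual script: `count_queued`, the `MAX` value, and the qstat column layout are all assumptions.

```shell
#!/bin/sh
# Sketch of a hypothetical throttled master submitter.
# count_queued reads `qstat -u $USER`-style output on stdin and counts
# jobs whose state column is Q (queued). `|| true` keeps a zero-match
# grep (exit status 1) from aborting under `set -e`.
count_queued() {
  grep -c ' Q ' || true
}

# Illustrative main loop -- commented out since it needs a live qsub:
# MAX=200
# for f in inputs/*; do
#   while [ "$(qstat -u "$USER" | count_queued)" -ge "$MAX" ]; do
#     sleep 30
#   done
#   echo "script.sh $f > $f.out" | qsub -V -d . -l nodes=1:ppn=1,mem=1gb,walltime=12:00:00
# done
```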
We can test that another day as well ;)
This is just to document that at 5:00PM all migration tasks are idle for a moment. I have some careful steps coming up and am pausing for a moment to review those steps and do my validations of them.
During this time, all I/O is "regular I/O". I expect to return to migration tasks later this evening.
Please note I will restart looking at this shortly, now that at least one very large variable is gone. Not today, but it's back on my run queue.
So the primary hit I see on this error now, post-migration, comes from the jobs of @nrosed, I believe. It would be interesting to compare notes on those submit scripts. Filesystem I/O appears quite light.
The incidence of these is being looked at again, as I see no compelling reason these are filesystem-related. There are periodic bursts, however. I am looking more closely at the code that surrounds this error message.
We are looking at this in light of the recent E_NOMEM token incident, to see if the two are related.
Since you are currently looking at this again: I see it still occurring in my logs once in a while:
```
PBS Job Id: 2910950.mskcc-fe1.local
Job Name:   validate_fastq
Exec host:  gpu-2-6/1
An error has occurred processing your job, see below.
request to copy stageout files failed on node 'gpu-2-6' for job 2910950.mskcc-fe1.local

Unable to copy file /var/spool/torque/spool/2910950.mskcc-fe1.local.OU to /cbio/grlab/projects/ICGC/orig_data/TCGA/validate_fastq/9291fb76-e606-4750-b6da-300108cbacc7/validation.log, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2910950.mskcc-fe1.local.OU': No such file or directory
*** end error output
```
And while I've never figured this out, I may attempt to determine if there is a setting I can alter for this in #349.
So, also from the ancient bug bin: I've been working on trying to figure this one out. I've traced a single occurrence of this problem on the new Torque system to a Torque function called `sys_copy`, and determined that what I suspect is an unexpected exit code from `/bin/cp` is being returned in some edge case. Yes, it's an actual `/bin/cp` at the end, due to the `usecp` directive we have always had, which moves spool files from the local node spool to the location specified in the job script.
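For context, `usecp` is set in the mom's config file. A directive of roughly this shape (the paths here are illustrative, not our actual config) tells pbs_mom that a destination prefix is a directly mounted filesystem, so it copies spool files with `/bin/cp` instead of a remote copy like scp/rcp:

```
# mom_priv/config (illustrative)
# Any destination under /cbio is a shared mount on every host,
# so copy spool files with /bin/cp rather than a remote copy:
$usecp *:/cbio /cbio
```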
I've submitted > 100000 jobs to calculate Pi to 10000 places and got one of these failures today.
The code involved appears to have a hardwired 4 attempts, and a rather silly sleep line to wait before trying again.
I should note that my jobs and their output areas are NOT in GPFS homedir locations, but in a separate one via NFS that is a convention of the ROCKS environment. So it's a good control case.
So I have at least a very specific location in the code I can take to Adaptive again, and I believe I can turn up a debug knob on the new test system (it would be very impacting on the real one) and perhaps get the precise exit status code, if I issue another 100000 jobs ;)
Windmills. Tilting. Me.
Oh, and in case I lose where I was: the items in question are in the pbs_mom code at `sys/resmom/requests.c`, around line 3088.
Here, BTW, is the loop around the copy attempts:

```c
for (loop = 1; loop < 4; ++loop)
  {
  /* (do the cp stuff) */

  if ((loop % 2) == 0)
    sleep(loop / 2 * 3 + 1);
  }  /* END for (loop) */
```
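To sanity-check the back-off arithmetic, the same loop can be rendered in shell, printing the would-be sleep after each attempt instead of actually sleeping:

```shell
#!/bin/sh
# Mirror of the pbs_mom retry loop (loop = 1..3, since the condition
# is loop < 4), printing the sleep each iteration would perform.
for loop in 1 2 3; do
  if [ $((loop % 2)) -eq 0 ]; then
    echo "attempt $loop: sleep $((loop / 2 * 3 + 1))s"
  else
    echo "attempt $loop: sleep 0s"
  fi
done
```

Only the even iteration sleeps: the output is 0s, 4s, 0s across the three attempts.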
I believe that results in sleeps of 0, 4 and 0 seconds between the attempts (a fourth iteration, if the loop ran that far, would sleep 7). I wish that were explained; perhaps `sleep(10 * loop);` is really the cure ;)
I have no idea why you would use the mod to skip the sleep on the odd-numbered loop iterations.
Referencing issue #173, which I can't seem to re-open. I hope this is not a bad portent.