cBio / cbio-cluster

MSKCC cBio cluster documentation

Request to copy stageout files failed (redux) #186

Open rgejman opened 9 years ago

rgejman commented 9 years ago

Referencing issue #173, which I can't seem to re-open. I hope this is not a bad portent.

PBS Job Id: 2589028.mskcc-fe1.local
Job Name:   STDIN
Exec host:  gpu-1-12/7
An error has occurred processing your job, see below.
request to copy stageout files failed on node 'gpu-1-12' for job 2589028.mskcc-fe1.local

Unable to copy file /var/spool/torque/spool/2589028.mskcc-fe1.local.OU to /cbio/cslab/projects/weinhold/hlapred/ucec/STDIN.o2589028, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589028.mskcc-fe1.local.OU': No such file or directory
*** end error output

Unable to copy file /var/spool/torque/spool/2589028.mskcc-fe1.local.ER to /cbio/cslab/projects/weinhold/hlapred/ucec/STDIN.e2589028, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589028.mskcc-fe1.local.ER': No such file or directory
*** end error output
rgejman commented 9 years ago

Another one. Could this be related to #182?

PBS Job Id: 2589518.mskcc-fe1.local
Job Name:   STDIN
Exec host:  gpu-1-12/23
An error has occurred processing your job, see below.
request to copy stageout files failed on node 'gpu-1-12' for job 2589518.mskcc-fe1.local

Unable to copy file /var/spool/torque/spool/2589518.mskcc-fe1.local.OU to /cbio/cslab/projects/weinhold/hlapred/lusc/STDIN.o2589518, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589518.mskcc-fe1.local.OU': No such file or directory
*** end error output

Unable to copy file /var/spool/torque/spool/2589518.mskcc-fe1.local.ER to /cbio/cslab/projects/weinhold/hlapred/lusc/STDIN.e2589518, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589518.mskcc-fe1.local.ER': No such file or directory
*** end error output
tatarsky commented 9 years ago

Not us right now. We are not running migrations.

tatarsky commented 9 years ago

Looking for other causes as I can.

tatarsky commented 9 years ago

gpu-1-12 seems extremely loaded down... trying to determine why. It appears in three of the five such messages today; gpu-1-4 is the other node.
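
(For reference, one way to see what Torque thinks of the node, assuming ssh access to the compute nodes; the node name is taken from the messages above:)

pbsnodes gpu-1-12        # node state and status (including loadave) as the server sees it
ssh gpu-1-12 uptime      # load averages reported directly on the node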

rgejman commented 9 years ago

3 more in the past 20 minutes, all on gpu-1-12.

tatarsky commented 9 years ago

You have the highest percentage of these in the logs. Anything interesting about the qsubs I can report to Adaptive? I am looking at the section of the pbs_mom that generates this.

I suspect it's migration-load related, judging from the relative times. I've backed it off for the morning and will update #182 as the day goes on if that changes.

rgejman commented 9 years ago

Not really. The calls are all of the form:

echo "script.sh args1 args2 args3 file1 > fileout.txt" | qsub -V -d . -l nodes=1:ppn=1,mem=1gb,pmem=1gb,walltime=12:00:00

tatarsky commented 9 years ago

Draining gpu-1-12 for inspection, just in case it's node-related. I noted the recent ones from it.

But that will be awhile.

ratsch commented 9 years ago

The option "-d ." might be problematic. Try "-d pwd". Does it make a difference?

akahles commented 9 years ago

I guess you meant "-d `pwd`", but your backticks were eaten by the markdown rendering.
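
In other words, something like this (a sketch of the corrected form, reusing the placeholder script name and resource list from the example above):

echo "script.sh args1 args2 args3 file1 > fileout.txt" | qsub -V -d `pwd` -l nodes=1:ppn=1,mem=1gb,pmem=1gb,walltime=12:00:00

(-d "$(pwd)" is an equivalent, markdown-safe spelling.)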

rgejman commented 9 years ago

I can try this in a future batch of runs. However, I do not think this is the problem, for a few reasons:

  1. Only a very small percentage of jobs fail with this error (<1%)
  2. The directory is correctly set, as evidenced by the path in the error message.
  3. The error is a file access error when trying to copy "/var/spool/torque/spool/$pid.mskcc-fe1.local.OU" to the current directory
  4. If I interpret @tatarsky correctly, it sounds like some other people are having the same errors. Perhaps they are also using the -d . argument... but probably not.

This argues to me that the problem lies with disk access, not with the parameters to the job.
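
(For concreteness, this is the kind of check I mean; the node name and job id are copied from the first error message above, and it assumes ssh access to the node:)

ssh gpu-1-12 ls -l /var/spool/torque/spool/                              # local spool files on the node
ssh gpu-1-12 ls -l /var/spool/torque/spool/2589028.mskcc-fe1.local.OU    # the specific .OU file from the error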

tatarsky commented 9 years ago

Actually, if you don't mind: in about 4-5 hours there will be a period where the migration is at a step where there should be no disk I/O, similar to when we see these. If you could NOT make any changes on your side of the fence during that window, it would be helpful.

I believe (but haven't had time to dig heavily into the code surrounding this error) it's a form of overload manifestation. But I'm not clear why it would "give up" rather than simply wait. So I will look closer, but I am very involved in the migration right now as it's a complex step coming up.

rgejman commented 9 years ago

No problem. The jobs are continuously submitted to the server by a master script so as not to overload the scheduler... if that's still even a problem.
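
(The master script is essentially a throttled submit loop; a minimal sketch of the idea, with a made-up cap of 500 queued jobs and a made-up polling interval, not the actual script:)

# submit commands from joblist.txt, keeping at most ~500 of my jobs in the queue
while read cmd; do
    while [ "$(qstat -u "$USER" 2>/dev/null | grep -c "$USER")" -ge 500 ]; do
        sleep 60   # wait for the queue to drain a bit before submitting more
    done
    echo "$cmd" | qsub -V -d . -l nodes=1:ppn=1,mem=1gb,pmem=1gb,walltime=12:00:00
done < joblist.txt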

tatarsky commented 9 years ago

We can test that another day as well ;)

tatarsky commented 9 years ago

This is just to document that at 5:00PM all migration tasks are idle for a moment. I have some careful steps coming up and am pausing for a moment to review those steps and do my validations of them.

During this time, all I/O is "regular I/O". I expect to return to migration tasks later this evening.

tatarsky commented 9 years ago

Please note I will re-start looking at this shortly, now that at least one very large variable is gone. Not today, but it's back on my run queue.

tatarsky commented 9 years ago

So the primary hits I see on this error now, post-migration, come from the jobs of @nrosed, I believe. It would be interesting to compare notes on those submit scripts. Filesystem I/O appears quite light.

tatarsky commented 9 years ago

The incidence of these is being looked at again, as I see no compelling reason they are filesystem-related. There are periodic bursts, however. I am looking more closely at the code that surrounds this error message.

tatarsky commented 9 years ago

We are looking at this in light of the recent E_NOMEM token incident, to see if the two are related.

akahles commented 9 years ago

Since you are currently looking at this again: I still see it occurring in my logs once in a while:

PBS Job Id: 2910950.mskcc-fe1.local
Job Name:   validate_fastq
Exec host:  gpu-2-6/1
An error has occurred processing your job, see below.
request to copy stageout files failed on node 'gpu-2-6' for job 2910950.mskcc-fe1.local

Unable to copy file /var/spool/torque/spool/2910950.mskcc-fe1.local.OU to /cbio/grlab/projects/ICGC/orig_data/TCGA/validate_fastq/9291fb76-e606-4750-b6da-300108cbacc7/validation.log, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2910950.mskcc-fe1.local.OU': No such file or directory
*** end error output
tatarsky commented 8 years ago

And while I've never figured this out, I may attempt to determine whether there is a setting I can alter for this in #349.

tatarsky commented 8 years ago

So, also from the ancient bug bin: I've been working on trying to figure this one out.

I've traced a single occurrence of this problem on the new Torque system to a Torque function called sys_copy, and what I suspect is that some unexpected exit code from the /bin/cp is being returned in some edge case. Yes, it's an actual /bin/cp at the end, due to the usecp directive we have always had, which moves spool files from the local node spool to the location specified in the job script.
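
(For context, that directive lives in the pbs_mom config, mom_priv/config, on each node; it maps remote destinations to locally mounted paths so the mom does a plain cp instead of an rcp/scp. The line below is only illustrative; the actual host and path prefixes used on this cluster are my assumption:)

$usecp *:/cbio/ /cbio/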

I've submitted > 100000 jobs to calculate Pi to 10000 places and got one of these failures today.
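
(Roughly this kind of stress test, in case anyone wants to reproduce it; the bc invocation and the resource line are my reconstruction, not the exact script:)

# fire off 100000 tiny jobs, each computing pi to 10000 decimal places with bc
for i in $(seq 1 100000); do
    echo 'echo "scale=10000; 4*a(1)" | bc -l > pi.$PBS_JOBID.txt' | \
        qsub -l nodes=1:ppn=1,mem=100mb,walltime=00:30:00
done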

The code involved appears to have a hardwired cap on attempts, and a rather silly sleep line to wait before trying again.

I should note that my jobs and their output areas are NOT in GPFS homedir locations, but in a separate filesystem served via NFS that is a convention of the ROCKS environment. So it's a good control case.

So I have at least a very specific location in the code I can raise with Adaptive again, and I believe I can turn up a debug knob on the new test system (doing so on the real one would be very impacting) and perhaps get the precise exit status code if I issue another 100000 jobs ;)

Windmills. Tilting. Me.

tatarsky commented 8 years ago

Oh, and in case I lose track of where I was: the items in question are in the pbs_mom code at:

sys/resmom/requests.c, around line 3088

tatarsky commented 8 years ago

Here BTW is the loop around the attempts to copy:

for (loop = 1; loop < 4; ++loop)
  {
  /* (do the cp stuff) */

  if ((loop % 2) == 0)
    sleep(loop / 2 * 3 + 1);
  }  /* END for (loop) */

If the loop actually ran four times, that would give sleeps of 0, 4, 0 and 7 seconds between the attempts; as written (loop runs 1 through 3), it amounts to a single 4-second sleep after the second attempt. I wish that was explained, and perhaps sleep(10 * loop) is really the cure ;)

I have no idea why you would use the mod to skip the sleep on the odd-numbered iterations of the loop.
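
(For comparison, the simpler linear backoff alluded to above would look roughly like this in shell terms; a sketch only, with the spool file and destination copied from the first error message in this issue as placeholders:)

# placeholders taken from the error messages above
SRC=/var/spool/torque/spool/2589028.mskcc-fe1.local.OU
DST=/cbio/cslab/projects/weinhold/hlapred/ucec/STDIN.o2589028

for attempt in 1 2 3 4; do
    /bin/cp "$SRC" "$DST" && break   # stop as soon as one copy succeeds
    sleep $((10 * attempt))          # the "sleep(10 * loop)" idea: 10, 20, 30, 40 seconds
done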