dmwm / CRAB2

make remoteGlidein work w/o sleep for HC #881

Closed ericvaandering closed 10 years ago

ericvaandering commented 10 years ago

Original Savannah ticket 100285 reported by belforte on Tue Feb 5 09:25:26 2013.

HC (HammerCloud), which uses ssh to contact the submit host, needs to be able to work without forked sleep processes.

Here's the mail from Ramon:

Hello Stefano,

I will prepare a diff to show the changes more concisely. Basically, our investigation led to the following issue: every crab command was taking exactly as long as the sleep command that crab issues over the master ssh control path to keep the connection open.

I thought that maybe the connection was actually not being multiplexed and the commands were on a queue of 'sleep' - 'command' - 'sleep' - 'command'...

Changing how the ssh master path is launched, with the -N option, allows the daemon to stay open for some time without executing any command, which makes the multiplexing possible (or at least, the wait time is 0).

One might have to check whether the ssh connection times out or stays open forever, and whether the change does the job properly. Take into account that while a crab user issues only a few commands, HammerCloud is monitoring between 450 and 1,000 jobs per test at any given moment.

I will keep looking at this to confirm whether the fix is reliable. At least at this moment the submission rate has improved.

Cheers,

On 30/01/13 19:52, Stefano Belforte wrote:
> Ramon, let's talk about this "off the list" and try to integrate
> what you changed in Crab with the main development.
> I am a bit puzzled atm since I do not really understand
> what you did.
> stefano
>
> On 01/30/2013 07:00 PM, Ramon Medrano Llamas wrote:
>> Hello all,
>>
>> Looks like we have already solved the integration problems we had and HC
>> glidein submission is good now. Here is the last test:
>> http://hammercloud.cern.ch/hc/app/cms/test/8203/
>>
>> To solve this issue, we have modified the CRAB scheduler for remote
>> glidein, in particular, the configuration of the daemon to not execute
>> any command on the master control path. Also, we have removed the sleep
>> 1200, since it is not needed any more. It was, together with the old ssh
>> configuration, not allowing connection multiplexing and was causing,
>> apart from long timeouts, the deadlock of HC processes. Please find
>> attached the new file (I couldn't find any older copy to make a diff,
>> apologies).
>>
>> Now things look very good in my opinion and we'll be ramping up the
>> submission tomorrow.
>>
>> Cheers,
>>
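
For readers following the thread, the difference between the two ways of starting the ssh master can be sketched roughly as below. This is only an illustration, not the code attached to the mail: the function names, the placeholder host and the socket path are assumptions, and the exact options CRAB passed are not shown in the ticket. The ssh flags themselves (-M, -S, -f, -N) are standard OpenSSH options.

```python
def master_argv(user_host, control_path, use_sleep=False, sleep_seconds=1200):
    """Build the argv that starts a backgrounded ssh master connection.

    Hypothetical helper: names and defaults are illustrative only.
    """
    # -M: act as master, -S: control socket path, -f: go to background
    argv = ["ssh", "-M", "-S", control_path, "-f"]
    if use_sleep:
        # Old approach described in the mail: the master runs a remote
        # "sleep" just to keep the connection open.
        return argv + [user_host, "sleep", str(sleep_seconds)]
    # New approach: -N tells ssh not to execute any remote command,
    # so the master only serves multiplexed sessions.
    return argv + ["-N", user_host]


def client_argv(user_host, control_path, remote_command):
    """Build the argv for a command that reuses the master's socket."""
    return ["ssh", "-S", control_path, user_host, remote_command]


if __name__ == "__main__":
    host = "user@submit.example.org"      # placeholder host
    sock = "/tmp/ssh-link-example"        # placeholder socket path
    print(" ".join(master_argv(host, sock, use_sleep=True)))
    print(" ".join(master_argv(host, sock)))
    print(" ".join(client_argv(host, sock, "condor_q")))
```

The helpers only build argument lists, so they can be inspected or logged before anything is handed to `subprocess`.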

ericvaandering commented 10 years ago

Comment by belforte on Wed Feb 6 09:22:40 2013

So in the end I played a bit with this myself and researched the little info I could find on the web on this topic.

Yes, it works fine. But it creates a sort of zombie ssh: a permanent process, on the machine where crab runs, which ssh's to _condor@vocms228.

This is all fine for HC. But I cannot put it there for general users, as I do not want to have hundreds of ssh connections always open on the remote Glidein submission nodes; I need those sockets to close after a while when not used.

But I think you do not want to keep hacking the crab client, so I will look into some (undocumented) option in crab.cfg that enables this persistent socket.

It will not be too soon when we get the latest openssh deployed with the ControlPersist option!
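
As a side note on the lingering master process, OpenSSH lets a client query or stop a master through its control socket with `ssh -O check` and `ssh -O exit`. The sketch below shows how that could be wrapped; the helper names and the injectable `runner` argument are assumptions made for illustration, and nothing in the ticket says CRAB does this.

```python
import subprocess


def control_argv(operation, user_host, control_path):
    """Build an 'ssh -O <operation>' command for an existing master.

    Hypothetical helper; only 'check' and 'exit' are accepted here.
    """
    if operation not in ("check", "exit"):
        raise ValueError("unsupported control operation: %s" % operation)
    return ["ssh", "-S", control_path, "-O", operation, user_host]


def master_is_alive(user_host, control_path, runner=subprocess.call):
    """Return True if 'ssh -O check' exits with status 0.

    'runner' is injectable so the logic can be exercised without a
    real ssh connection.
    """
    return runner(control_argv("check", user_host, control_path)) == 0


def stop_master(user_host, control_path, runner=subprocess.call):
    """Ask the master to exit; returns True on a zero exit status."""
    return runner(control_argv("exit", user_host, control_path)) == 0
```

A client could call `stop_master` once it has finished its commands, so that the connection does not stay open on the submission node.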

ericvaandering commented 10 years ago

Comment by belforte on Thu Feb 7 11:00:05 2013

Added the option ssh_control_persist to the USER section of crab.cfg; it behaves like ControlPersist in ssh_config.

for HC it can be set to ssh_control_persist=yes
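
To illustrate the setting, here is a rough sketch of reading `ssh_control_persist` from the `[USER]` section and turning it into an ssh option. It assumes crab.cfg can be parsed as an INI file and that the value is passed straight through to OpenSSH's `ControlPersist`; the ticket only says the option "behaves like" ControlPersist, so the actual implementation in SchedulerRemoteglidein.py may differ. The function name is hypothetical.

```python
import configparser


def control_persist_options(cfg_text):
    """Return extra ssh options derived from crab.cfg text.

    Assumption: the value is forwarded unchanged to ControlPersist
    (for example "yes" or an idle time accepted by ssh_config).
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # fallback=None covers both a missing section and a missing option
    value = parser.get("USER", "ssh_control_persist", fallback=None)
    if value is None:
        return []
    return ["-o", "ControlPersist=%s" % value.strip()]


if __name__ == "__main__":
    sample_cfg = "[USER]\nssh_control_persist = yes\n"
    print(control_persist_options(sample_cfg))
```

With no `ssh_control_persist` entry the function returns an empty list, which leaves the ssh command line unchanged for general users.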

Tested and committed. It should be OK for HC; more tests are needed before releasing to all users.

/local/reps/CMSSW/COMP/PRODCOMMON/src/python/ProdCommon/BossLite/Scheduler/SchedulerRemoteglidein.py,v <-- SchedulerRemoteglidein.py new revision: 1.23; previous revision: 1.22

/local/reps/CMSSW/COMP/CRAB/python/SchedulerRemoteglidein.py,v <-- SchedulerRemoteglidein.py new revision: 1.15; previous revision: 1.14

/local/reps/CMSSW/COMP/CRAB/python/crab_help.py,v <-- crab_help.py new revision: 1.177; previous revision: 1.176

Since BossLite changed, a new PRODCOMMON tag is also needed: PRODCOMMON_0_12_18_CRAB_53

/local/reps/CMSSW/COMP/CRAB/python/PrepareTarBall.sh,v <-- PrepareTarBall.sh new revision: 1.193; previous revision: 1.192

ericvaandering commented 10 years ago

Closed by belforte on Thu Feb 7 15:13:06 2013

ericvaandering commented 10 years ago

Comment by belforte on Thu Feb 7 15:13:06 2013

Released in CRAB_2_8_5_patch2