Closed ericvaandering closed 10 years ago
Comment by belforte on Wed Feb 6 09:22:40 2013
So in the end I played a bit with this myself and researched the little info I could find on the web on this topic.
Yes, it works fine. But it creates a sort of zombie ssh: a permanent process on the machine where crab runs, which ssh's to _condor@vocms228.
This is all fine for HC. But I cannot put it there for general users, as I do not want hundreds of ssh connections always open on the remote Glidein submission nodes; I need those sockets to close after a while when not in use.
But I think you do not want to keep hacking the crab client, so I will look into some (undocumented) option in crab.cfg that enables this persistent socket.
Will not be too soon if we get the latest openssh deployed with the ControlPersist option!
Comment by belforte on Thu Feb 7 11:00:05 2013
added to the crab.cfg USER section the option ssh_control_persist, which behaves like ControlPersist in ssh_config
for HC it can be set to ssh_control_persist=yes
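As an illustration only, here is a minimal sketch of how such an option could be read and turned into an ssh argument. This is not the code committed to SchedulerRemoteglidein.py: the function name control_persist_args and the fallback value "3600" are assumptions made for the example, and it uses Python 3's configparser. Only the section name USER and the option name ssh_control_persist come from this ticket.

```python
import configparser

# Hypothetical crab.cfg fragment; only the section and option names
# are taken from the ticket.
CRAB_CFG = """
[USER]
ssh_control_persist = yes
"""


def control_persist_args(cfg_text, default="3600"):
    """Return ssh '-o' arguments for ControlPersist.

    'default' is an assumed fallback (seconds) used when the option is
    not set; the real default in crab is not stated in this ticket.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    value = parser.get("USER", "ssh_control_persist", fallback=default)
    # ControlPersist accepts yes/no or a timeout, as in ssh_config.
    return ["-o", "ControlPersist=%s" % value]


print(control_persist_args(CRAB_CFG))
```

With the fragment above, the master connection would be asked to persist indefinitely (the HC case); a numeric value would instead let idle sockets close after that many seconds.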
tested and committed, should be OK for HC, more tests needed before releasing to all users
/local/reps/CMSSW/COMP/PRODCOMMON/src/python/ProdCommon/BossLite/Scheduler/SchedulerRemoteglidein.py,v <-- SchedulerRemoteglidein.py new revision: 1.23; previous revision: 1.22
/local/reps/CMSSW/COMP/CRAB/python/SchedulerRemoteglidein.py,v <-- SchedulerRemoteglidein.py new revision: 1.15; previous revision: 1.14
/local/reps/CMSSW/COMP/CRAB/python/crab_help.py,v <-- crab_help.py new revision: 1.177; previous revision: 1.176
since BossLite changed, a new PRODCOMMON tag is also needed: PRODCOMMON_0_12_18_CRAB_53
/local/reps/CMSSW/COMP/CRAB/python/PrepareTarBall.sh,v <-- PrepareTarBall.sh new revision: 1.193; previous revision: 1.192
Closed by belforte on Thu Feb 7 15:13:06 2013
Comment by belforte on Thu Feb 7 15:13:06 2013
released in CRAB_2_8_5_patch2
Original Savannah ticket 100285 reported by belforte on Tue Feb 5 09:25:26 2013.
HC, which uses ssh to contact the submit host, needs to be able to do without forked sleep processes
here's mail from Ramon:
Hello Stefano,
I will prepare a diff to show the changes more concisely. Basically, our investigation led to the following finding: all the crab commands were taking exactly as long as the sleep command that crab was issuing over the master ssh control path to keep the connection open.
I thought that maybe the connection was actually not being multiplexed and the commands were on a queue of 'sleep' - 'command' - 'sleep' - 'command'...
Changing how the ssh master path is launched, with the -N option, allows the daemon to stay open for some time without executing any command, which makes multiplexing possible (or at least, the wait time is 0).
One might have to check whether the ssh connection times out or stays open forever, and whether the change does the job properly. Take into account that while a crab user issues few commands, HammerCloud is monitoring between 450 and 1,000 jobs per test at any given moment.
I will be looking at this to confirm whether the fix is reliable. At least at this moment the submission rate has improved.
Cheers,
On 30/01/13 19:52, Stefano Belforte wrote:
> Ramon, let's talk about this "off the list" and try to integrate
> what you changed in Crab with the main development.
> I am a bit puzzled atm since I do not really understand
> what you did.
> stefano
>
> On 01/30/2013 07:00 PM, Ramon Medrano Llamas wrote:
>> Hello all,
>>
>> Looks like we have already solved the integration problems we had and HC
>> glidein submission is good now. Here is the last test:
>> http://hammercloud.cern.ch/hc/app/cms/test/8203/
>>
>> To solve this issue, we have modified the CRAB scheduler for remote
>> glidein, in particular, the configuration of the daemon to not execute
>> any command on the master control path. Also, we have removed the sleep
>> 1200, since it is not needed any more. It was, together with the old ssh
>> configuration, not allowing connection multiplexing and was causing,
>> apart from long timeouts, the deadlock of HC processes. Please find
>> attached the new file (I couldn't find any older copy to make a diff,
>> apologies).
>>
>> Now things look very good in my opinion and we'll be ramping up the
>> submission tomorrow.
>>
>> Cheers,
>>
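For readers following the thread, the difference between the two ways of starting the ssh master described in Ramon's mail can be sketched as below. This is only an illustration under assumptions, not the attached file: the function names, the host user@submit-host, the control path, and the remote command are placeholders, and the use of -M, -S and -f alongside -N is an assumption about how the master is launched (the mail itself only mentions -N and the removed sleep 1200). The commands are built as argument lists and not executed.

```python
def master_command(host, control_path, use_sleep=False, sleep_seconds=1200):
    """Build the argv that starts the ssh master connection."""
    argv = ["ssh", "-M", "-S", control_path]
    if use_sleep:
        # Old approach: keep the master alive by running a remote sleep.
        return argv + [host, "sleep", str(sleep_seconds)]
    # New approach: -N runs no remote command, -f puts ssh in the background.
    return argv + ["-N", "-f", host]


def client_command(host, control_path, remote_cmd):
    """Build the argv for a command multiplexed over the master socket."""
    return ["ssh", "-S", control_path, host] + list(remote_cmd)


# Placeholder values, for illustration only.
HOST = "user@submit-host"
SOCKET = "/tmp/ssh-master-example"

print(" ".join(master_command(HOST, SOCKET, use_sleep=True)))
print(" ".join(master_command(HOST, SOCKET)))
print(" ".join(client_command(HOST, SOCKET, ["condor_q"])))
```

How long a master started with -N stays open is then governed by the ssh configuration (for example ControlPersist, where available) rather than by the duration of a remote sleep.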