dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

JobSubmitter silently dying #8190

Open scarletnorberg opened 7 years ago

scarletnorberg commented 7 years ago

I think it is submitting to the wrong sites or something. Here is the error:

2017-09-24 12:17:44,745:139752376227584:INFO:JobSubmitterPoller:Have 27 packages to submit.
2017-09-24 12:17:44,745:139752376227584:INFO:JobSubmitterPoller:Have 1000 jobs to submit.
2017-09-24 12:17:44,745:139752376227584:INFO:JobSubmitterPoller:Done assigning site locations.
2017-09-24 12:18:57,307:139752376227584:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-09-24 12:18:58,272:139752376227584:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 1000/0.
2017-09-24 12:19:01,983:139752376227584:INFO:DashboardReporter:Handling 1000 jobs
2017-09-24 12:41:32,861:139752376227584:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-09-24 12:44:10,772:139752376227584:INFO:JobSubmitterPoller:Refreshing priority cache with currently 189379 jobs
2017-09-24 12:44:10,858:139752376227584:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (189379 job in cache)
2017-09-24 12:44:10,858:139752376227584:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-09-24 12:44:22,684:139752376227584:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-09-24 12:55:29,685:139752376227584:INFO:JobSubmitterPoller:Site submission report: {'T1_IT_CNAF': Counter({'submitted': 2}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 101}), 'T2_IT_Rome': Counter({'NoPendingSlot': 1}), 'T1_ES_PIC': Counter({'NoPendingSlot': 165}), 'T1_FR_CCIN2P3': Counter({'submitted': 1}), 'T2_US_Florida': Counter({'NoPendingSlot': 93}), 'T2_FR_GRIF_IRFU': Counter({'NoPendingSlot': 96}), 'T2_US_UCSD': Counter({'submitted': 11}), 'T2_US_Purdue': Counter({'NoPendingSlot': 87}), 'T2_US_MIT': Counter({'submitted': 663}), 'T2_UK_SGrid_RALPP': Counter({'submitted': 15}), 'T2_FR_GRIF_LLR': Counter({'submitted': 67, 'NoPendingSlot': 29}), 'T2_UK_London_IC': Counter({'NoPendingSlot': 12}), 'T2_IT_Legnaro': Counter({'NoPendingSlot': 19}), 'T2_IT_Bari': Counter({'NoPendingSlot': 53}), 'T2_CH_CERN_HLT': Counter({'NoPendingSlot': 74}), 'T2_ES_CIEMAT': Counter({'submitted': 38}), 'T2_FR_IPHC': Counter({'NoPendingSlot': 90}), 'T1_RU_JINR': Counter({'submitted': 99}), 'T2_US_Wisconsin': Counter({'submitted': 16}), 'T2_DE_DESY': Counter({'submitted': 8}), 'T1_UK_RAL': Counter({'NoPendingSlot': 150}), 'T1_US_FNAL': Counter({'NoPendingSlot': 985}), 'T2_US_Nebraska': Counter({'NoPendingSlot': 74}), 'T2_DE_RWTH': Counter({'NoPendingSlot': 13, 'submitted': 2}), 'T2_US_Caltech': Counter({'submitted': 5}), 'T2_UK_London_Brunel': Counter({'NoPendingSlot': 37}), 'T2_BE_IIHE': Counter({'NoTaskSlot': 110, 'submitted': 3}), 'T1_DE_KIT': Counter({'NoPendingSlot': 44}), 'T2_IT_Pisa': Counter({'NoPendingSlot': 45}), 'T2_CH_CERN': Counter({'submitted': 70, 'NoPendingSlot': 29})}
2017-09-24 12:55:29,686:139752376227584:INFO:JobSubmitterPoller:Priority submission report: {50110000.0: Counter({'Total': 403}), 30063000.0: Counter({'Total': 13}), 50085000.0: Counter({'Total': 1002}), 50090000.0: Counter({'Total': 2}), 50063000.0: Counter({'Total': 121}), 30085000.0: Counter({'Total': 42}), 110000.0: Counter({'Total': 2}), 85000.0: Counter({'Total': 1200, 'submitted': 895}), 63000.0: Counter({'Total': 340, 'submitted': 105})}
2017-09-24 12:55:29,687:139752376227584:INFO:JobSubmitterPoller:Have 20 packages to submit.
2017-09-24 12:55:29,927:139752376227584:INFO:JobSubmitterPoller:Have 1000 jobs to submit.
2017-09-24 12:55:29,927:139752376227584:INFO:JobSubmitterPoller:Done assigning site locations.
2017-09-24 12:56:31,284:139752376227584:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-09-24 13:33:40,332:139752376227584:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 1000/0.
2017-09-24 13:33:44,448:139752376227584:INFO:DashboardReporter:Handling 1000 jobs
2017-09-24 13:58:54,232:139752376227584:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-09-24 14:01:05,899:139752376227584:INFO:JobSubmitterPoller:Refreshing priority cache with currently 188379 jobs
2017-09-24 14:01:05,945:139752376227584:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (188379 job in cache)
2017-09-24 14:01:05,945:139752376227584:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-09-24 14:01:05,950:139752376227584:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-09-24 14:03:09,202:139752376227584:INFO:JobSubmitterPoller:Site submission report: {u'T1_IT_CNAF': Counter({'submitted': 12}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 61, 'submitted': 42}), 'T2_CH_CERN_HLT': Counter({'submitted': 60}), 'T1_ES_PIC': Counter({'submitted': 167}), 'T1_FR_CCIN2P3': Counter({'submitted': 20}), 'T2_DE_RWTH': Counter({'submitted': 10, 'NoPendingSlot': 8}), 'T2_US_Florida': Counter({'submitted': 115}), 'T2_FR_GRIF_IRFU': Counter({'submitted': 28, 'NoPendingSlot': 4}), 'T2_US_UCSD': Counter({'submitted': 2}), 'T2_US_Purdue': Counter({'NoPendingSlot': 86, 'submitted': 32}), 'T0_CH_CERN': Counter({'submitted': 7}), 'T2_UK_London_IC': Counter({'submitted': 4}), 'T2_FR_GRIF_LLR': Counter({'submitted': 4}), 'T2_US_Nebraska': Counter({'submitted': 58, 'NoPendingSlot': 20}), 'T2_BE_UCL': Counter({'submitted': 5}), 'T2_IT_Legnaro': Counter({'submitted': 25, 'NoPendingSlot': 4}), 'T2_IT_Bari': Counter({'NoPendingSlot': 54, 'submitted': 1}), 'T2_ES_CIEMAT': Counter({'submitted': 3}), 'T2_FR_IPHC': Counter({'submitted': 59, 'NoPendingSlot': 34}), 'T1_RU_JINR': Counter({'submitted': 9}), 'T2_US_Wisconsin': Counter({'submitted': 6}), 'T2_DE_DESY': Counter({'submitted': 5}), 'T1_UK_RAL': Counter({'submitted': 88, 'NoPendingSlot': 62}), 'T1_US_FNAL': Counter({'NoPendingSlot': 869, 'submitted': 122}), 'T2_US_MIT': Counter({'submitted': 4}), 'T2_BE_IIHE': Counter({'NoTaskSlot': 110}), 'T2_IT_Rome': Counter({'NoPendingSlot': 1}), 'T2_UK_London_Brunel': Counter({'NoPendingSlot': 31, 'submitted': 6}), 'T2_CH_CERN': Counter({'submitted': 25}), 'T1_DE_KIT': Counter({'submitted': 29, 'NoPendingSlot': 17}), 'T2_IT_Pisa': Counter({'submitted': 52})}
2017-09-24 14:03:09,204:139752376227584:INFO:JobSubmitterPoller:Priority submission report: {50110000.0: Counter({'Total': 403, 'submitted': 224}), 30063000.0: Counter({'Total': 13, 'submitted': 10}), 50085000.0: Counter({'Total': 1002, 'submitted': 133}), 50090000.0: Counter({'Total': 2}), 50063000.0: Counter({'Total': 121, 'submitted': 66}), 30085000.0: Counter({'Total': 42, 'submitted': 36}), 110000.0: Counter({'Total': 2, 'submitted': 2}), 85000.0: Counter({'Total': 305, 'submitted': 252}), 63000.0: Counter({'Total': 467, 'submitted': 277})}
2017-09-24 14:03:09,220:139752376227584:INFO:JobSubmitterPoller:Have 62 packages to submit.
2017-09-24 14:03:09,220:139752376227584:INFO:JobSubmitterPoller:Have 1000 jobs to submit.
2017-09-24 14:03:09,221:139752376227584:INFO:JobSubmitterPoller:Done assigning site locations.
2017-09-24 14:04:37,596:139752376227584:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-09-24 14:04:38,909:139752376227584:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 1000/0.
2017-09-24 14:22:32,267:139752376227584:INFO:DashboardReporter:Handling 1000 jobs
2017-09-24 14:24:44,116:139752376227584:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-09-24 14:26:48,994:139752376227584:INFO:JobSubmitterPoller:Refreshing priority cache with currently 187379 jobs
2017-09-24 14:26:49,103:139752376227584:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (187379 job in cache)
2017-09-24 14:26:49,104:139752376227584:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-09-24 14:26:49,107:139752376227584:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-09-24 14:26:54,331:139752376227584:INFO:JobSubmitterPoller:Site submission report: {u'T1_IT_CNAF': Counter({'submitted': 883, 'NoPendingSlot': 107}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 61}), 'T2_IT_Rome': Counter({'NoPendingSlot': 1}), 'T1_FR_CCIN2P3': Counter({'submitted': 1}), 'T2_DE_RWTH': Counter({'NoPendingSlot': 44}), 'T2_US_Purdue': Counter({'NoPendingSlot': 122}), u'T2_FR_GRIF_LLR': Counter({'NoPendingSlot': 36}), 'T2_US_Nebraska': Counter({'NoPendingSlot': 20}), 'T1_DE_KIT': Counter({'NoPendingSlot': 17}), 'T2_IT_Bari': Counter({'NoPendingSlot': 54}), 'T2_FR_IPHC': Counter({'submitted': 36, 'NoPendingSlot': 34}), u'T1_ES_PIC': Counter({'submitted': 71, 'NoPendingSlot': 36}), 'T1_UK_RAL': Counter({'NoPendingSlot': 62}), 'T1_US_FNAL': Counter({'NoPendingSlot': 869}), 'T2_IT_Legnaro': Counter({'NoPendingSlot': 4}), 'T2_CH_CERN_HLT': Counter({'submitted': 9}), 'T2_UK_London_Brunel': Counter({'NoPendingSlot': 31}), 'T2_BE_IIHE': Counter({'NoTaskSlot': 110})}
2017-09-24 14:26:54,332:139752376227584:INFO:JobSubmitterPoller:Priority submission report: {50110000.0: Counter({'Total': 179}), 30063000.0: Counter({'Total': 3}), 50085000.0: Counter({'Total': 869}), 50090000.0: Counter({'Total': 2}), 50063000.0: Counter({'Total': 55}), 30085000.0: Counter({'Total': 6}), 85000.0: Counter({'Total': 53}), 63000.0: Counter({'Total': 1190, 'submitted': 1000})}
2017-09-24 14:26:54,332:139752376227584:INFO:JobSubmitterPoller:Have 11 packages to submit.
2017-09-24 14:26:54,333:139752376227584:INFO:JobSubmitterPoller:Have 1000 jobs to submit.
2017-09-24 14:26:54,333:139752376227584:INFO:JobSubmitterPoller:Done assigning site locations.
2017-09-24 14:28:20,194:139752376227584:INFO:SimpleCondorPlugin:Done submitting jobs for this cycle in SimpleCondorPlugin
2017-09-24 14:28:21,237:139752376227584:INFO:JobSubmitterPoller:Jobs that succeeded/failed submission: 1000/0.
2017-09-24 14:28:24,099:139752376227584:INFO:DashboardReporter:Handling 1000 jobs
2017-09-24 14:32:12,139:139752376227584:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2017-09-24 14:34:16,114:139752376227584:INFO:JobSubmitterPoller:Refreshing priority cache with currently 186379 jobs
2017-09-24 14:34:16,800:139752376227584:INFO:JobSubmitterPoller:Skipping cache update to be submitted. (186379 job in cache)
2017-09-24 14:34:16,800:139752376227584:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2017-09-24 14:34:16,804:139752376227584:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2017-09-24 14:34:19,245:139752376227584:INFO:JobSubmitterPoller:Site submission report: {'T1_DE_KIT': Counter({'submitted': 17}), u'T1_IT_CNAF': Counter({'submitted': 769}), 'T2_FR_IPHC': Counter({'NoPendingSlot': 34}), 'T2_FR_CCIN2P3': Counter({'NoPendingSlot': 59, 'submitted': 2}), 'T2_IT_Rome': Counter({'NoPendingSlot': 1}), 'T1_UK_RAL': Counter({'submitted': 40, 'NoPendingSlot': 22}), 'T1_US_FNAL': Counter({'NoPendingSlot': 869}), 'T2_DE_RWTH': Counter({'submitted': 8}), 'T2_IT_Legnaro': Counter({'submitted': 4}), 'T2_UK_London_Brunel': Counter({'NoPendingSlot': 31}), 'T2_BE_IIHE': Counter({'NoTaskSlot': 110}), 'T2_IT_Bari': Counter({'submitted': 54}), 'T2_US_Nebraska': Counter({'submitted': 20}), 'T2_US_Purdue': Counter({'submitted': 86})}
2017-09-24 14:34:19,246:139752376227584:INFO:JobSubmitterPoller:Priority submission report: {50110000.0: Counter({'Total': 179, 'submitted': 26}), 30063000.0: Counter({'Total': 3, 'submitted': 2}), 50085000.0: Counter({'Total': 869, 'submitted': 1}), 50090000.0: Counter({'Total': 2}), 50063000.0: Counter({'Total': 55, 'submitted': 4}), 30085000.0: Counter({'Total': 6, 'submitted': 3}), 85000.0: Counter({'Total': 53, 'submitted': 28}), 63000.0: Counter({'Total': 959, 'submitted': 936})}
2017-09-24 14:34:19,247:139752376227584:INFO:JobSubmitterPoller:Have 29 packages to submit.
2017-09-24 14:34:19,247:139752376227584:INFO:JobSubmitterPoller:Have 1000 jobs to submit.
2017-09-24 14:34:19,247:139752376227584:INFO:JobSubmitterPoller:Done assigning site locations.

Here is the Jira ticket:

https://its.cern.ch/jira/projects/CMSCOMPPR/issues/CMSCOMPPR-1350?filter=allopenissues

ticoann commented 7 years ago

@scarletnorberg, Scarlet, so JobSubmitter just crashes without any error messages? The messages you posted seem to be generic log messages. Did you restart JobSubmitter? It seems to be working fine now. Could you ping me if it happens again, without restarting it?

scarletnorberg commented 7 years ago

It is happening now. I won't restart it. It looks like the same message.

ticoann commented 7 years ago

Thanks Scarlet, I switched it to debug mode and restarted JobSubmitter.

scarletnorberg commented 7 years ago

Thanks!

vlimant commented 7 years ago

@amaltaro @ticoann it went down again, I am afraid.

ticoann commented 7 years ago

It seems to hang here. (Not every time, though: it succeeded many times before it failed.)

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L148

I think we need to look at the schedd log to see what happens. I will ask Farrukh.

Here is the last record we have.

2017-09-26 10:17:21,300:140359024850688:DEBUG:SimpleCondorPlugin:Start: Submitting 200 jobs using Condor Python SubmitMany
2017-09-26 10:17:26,579:140359024850688:DEBUG:SimpleCondorPlugin:Finish: Submitting jobs using Condor Python SubmitMany
2017-09-26 10:17:26,759:140359024850688:DEBUG:SimpleCondorPlugin:Start: Submitting 200 jobs using Condor Python SubmitMany
2017-09-26 10:17:41,324:140359024850688:DEBUG:SimpleCondorPlugin:Finish: Submitting jobs using Condor Python SubmitMany
2017-09-26 10:17:41,507:140359024850688:DEBUG:SimpleCondorPlugin:Start: Submitting 200 jobs using Condor Python SubmitMany
2017-09-26 10:17:47,643:140359024850688:DEBUG:SimpleCondorPlugin:Finish: Submitting jobs using Condor Python SubmitMany
2017-09-26 10:17:47,854:140359024850688:DEBUG:SimpleCondorPlugin:Start: Submitting 200 jobs using Condor Python SubmitMany
2017-09-26 10:17:52,866:140359024850688:DEBUG:SimpleCondorPlugin:Finish: Submitting jobs using Condor Python SubmitMany
2017-09-26 10:17:53,069:140359024850688:DEBUG:SimpleCondorPlugin:Start: Submitting 200 jobs using Condor Python SubmitMany

ticoann commented 7 years ago

JobSubmitter is restarted.

amaltaro commented 7 years ago

Everyone in the P&R team should be getting those automatic condor emails. But since I've just got back from holidays: did people notice how many condor instabilities we have been facing these last days? I counted at least 50 emails within the last 6 days and, besides adding some protection to the WMCore code, somebody should follow this up with the SI team as well.
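
One possible shape for the protection mentioned above, as a minimal sketch rather than WMCore code: run the blocking step in a child process and give up after a timeout, so a hung call cannot freeze the whole component. `run_isolated` and the two stand-in commands are hypothetical; the sketch assumes the submission step could be wrapped in a separate helper command.

```python
import subprocess
import sys


def run_isolated(cmd, timeout_secs):
    """Run cmd in a child process; return its stdout, or None on timeout.

    subprocess.run kills the child when the timeout expires, so a hung
    call cannot block the caller forever.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_secs, check=True)
    except subprocess.TimeoutExpired:
        return None
    return done.stdout


# Hypothetical stand-ins for a submission helper: one returns, one hangs.
quick = [sys.executable, "-c", "print('submitted')"]
stuck = [sys.executable, "-c", "import time; time.sleep(30)"]

print(run_isolated(quick, timeout_secs=10))
print(run_isolated(stuck, timeout_secs=1))
```

The caller can then log the timeout and retry in the next polling cycle instead of going silent.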

vlimant commented 7 years ago

As CRC, I have seen many emails about many things, including instability in couch, reqmgr, phedex and schedd. That is not a reason for JobSubmitter to crash in general, unless we add an automatic restart. The schedd instabilities are very likely due to running them at the limit, because most of the time we were running on 2-3 agents (since the other ones were down in the dust).

vytjan commented 6 years ago

@amaltaro it looks like this is related to the recent JobSubmitter deaths on the Tier0 agent. Or at least the component gets stuck in the same place: "using Condor Python SubmitMany"... This is sensitive for the Tier0, as such misbehavior creates delays.

Copying the email below:

We are facing JobSubmitter dying silently time after time on the Tier0 production WMAgent instance (vocms0313, which is running version 1.1.12.patch3). We have enabled debug logs for the component, but they do not look very informative [1]. However, it looks like the component died or got stuck while submitting jobs using the Condor Python SubmitMany method. As Juan is on vacation now, you can contact me if any more information is needed. We agreed not to restart the component whenever we observe such a case, but as it happened during the night, we could not allow Tier0 to stay stopped for multiple hours because of the resulting delays.

[1]

2018-05-09 02:30:32,494:140574958565120:DEBUG:DashboardAPI:Sending info to dashboard for jobid: 0f65675e-4f97-425d-952f-2fb641a42435-137_0
2018-05-09 02:30:32,496:140574958565120:DEBUG:DashboardAPI:Sending info to dashboard for jobid: 0f65675e-4f97-425d-952f-2fb641a42435-138_0
2018-05-09 02:30:32,497:140574958565120:DEBUG:DashboardAPI:Sending info to dashboard for jobid: dc60f35c-307c-4086-b6fb-5abaee8d0af3-0_0
2018-05-09 02:30:50,394:140574958565120:DEBUG:JobSubmitterPoller:Propagating fail state to WMBS.
2018-05-09 02:30:50,395:140574958565120:DEBUG:JobSubmitterPoller:Updating job location...
2018-05-09 02:30:50,548:140574958565120:INFO:JobSubmitterPoller:Transaction cycle successfully completed.
2018-05-09 02:30:50,567:140574958565120:INFO:BaseWorkerThread:JobSubmitterPoller took 30.672 secs to execute
2018-05-09 02:30:50,881:140574958565120:DEBUG:LogDB:LogDB delete request, res=None
2018-05-09 02:31:21,290:140574958565120:DEBUG:LogDB:LogDB delete request, res=None
2018-05-09 02:31:21,595:140574958565120:DEBUG:LogDB:LogDB delete request, res=None
2018-05-09 02:31:21,697:140574958565120:INFO:JobSubmitterPoller:Refreshing priority cache with currently 0 jobs
2018-05-09 02:31:21,791:140574958565120:INFO:JobSubmitterPoller:Found 34 new jobs to be submitted.
2018-05-09 02:31:21,791:140574958565120:INFO:JobSubmitterPoller:Determining possible sites for new jobs...
2018-05-09 02:31:21,855:140574958565120:INFO:JobSubmitterPoller:Done pruning killed jobs, moving on to submit.
2018-05-09 02:31:21,858:140574958565120:INFO:JobSubmitterPoller:Site submission report: {'T2_CH_CERN': Counter({'submitted': 34})}
2018-05-09 02:31:21,858:140574958565120:INFO:JobSubmitterPoller:Priority submission report: {260000.0: Counter({'Total': 3, 'submitted': 3}), 30250000.0: Counter({'Total': 1, 'submitted': 1}), 50255000.0: Counter({'Total': 1, 'submitted': 1}), 30255000.0: Counter({'Total': 1, 'submitted': 1}), 250000.0: Counter({'Total': 8, 'submitted': 8}), 50250000.0: Counter({'Total': 20, 'submitted': 20})}
2018-05-09 02:31:21,859:140574958565120:INFO:JobSubmitterPoller:Have 10 packages to submit.
2018-05-09 02:31:21,859:140574958565120:INFO:JobSubmitterPoller:Have 34 jobs to submit.
2018-05-09 02:31:21,859:140574958565120:INFO:JobSubmitterPoller:Done assigning site locations.
2018-05-09 02:31:21,866:140574958565120:DEBUG:BossAirAPI:About to submit 34 jobs to plugin SimpleCondorPlugin
2018-05-09 02:31:21,944:140574958565120:DEBUG:SimpleCondorPlugin:Start: Submitting 34 jobs using Condor Python SubmitMany
2018-05-09 03:43:50,496:140358371514176:INFO:Harness:>>>Starting: JobSubmitter<<<
2018-05-09 03:43:50,497:140358371514176:INFO:Harness:>>>Initializing default database
2018-05-09 03:43:50,497:140358371514176:INFO:Harness:>>>Check if connection is through socket
2018-05-09 03:43:50,498:140358371514176:INFO:Harness:>>>Setting config for thread:
2018-05-09 03:43:50,498:140358371514176:INFO:Harness:>>>Building database connection string
2018-05-09 03:43:50,502:140358371514176:DEBUG:DBFactory:Using SQLAlchemy v.0.9.6
...
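
A small helper can locate such silences in a component log. This is only a sketch: it assumes every record starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp, as in the excerpts in this thread, and `find_gaps` is a hypothetical name.

```python
from datetime import datetime


def parse_timestamp(line):
    """Return the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp, or None."""
    try:
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        return None


def find_gaps(lines, min_gap_secs=900):
    """Yield (seconds, previous_line, next_line) for each long silence."""
    prev_time, prev_line = None, None
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is None:
            continue  # skip lines without a timestamp, such as '...'
        if prev_time is not None:
            gap = (stamp - prev_time).total_seconds()
            if gap >= min_gap_secs:
                yield gap, prev_line, line
        prev_time, prev_line = stamp, line


# Three records copied from the log above.
sample = [
    "2018-05-09 02:31:21,866:140574958565120:DEBUG:BossAirAPI:About to submit 34 jobs to plugin SimpleCondorPlugin",
    "2018-05-09 02:31:21,944:140574958565120:DEBUG:SimpleCondorPlugin:Start: Submitting 34 jobs using Condor Python SubmitMany",
    "2018-05-09 03:43:50,496:140358371514176:INFO:Harness:>>>Starting: JobSubmitter<<<",
]
for seconds, before, after in find_gaps(sample):
    print("silent for %.0f s after: %s" % (seconds, before))
```

On these records it reports a single gap of roughly 72 minutes, starting at the "Start: Submitting 34 jobs" line.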

amaltaro commented 6 years ago

Vytas, since I'm time constrained until tomorrow evening, you might want to create this cronjob (download the script too :)) in the production T0 agent, so that JobSubmitter gets automatically restarted if it goes quiet for more than 15 minutes: https://github.com/dmwm/WMCore/blob/master/bin/deploy-wmagent.sh#L355

Then we can start investigating it. This isn't something new; we have seen the component getting stuck in submitMany in the past. Have you got any schedd alarms for that box? I have a suspicion that submitMany might be having issues when the schedd is unresponsive, or something like that.
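
The idea behind such a watchdog can be sketched as follows. This is not the script linked above, only an illustration: it assumes "quiet" means the component log file has not been written to recently, and `needs_restart` is a hypothetical name. The actual restart command is left out on purpose.

```python
import os
import tempfile
import time


def seconds_since_last_write(log_path, now=None):
    """Return how long ago the log file was last modified, in seconds."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path)


def needs_restart(log_path, max_quiet_secs=15 * 60, now=None):
    """True when the log has been silent for longer than max_quiet_secs."""
    return seconds_since_last_write(log_path, now) > max_quiet_secs


# Demo on a throwaway file whose mtime is pushed 20 minutes into the past.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_log = handle.name
stale = time.time() - 20 * 60
os.utime(demo_log, (stale, stale))
print(needs_restart(demo_log))
os.remove(demo_log)
```

A cron entry would run a check like this every few minutes and trigger the agent's own restart procedure when it returns True.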

vytjan commented 6 years ago

I added the cronjob you mentioned to vocms0313. Earlier this morning, JobSubmitter died the same way on vocms0313. However, there weren't any schedd alarms (nor any other alarms, besides the WMStats ones), so I am not sure whether that helps.

bbockelm commented 6 years ago

That method invokes C++ code. Could you get a stack trace from it?
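
One way to capture such a trace from a stuck process, as a sketch: attach gdb in batch mode and dump the stacks of all threads. It assumes gdb is installed on the agent node and that the user may attach to the process; `dump_native_stacks` is a hypothetical helper and the PID has to be looked up separately.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that prints all thread stacks of a process."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_native_stacks(pid, timeout_secs=120):
    """Run gdb against pid and return its combined text output."""
    result = subprocess.run(gdb_backtrace_command(pid), capture_output=True,
                            text=True, timeout=timeout_secs)
    return result.stdout + result.stderr


# Show the command for an example PID without attaching to anything.
print(" ".join(gdb_backtrace_command(12345)))
```

The trace is only useful if it is taken while the component is still hung, that is, before any restart.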

amaltaro commented 5 years ago

Let's observe how it behaves with the latest HTCondor version, 8.8. It has been some time since I last saw this problem.