Open DanilaOleynik opened 6 years ago
Workaround: define a very large timeout for RequestHarvesterEvents, for example 30 minutes or 1 hour. The drawback is that if there really are no more events, the whole job occupies all ranks for the duration of this timeout, so small jobs will have very low efficiency.
In this job, the transform finished at 13:49, but at 13:53 it was still running.
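To make the efficiency cost of this workaround concrete, here is a rough sketch. It assumes a deliberately simplified model in which a rank does some amount of useful work and then sits idle for the full timeout; the function name `rank_efficiency` and the sample durations are hypothetical and not taken from Harvester or Yoda.

```python
def rank_efficiency(useful_minutes, idle_timeout_minutes):
    """Fraction of occupied wall time spent on useful work.

    Simplified model (an assumption): the rank works for useful_minutes,
    then waits idle for the whole timeout before the job ends.
    """
    total = useful_minutes + idle_timeout_minutes
    if total <= 0:
        raise ValueError("total occupied time must be positive")
    return useful_minutes / total


if __name__ == "__main__":
    # Hypothetical job lengths; the two timeouts are the ones suggested above.
    for timeout in (30, 60):
        for useful in (10, 30, 120):
            eff = rank_efficiency(useful, timeout)
            print(f"useful={useful:>3} min, timeout={timeout} min -> {eff:.0%}")
```

Under this model, 30 minutes of work followed by a 30-minute idle wait gives 50% efficiency, and the loss grows as jobs get shorter.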
-bash-4.2$ pwd
/lustre/atlas/proj-shared/csc108/eventservice/harvester/harvester-wguan
-bash-4.2$ vi workdir/harvester-messenger/23/yoda_droid_00026.log
2018-08-21 13:49:57|2433|46912515808640|00026|DEBUG|main|sleeping 60
2018-08-21 13:49:58|2456|46912515808640|00026|INFO|pandayoda.droid.TransformManager|transform has finished
(the line above repeats many times with the same timestamp)
2018-08-21 13:53:39|2456|46912515808640|00026|INFO|pandayoda.droid.TransformManager|transform has finished
(the line above repeats many times with the same timestamp)
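The delay visible in this log can be measured with a small script. The following is only a sketch: it assumes each log line has seven pipe-separated fields with the timestamp first and the message last, as in the excerpt above. The meaning of the second and third fields is not stated in this issue, so they are ignored, and the helper names are hypothetical.

```python
from datetime import datetime

# Two lines copied from the log excerpt above.
SAMPLE = [
    "2018-08-21 13:49:58|2456|46912515808640|00026|INFO|pandayoda.droid.TransformManager|transform has finished",
    "2018-08-21 13:53:39|2456|46912515808640|00026|INFO|pandayoda.droid.TransformManager|transform has finished",
]


def parse_line(line):
    """Split one log line into its fields (field layout is an assumption)."""
    timestamp, _field2, _field3, rank, level, logger, message = line.strip().split("|", 6)
    return {
        "time": datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"),
        "rank": rank,
        "level": level,
        "logger": logger,
        "message": message,
    }


def finished_span_seconds(lines):
    """Seconds between the first and last 'transform has finished' message."""
    times = [
        rec["time"]
        for rec in map(parse_line, lines)
        if rec["message"] == "transform has finished"
    ]
    if not times:
        return 0.0
    return (max(times) - min(times)).total_seconds()


if __name__ == "__main__":
    print(finished_span_seconds(SAMPLE))
```

For the two timestamps shown, the Droid kept reporting the finished transform for 221 seconds.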
On receiving NO_MORE_EVENTS, Droid simply stops all processes without waiting for AthenaMP to finish the queued events.
In this case a Droid had already received about 48 events (at 09:45:42, 09:45:45, and 09:48:02). Normally one Droid can process 32 events in 30 minutes, so 48 events are enough for this rank, but Droid still tried to get more events. When it received NO_MORE_EVENTS at 09:50:07, Droid stopped all processes without waiting for the 48 already-received events to finish.
Small gaps are the targets at OLCF.
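The behaviour one would expect instead can be sketched as a small state machine: NO_MORE_EVENTS only stops further requests, and the Droid shuts down once the events it already holds are done. This is an illustrative sketch, not the actual pandayoda code; the class `DrainingDroid` and its method names are hypothetical.

```python
class DrainingDroid:
    """Toy model of a Droid that drains queued events before stopping."""

    def __init__(self):
        self.pending = 0             # events received but not yet finished
        self.no_more_events = False  # set once NO_MORE_EVENTS arrives
        self.stopped = False

    def on_events_received(self, count):
        self.pending += count

    def on_no_more_events(self):
        # Stop asking for new events, but do not kill running work.
        self.no_more_events = True
        self._maybe_stop()

    def on_event_finished(self):
        if self.pending > 0:
            self.pending -= 1
        self._maybe_stop()

    def _maybe_stop(self):
        # Shut down only when nothing more will come and nothing is queued.
        if self.no_more_events and self.pending == 0:
            self.stopped = True


if __name__ == "__main__":
    droid = DrainingDroid()
    droid.on_events_received(48)   # the 48 events mentioned above
    droid.on_no_more_events()
    print(droid.stopped)           # still working on queued events
    for _ in range(48):
        droid.on_event_finished()
    print(droid.stopped)           # all queued events are done
```

In this sketch the first check reports that the Droid is still running, and it stops only after the 48th event finishes.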