ATLAS-Titan / PanDA-WMS-paper

Paper about PanDA architecture and characterization

Figure 4 -- modify: make current or remove March 2017? #20

Closed shantenujha closed 7 years ago

shantenujha commented 7 years ago

I would either (i) remove March 2017 (partial) if it is not included in the experimental window, or (ii) redefine the experimental window to include post-SC data, if the data to date are easily available.

wellsjc commented 7 years ago

It is easy to produce this data through the end of June 2017. I am collecting it now and will update the Google doc. Whoever owns the figure in the paper will need to update it.

wellsjc commented 7 years ago

Shantenu, Matteo,

The updated data, through the end of June 2017, are attached as an Excel chart. The second chart, beginning on line 39 of the spreadsheet, is in the familiar "core-hour" unit, where there are 16 x86 cores per Titan node and the GPUs are ignored. That is, the second chart has been scaled by 16/30.
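
A minimal sketch of that rescaling, for reference (only the 16 x86 cores per node and the 16/30 factor come from the description above; the 30 core-hours-per-node-hour charging factor for Titan allocations is an assumption):

```python
# Minimal sketch of the 16/30 rescaling described above.
# Assumption: the raw numbers are in Titan allocation units, charged at
# 30 core-hours per node-hour, while each Titan node has 16 x86 cores
# (the GPUs are ignored).

TITAN_CHARGE_FACTOR = 30   # assumed allocation core-hours charged per node-hour
X86_CORES_PER_NODE = 16    # physical CPU cores per Titan node


def to_x86_core_hours(titan_core_hours: float) -> float:
    """Rescale Titan allocation core-hours to plain x86 core-hours (x 16/30)."""
    return titan_core_hours * X86_CORES_PER_NODE / TITAN_CHARGE_FACTOR


# Example: 1.0M Titan core-hours -> ~0.53M x86 core-hours
print(to_x86_core_hours(1_000_000))
```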

I have updated the "backfill consumption" chart in the Google doc.

Jack

mturilli commented 7 years ago

Hi Jack, thank you. I have plotted the new diagram attached to this message. I have two questions:

  1. Do we know why we had a dip in April 2017?
  2. The percentages seem off, especially those of May and June. This makes me think I am missing something about the meaning of the two bars and the percentages taken from the updated Google doc.

Regarding using the new diagram in the paper, Shantenu and I share the concern that, without updating all the data related to the 'experiment time window', we would attract criticism from the referees. The data we would have to update are:

I am worried it might be too late to update these data.

mturilli commented 7 years ago

(Attachment: backfill_consumption_updated chart)

wellsjc commented 7 years ago

Hi Matteo, we had a dip in April 2017 because of a system software update on the OLCF DTN that prevented jobs from starting. This occurred while Danila was on vacation, camping with no internet connection, so it took several days before the appropriate updates could be made to fix the problem. Effectively, PanDA was out of operation for several days to one week in April.

As shown in the Google doc on backfill, the percentage is computed as (amount used in backfill) / (amount used in backfill + amount not used).
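
A minimal sketch of that calculation (the monthly figures below are hypothetical, not taken from the Google doc):

```python
# Sketch of the backfill-consumption percentage defined above:
#   percentage = used_in_backfill / (used_in_backfill + not_used)

def backfill_percentage(used_in_backfill: float, not_used: float) -> float:
    """Percentage of otherwise-idle backfill hours that were actually consumed."""
    return 100.0 * used_in_backfill / (used_in_backfill + not_used)


# Hypothetical monthly figures, in core-hours:
print(backfill_percentage(used_in_backfill=2.5e6, not_used=7.5e6))  # 25.0
```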

I do not have a way to update the additional data you have listed. We would need Danila or Sergey to get this data.

shantenujha commented 7 years ago

Jack -- Did the breakage during Danila's absence take place during April? My memory seems to suggest it was later than that.

shantenujha commented 7 years ago

On Jun 7, 2017, at 4:34 PM, Danila Oleynik danila.oleynik@XXXX wrote:

(from a campground with limited internet access)

shantenujha commented 7 years ago

Also, I'm concerned that the grey bar is smaller in the same month that the blue bar is small. This suggests a correlation, and thus possibly a systematic error in the calculations.

wellsjc commented 7 years ago

Shantenu,

Yes, you are correct. Sorry.

In April, there was a break in the work CERN "gave" to us; there was none for a period of time. That is what I remember.

Jack

panitkin commented 7 years ago

Hi,

Shantenu is right - Danila's vacation happened in late May, early June.

April was bad for multiple reasons.

ATLAS was switching to a new set of tasks that required installation of a new ATLAS software release.

So they stopped submitting new tasks to Titan and waited for the old tasks to drain.

The switch to the new ATLAS release revealed an incompatibility with mpi4py on Titan.

After that problem was diagnosed and fixed and jobs from the new tasks started to run, we found out that they took, on average, longer than 2 hours to finish.

Most of the jobs failed, and some tasks couldn't even get through the scouting phase (the first 10 jobs).

It took ATLAS a few days to agree on a job definition for Titan with a factor of 2 fewer events per job than before (50 vs. 100), to reduce the run time.
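
A rough sketch of the arithmetic behind that choice (assuming the wall-time limit for small backfill submissions is about 2 hours, as implied above; the per-event time is an illustrative assumption, not a measured value):

```python
# Rough sketch: if a 100-event job exceeds the ~2-hour wall-time limit for
# small backfill submissions, halving the events per job should bring the
# run time back under the limit.
import math

WALLTIME_LIMIT_MIN = 120.0   # assumed limit for small backfill submissions
MINUTES_PER_EVENT = 1.4      # assumed average simulation time per event


def max_events_per_job(limit_min: float = WALLTIME_LIMIT_MIN,
                       min_per_event: float = MINUTES_PER_EVENT) -> int:
    """Largest whole number of events that fits inside the wall-time limit."""
    return math.floor(limit_min / min_per_event)


print(max_events_per_job())        # 85 with these assumptions
print(100 * MINUTES_PER_EVENT)     # 140.0 min: a 100-event job is too long
print(50 * MINUTES_PER_EVENT)      # 70.0 min: a 50-event job fits easily
```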

After that, jobs started to run OK, and as an additional benefit of that switch we started to utilize a larger fraction of backfill in the following months.

Add to that the Titan outage on April 8-10.

Yep, April was a tough month.

I attached a plot that shows daily job activity for April and my old weekly report to ATLAS management on production on Titan.

Cheers,

     Sergey

On Apr 13, 2017 01:42, "Sergey Panitkin" <panitkin@bnl.gov> wrote: Hi Torre,

Last week Titan processed ~18 jobs and ~1.8k events.

Most of the week was lost due to the completion of previous tasks, a long maintenance shutdown on Titan, and the switch to new tasks with the new release 21.0.15.

The switch to the new, CMake-based release was not problem-free, from a new installation procedure to new incompatibilities with Titan's software, but it seems to be sorted out now.

New jobs started to flow in, but a new problem popped up: the current jobs run longer than the time limit for small submissions on Titan. So far, most of the scout jobs have failed because of this.

I asked Ivan Glushkov about the possibility of shorter jobs, with 50 events per job instead of 100. I don't have his reply yet.

Cheers,

         Sergey

wellsjc commented 7 years ago

The grey bar is correct. Usage of available hours on Titan in April was high, so unused hours were low. They were low not because of PanDA but because of other users running on the machine.

Jack

shantenujha commented 7 years ago

Thank you Jack, Sergey.

I think we will stick with the old plots and add a note about an upward trend in usage, without providing data.