Mu2e / Production

Scripts and fcl for collaboration production procedures

Multi-threading issues #222

Open sophiemiddleton opened 1 year ago

sophiemiddleton commented 1 year ago

Please look through the Production fcl files for inappropriate uses of multi-threading.

We should only use multi-threading when G4 is the dominant CPU time in the job. Consider the example that Alessandro showed on Wednesday, which is based on

Production/JobConfig/pileup/MuStopPileup.fcl

This uses 2 threads and 2 schedules. Please set them both to 1.
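For reference, a minimal sketch of the change, assuming the scheduler parameters are set in the usual way in the fcl (the actual file may set them in a different place):

```
# Sketch: run this resampling-dominated job single-threaded.
services.scheduler.num_threads: 1
services.scheduler.num_schedules: 1
```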

This job is dominated by the time spent in PileupPath:TargetStopResampler:ResamplingMixer, an average of 2.3 seconds per event out of 2.4 total seconds per event. The time spent in G4 is only 0.2 seconds per event. The only module in this job that is parallelizable is the G4 module (and maybe the event generator), so art serializes everything else. The net result is that when an instance of ResamplingMixer is running, it blocks the other thread until it completes.
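If anyone wants to reproduce these per-module numbers, art's TimeTracker service prints a per-module timing summary at the end of the job. A minimal sketch, assuming the service is not already configured in the fcl:

```
# Sketch: enable art's per-module timing summary, printed at end of job.
services.TimeTracker: { printSummary: true }
```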

If you run with 1 thread the job completes in some amount of wall clock time. If you run with 2 threads it completes in very slightly less wall clock time but it is using 2 CPUs, not one. Each CPU is idle half of the time.

Let me know if you have any questions.

Thanks,

Rob

brownd1978 commented 1 year ago

Good job figuring this out, Rob. The reason the pileup job is multi-threaded is that it invokes G4. We decided at the beginning of MDC2020 to multi-thread all the G4 jobs. At that time we thought this was safe. Note that irreproducibility doesn't affect the physics quality of the output.

Do you know which module is inappropriate for multi-threading? I thought all modules that can't support multi-threading were run sequentially, so I don't understand how this job is irreproducible. Is it coming from the G4 module itself? If so, does this point to a problem in how multi-threaded G4 jobs get their seeds?


rlcee commented 1 year ago

I thought the way this should work is that when MT is turned on, only modules that are explicitly declared as MT-ready can be run in MT mode. Geant is our only MT-ready module; all others are declared legacy or default to legacy. The Geant MT is reproducible. Therefore this MT job should be reproducible. Which statement is wrong?

kutschke commented 1 year ago

The most important thing is that Dave is right when he says that repeatability in the sense we are talking about here does not affect physics quality.

I hope that the following will answer the other questions in both Ray's and Dave's posts.

Both are right that Mu2eG4MT_module is our only MT-ready module. art assumes that all legacy modules are unsafe to run in parallel, so it serializes execution of those. It assumes that it is safe to run any number of MT-ready modules in parallel. It also assumes that it is safe to run any one legacy module in parallel with any number of MT-ready modules.

Consider a job with 2 threads. If thread 1 is running a legacy module and thread 2's next task is also a legacy module, then thread 2 blocks until the module on thread 1 is finished. If thread 1 is running a legacy module and thread 2's next task is an MT-enabled module, then the two threads will run in parallel. If thread 1 is running an MT-enabled module and thread 2's next task is any module, MT-enabled or not, then the module on thread 2 will run. That covers the full 2x2 matrix of possibilities. If you go to 3 or more threads the analysis is similar: only one legacy module can be running at a time and any number of MT-ready modules may be running in parallel with it.

The other important point is that there is some non-determinism in the scheduling algorithms and there are race conditions between threads.

I am not 100% sure about the legacy/MT status of the input and output modules but I do know that they can be active only in one thread at a time. They might have internal locking or they might rely on art's locking of legacy modules. In any case, they are not true MT; that's driven by limitations of ROOT I/O.

In a typical stage 1 job, the CPU time is dominated by the G4 module, often >90% of the time. And the art schedule has the form: (source module, some legacy modules, G4MT module, some more legacy modules, output module(s)). Most of the time the job will be executing two G4 threads in parallel. One of the threads will finish with G4; most of the time it will run through the rest of its schedule, start the schedule for the next event and re-enter G4. All of this time the other thread was busy in G4.

From time to time both threads will be out of G4 and running through the other modules in their schedules. When that happens the legacy modules on the two threads will block each other. Roughly speaking, the threads will alternate modules until one of the threads gets back to G4. During this period, execution speed drops to nominally 50%. (I glossed over the fact that the order of execution of modules is not guaranteed to be strictly alternating between threads - you may get 2 modules from thread 1 followed by one module from thread 2 and then back to thread 1.)

So that's the thread/schedule mechanics. How do we break the sequence of random numbers?

Depending on race conditions, events are not guaranteed to arrive at any particular legacy module in the same order on every run. When that happens the sequence of random numbers breaks.

We do reseed G4 every event in a deterministic way based on art::EventID. Issue Offline#849 (https://github.com/Mu2e/Offline/issues/849) discusses seeding all modules this way. This would fix the non-repeatability that Alessandro found. Aside: I misread Ray's analysis the first time; in his example it would add 0.25% to the time to process an event, and I agree that we can tolerate that (I had misread it as 25%, which would not be acceptable - sorry for the confusion this caused).

In the job that Alessandro commented on, G4 is only a tiny fraction of the total CPU time, so the job spends most of its wall clock time with one thread active and one blocked. It would be best to run it single-threaded.

I have not thought carefully about the intermediate case where we spend maybe 50% or 60% of the time with 2 threads both running G4. I bet that there is no clean optimal answer; I expect that it will depend on the properties of the jobs that other experiments are running.

Let me know if I missed anything in the earlier questions.

rlcee commented 1 year ago

The fact that I was missing was that there is a legacy module with a random seed following the G4 module. In this case, I see the problem and agree it is the same as issue 849. Thanks for the thorough explanation. I see you commented about the random re-seed time. If that's OK, then maybe I can push that issue forward soon.

brownd1978 commented 1 year ago

Hi Rob,


Pileup is a resampling job; there is no input event, so I don't fully follow your logic. I guess you are saying that any random number use by a sequential module in a G4 MT job will break reproducibility. That raises the question: if we precompute random numbers and run resampling with a 'rnd' input dataset (as we discussed at the production workshop), will that solve this problem? Or does it require a deeper fix? Note that we run every G4 job except POT as a resampler.

The simple fix is to update the global production G4 setting to not multithread. I will put in that PR today. If we decide to run MT just for POT we can do that in the POT job config. Dave


sophiemiddleton commented 1 year ago

Hi everyone, I'm getting back to fixing these issues now. What is the status here?

kutschke commented 1 year ago

Hi Sophie,

Can you point me to the fcl that you will use for the campaign? Can you remind me if you plan to run both stage 1 and stage 2? Also, if running at NERSC, do you plan to submit to Cori I or Cori II (i.e. big-core machines or KNL)?

Rob


brownd1978 commented 1 year ago

Hi Rob, Cori is being decommissioned this month; the replacement is Perlmutter. I haven't yet tried to use Perlmutter but it is supposedly very similar. Dave


kutschke commented 1 year ago


Thanks Dave,

Perlmutter's CPU-only nodes have 128 cores and 512 GB of memory, so 4 GB/core.  That's a great fit for our jobs. 

I looked up the specs: https://docs.nersc.gov/systems/perlmutter/architecture/ . I had thought that Perlmutter was intended to be mostly GPUs but I see that the design is that most of the nodes are dual CPU with no GPUs. But there are indeed many nodes with 1 CPU plus 4 GPUs.

The GPU nodes also have 4 GB/core but I imagine we would rarely, if ever, be scheduled on those nodes since we have no code that can use the GPUs.

In the future, we could target AI/DL training for the GPU nodes.

Rob

rlcee commented 1 year ago

Cori is being decommissioned this month, the replacement is Perlmutter.

When we submit, we only say "site=NERSC", and where we land is determined by the agreements between computing and NERSC and/or matching the job ads. So I think to understand what is happening, we would need to talk to computing.

kutschke commented 1 year ago

I have an ongoing conversation with Steve Timm and will summarize here when it converges.

kutschke commented 1 year ago

Steve says that each core is hyperthreaded, so there are 256 logical cores per node and 2 GB/core. So our G4 jobs should continue to use 2 threads and 2 schedules for memory reasons; a sketch of that configuration appears after the quoted thread. Below is the rest of the thread:

Hi Rob -- I already made a Perlmutter entry for mu2e when I set it up; actually there are two entries, one for CPU and one for GPU. Actually the CPU nodes have 256 cores each. The one challenge is that you have only a 12 hour queue limit on Perlmutter; that will eventually go up. As soon as the new FIFE allocation kicks in on Jan. 18 we will be glad to get you started. There will be a different DESIRED_Sites field to set in jobsub and everything else should be the same. Actually Perlmutter is already in "Production" and has been for a couple months but the startup has been shaky to say the least. But when it's running it is a very nice machine.

Steve Timm

From: Robert K Kutschke Sent: Thursday, January 5, 2023 10:19 AM To: Andrew John Norman; Steven C Timm Subject: About NERSC

Hi Guys,

Sophie Middleton is back from vacation and restarting development towards running our next sim campaign at NERSC. I just learned that Cori will be shut down in a few weeks and that Perlmutter will soon be in production.

I checked the Perlmutter specs and see that their nodes are mostly 128 CPU cores with no GPUs and 4 GB/core. So this is a good match to our needs; even better than a typical grid node. See https://docs.nersc.gov/systems/perlmutter/architecture/ .

What's the status of Perlmutter access via HepCloud? Our jobs do not have code that can use the GPUs - I presume that there is a way to advertise that?

  Rob
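For the G4-dominated jobs, the corresponding scheduler setting would look something like the following. This is a sketch only, assuming the same services.scheduler parameters as above; the memory argument is presumably that the threads of one process share most of the G4 memory, so a 2-thread process fits a 2-core allocation at 2 GB per logical core:

```
# Sketch: keep 2 threads / 2 schedules for G4-dominated jobs so the
# (largely shared) G4 memory stays within ~2 GB per logical core on Perlmutter.
services.scheduler.num_threads: 2
services.scheduler.num_schedules: 2
```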