Can you say/show a little more about the before and after performance for the ndlar simulation? In 2x2, we batch module-by-module (https://github.com/DUNE/larnd-sim/blob/485c1634b1d918637ff1d3071b3f4c104096fedd/cli/simulate_pixels.py#L806) and then for each module we simulate all events (https://github.com/DUNE/larnd-sim/blob/485c1634b1d918637ff1d3071b3f4c104096fedd/cli/simulate_pixels.py#L840) before moving onto the next module. This looks right by eye, under that batching scheme.
Unfortunately this isn't an attempt at an improvement in performance. I posted here in the larnd-sim channel on Slack about a possible bug. In the line linked from issue #240 there is an assumption of 1 batch per event (1 batch per spill) baked in. My thinking is that this has been "invisible" for recent 2x2 productions because event_batch_size has been set to 8 (equal to the number of TPCs), which by coincidence gives you 1 batch per event. This problem line of code wasn't there when I last ran larnd-sim for NDLAr over Christmas.
What is the evidence you have that this line is a bug? We have been running with it in 2x2 simulation since moving to module-by-module batching (event_batch_size = 2), and it has been handling the data flow appropriately.
Hi @alexbooth92 and @krwood, following up the discussion here. I see the problem that Alex pointed out. In this line, it is assumed that event_batch_size explicitly matches the number of TPCs in one module or in all modules, depending on whether mod2mod_variation is activated. The logic there is to select from all simulated events, so for example with 70 TPCs and event_batch_size set to 2, a single event gives you 35 batches, while np.unique(all_mod_tracks[sim.EVENT_SEPARATOR]) is an array with only one element if you simulate one event.
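For illustration, a minimal standalone sketch of that mismatch (the arrays here are made up, not taken from larnd-sim):

import numpy as np

# One simulated event (spill 1000) with segments spread across many TPCs.
event_ids = np.full(5000, 1000)        # stand-in for all_mod_tracks[sim.EVENT_SEPARATOR]
unique_events = np.unique(event_ids)   # array([1000]): one entry per simulated event

# With 70 TPCs and event_batch_size = 2 there are 35 batches for this single event,
# but the unique-event array only has length 1, so the batch counter overruns it.
print(unique_events[1 - 1])            # i_batch = 1: fine, prints 1000
try:
    print(unique_events[2 - 1])        # i_batch = 2: fails immediately
except IndexError as err:
    print(err)                         # "index 1 is out of bounds for axis 0 with size 1"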
Slight objection to the proposed solution: it will skip the batch if there are no segments in it. However, in 2x2 for example, we would still want to generate light signals even if there are no segments (depending on the light trigger scheme). See an alternative suggestion in the file comment.
Given that you are hitting memory issues, I would suggest setting event_batch_size to 1, together with a fix to the bug you pointed out, before lowering batch_size.
To add some proof - without either of the suggested fixes above, using the following simulation_properties, and running over 3 events only with --n_events 3:
batch_size: 2500 # track segments
event_batch_size: 2 # tpcs
You get the following - I have tried to make the print-outs self-explanatory:
******************
RUNNING SIMULATION
******************
Skipping non-active volumes... 0.20 s
Quenching electrons... 0.61 s
Drifting electrons... 0.20 s
Simulating batches...: 0%| | 0/105 [00:00<?, ?it/s]
PRINTING all_mod_tracks[sim.EVENT_SEPARATOR]
[1000 1000 1000 ... 1002 1002 1002]
PRINTING i_batch-1
0
PRITING np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])
[1000 1001 1002]
PRITING len(np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])):
3
Simulating batches...: 1%|▏ | 1/105 [00:04<07:03, 4.07s/it]
PRINTING all_mod_tracks[sim.EVENT_SEPARATOR]
[1000 1000 1000 ... 1002 1002 1002]
PRINTING i_batch-1
1
PRITING np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])
[1000 1001 1002]
PRITING len(np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])):
3
Simulating batches...: 2%|▍ | 2/105 [00:05<04:18, 2.51s/it]
PRINTING all_mod_tracks[sim.EVENT_SEPARATOR]
[1000 1000 1000 ... 1002 1002 1002]
PRINTING i_batch-1
2
PRITING np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])
[1000 1001 1002]
PRITING len(np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])):
3
Simulating batches...: 3%|▌ | 3/105 [00:05<03:10, 1.87s/it]
PRINTING all_mod_tracks[sim.EVENT_SEPARATOR]
[1000 1000 1000 ... 1002 1002 1002]
PRINTING i_batch-1
3
PRITING np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])
[1000 1001 1002]
PRITING len(np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])):
3
Traceback (most recent call last):
File "/global/u1/a/abooth/dune/Production/MicroProdN1p1/2x2_sim/run-larnd-sim/larnd.venv/bin/simulate_pixels.py", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/global/u1/a/abooth/dune/Production/MicroProdN1p1/2x2_sim/run-larnd-sim/larnd-sim/cli/simulate_pixels.py", line 1217, in <module>
fire.Fire(run_simulation)
File "/global/u1/a/abooth/dune/Production/MicroProdN1p1/2x2_sim/run-larnd-sim/larnd.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/global/u1/a/abooth/dune/Production/MicroProdN1p1/2x2_sim/run-larnd-sim/larnd.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/global/u1/a/abooth/dune/Production/MicroProdN1p1/2x2_sim/run-larnd-sim/larnd.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/global/u1/a/abooth/dune/Production/MicroProdN1p1/2x2_sim/run-larnd-sim/larnd-sim/cli/simulate_pixels.py", line 819, in run_simulation
ievd = np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])[i_batch-1]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
IndexError: index 3 is out of bounds for axis 0 with size 3
real 1m7.293s
user 0m11.653s
sys 0m51.578s
My understanding here is that np.unique(all_mod_tracks[sim.EVENT_SEPARATOR]) will return an array of unique "spill numbers". So entry 0 will be the first spill number, entry 1 the second, etc. The indexing of this array therefore has nothing to do with batching unless you are doing one batch per spill.
I haven't tried @YifanC 's solution yet.
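A tiny example of that behaviour (the spill numbers mirror the ones in the log above):

import numpy as np

spill_ids = np.array([1002, 1000, 1000, 1001, 1002])  # per-track spill numbers, in any order
print(np.unique(spill_ids))                           # [1000 1001 1002]
# Entry 0 is the first spill, entry 1 the second, and so on: the index is an
# event index, not a batch index, unless there is exactly one batch per spill.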
If I try @YifanC's solution verbatim I get a similar out of bounds error.
My understanding is this:
all_mod_tracks[sim.EVENT_SEPARATOR] is a list of event IDs / spill IDs with a length equal to the number of tracks in all modules.
np.unique(all_mod_tracks[sim.EVENT_SEPARATOR]) is a list of event IDs / spill IDs with a length equal to the number of events or spills.
i_batch runs from 0 to the number of batches per event * the number of events, which equals the number of modules * event_batch_size * the number of events if and only if event_batch_size is equal to the number of TPCs per module.
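Putting numbers to that for the run shown above (assuming the 70-TPC ndlar geometry quoted earlier in the thread):

n_tpc            = 70                                 # ndlar TPC count mentioned above
event_batch_size = 2                                  # from the simulation_properties above
n_events         = 3                                  # --n_events 3
batches_per_event = n_tpc // event_batch_size         # 35
total_batches     = batches_per_event * n_events      # 105, matching "0/105" in the log
print(batches_per_event, total_batches)
# np.unique(all_mod_tracks[sim.EVENT_SEPARATOR]) has only n_events = 3 entries,
# so indexing it with i_batch - 1 falls over once i_batch reaches 4, exactly the
# "index 3 is out of bounds for axis 0 with size 3" in the traceback above.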
Yifan has highlighted that the fix here is two-fold. We need a piece of code that:
1. gets the event_id for the set of masked tracks that you are looking at (track_subset = tracks[batch_mask]) and sets ievd to it;
2. handles batches that contain no track segments, so that light can still be simulated.
There may be something more subtle that I am missing, but I don't really understand why we have to use the batch indexing at all just to get the current event_id when that relationship is complex. Why not just
ievd = track_subset["event_id"][0]
to do 1., as originally suggested? It also means we are not hardcoding any values of "2" anywhere in the code to bite us later.
To do 2., as Yifan pointed out, the original solution would not simulate light in the case of no tracks, but I think we could just remove the
else:
continue
I think the other proposed solution would just fall over in the case of no tracks? Although I haven't thought too much about it.
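As a minimal standalone illustration of those two points (the structured array here is made up; only the event_id field name comes from the discussion above):

import numpy as np

# stand-in for tracks[batch_mask]: a few segments all belonging to spill 1001
tracks = np.zeros(4, dtype=[("event_id", "i4"), ("dE", "f4")])
tracks["event_id"] = 1001

track_subset = tracks                   # a batch that does contain segments
print(track_subset["event_id"][0])      # 1001: point 1 works directly, no batch arithmetic

empty_subset = tracks[:0]               # a batch with no segments at all
try:
    print(empty_subset["event_id"][0])
except IndexError:
    print("no segments: ievd has to come from somewhere else")   # this is point 2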
Hi @alexbooth92, thanks for the follow-up. Yeah, in 2x2 the light detector spans the entire detector (all modules are triggered together), so even if there's no "signal" we would need to write some readout for it. The trigger scheme is not settled for ndlar yet. It may not be the same, but I would like to maintain the possibility to run the 2x2 trigger scheme. If we simply remove
else:
continue
then, when ievd is required in the following code (for example, here), it will raise errors.
Can you provide some details about how, using ievd = np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])[ (i_batch * tpc_batch_size) // n_tpc_batch], you still get an out-of-bounds error, if you have that information ready? Otherwise, I can try something on my end. This definitely is not a unique solution, but it may yield a relatively small disturbance to other parts of the code. I'd like to understand what the issue is here.
Re. the out-of-bounds error with your solution, it might just be an off-by-one, but I can't check right now because Perlmutter is down. I think I understand what you have tried to do and I think it makes sense 🤔. That said, we should not be hard-coding anything; factors of 2 in code scare me.
At the risk of sounding like a broken record 😅 I still don't really understand why we need this complex bit of logic just to get the event ID...
You are right. My proposed solution only works when n_tpc_batch % event_batch_size == 0, and it is more restrictive than what the batching supports. Looking at the code, I think it is okay to set event_batch_size to 4 while having 70 TPCs in total, except for the part we are talking about. On the other hand, the only difference is at the end of an n_tpc_batch cycle. For example, batch 16 is TPCs 64-67, batch 17 is TPCs 68-69, and batch 18 is TPCs 0-3. That's my understanding of the batching part. Therefore, we can do something like
batch_cycle = n_tpc_batch // tpc_batch_size + 1   # 18 in the ndlar case
ievd = np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])[i_batch // batch_cycle]
It should cover the case even when n_tpc_batch % event_batch_size != 0, although it's not very pretty. That being said, I haven't tested my understanding nor what I proposed above. I might try later today.
Hey @alexbooth92, oh I think I know why you get an out-of-bounds error for the change I suggested. When you get into the sub-batching controlled by batch_size (number of segments), you can have multiple batches per event within a module. For the same reason, what I proposed earlier today would not work either.
Sorry, and I totally agree with you that we can get the event_id like ievd = track_subset["event_id"][0]. We just need to figure out something for the case where there are no tracks. Do you mind if I try to push something to this branch?
Sure please do!
Hi @alexbooth92, it ended up as a one-liner. I added the event_id to the batch output, and I think this should be a general solution. I tested with a 2x2 file for 3 events by setting event_batch_size to 1 (it can be any number), and the batching behaves as expected. Can you confirm that it works with an ndlar file? If so, we can merge it.
I think this is a nice solution Yifan! I can confirm I no longer see the errors that I described above with the full NDLAr. Requesting permission again to merge :) @krwood
A couple of things in here needed to be fixed before a second pass of larnd-sim with the full ND geometry, spotted while debugging MiniProdN1.2.
While testing with a new round of files I came across a bug, summarised in issue #240. This PR will close #240.