Can you say/show a little more about the before and after performance for the ndlar simulation? In 2x2, we batch module-by-module (https://github.com/DUNE/larnd-sim/blob/485c1634b1d918637ff1d3071b3f4c104096fedd/cli/simulate_pixels.py#L806) and then for each module we simulate all events (https://github.com/DUNE/larnd-sim/blob/485c1634b1d918637ff1d3071b3f4c104096fedd/cli/simulate_pixels.py#L840) before moving onto the next module. This looks right by eye, under that batching scheme.
Unfortunately this isn't an attempt at an improvement in performance. I posted here in the larnd-sim channel on Slack about a possible bug. In the line linked from issue #240 there is an assumption of 1 batch per event (1 batch per spill) baked in. My thinking is that this has been "invisible" for recent 2x2 productions because event_batch_size has been set to 8 (equal to the number of TPCs), which by coincidence gives you 1 batch per event. This problem line of code wasn't there when I last ran larnd-sim for NDLAr over Christmas.
What is the evidence you have that this line is a bug? We have been running with it in 2x2 simulation since moving to module-by-module batching (event_batch_size = 2), and it has been handling the data flow appropriately.
Hi @alexbooth92 and @krwood, following up the discussion here. I see the problem that Alex pointed out. In this line, it is assumed that event_batch_size explicitly matches the number of TPCs in one module or in all modules, depending on whether mod2mod_variation is activated. The logic there is to select from all simulated events, so for example with 70 TPCs and event_batch_size set to 2, a single event gives you 35 batches, while np.unique(all_mod_tracks[sim.EVENT_SEPARATOR]) is an array with only one element if you simulate one event.
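For illustration, a minimal standalone sketch of that mismatch (the arrays here are made up, not taken from larnd-sim):

import numpy as np

# One simulated event (spill 1000) with segments spread across many TPCs.
event_ids = np.full(5000, 1000)        # stand-in for all_mod_tracks[sim.EVENT_SEPARATOR]
unique_events = np.unique(event_ids)   # array([1000]): one entry per simulated event

# With 70 TPCs and event_batch_size = 2 there are 35 batches for this single event,
# but the unique-event array only has length 1, so the batch counter overruns it.
print(unique_events[1 - 1])            # i_batch = 1: fine, prints 1000
try:
    print(unique_events[2 - 1])        # i_batch = 2: fails immediately
except IndexError as err:
    print(err)                         # "index 1 is out of bounds for axis 0 with size 1"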
Slight objection to the proposed solution: it will skip the batch if there are no segments in it. However, in 2x2 for example, we would still want to generate light signals even if there are no segments (depending on the light trigger scheme). See an alternative suggestion in the file comment.
Given that you are hitting memory issues, I would suggest setting event_batch_size to 1, together with a fix to the bug you pointed out, before lowering batch_size.
To add some proof - without either of the suggested fixes above, using the following simulation_properties, and running over 3 events only with --n_events 3:
batch_size: 2500 # track segments
event_batch_size: 2 # tpcs
You get the following - I have tried to make the print-outs self-explanatory:
******************
RUNNING SIMULATION
******************
Skipping non-active volumes... 0.20 s
Quenching electrons... 0.61 s
Drifting electrons... 0.20 s
Simulating batches...: 0%| | 0/105 [00:00<?, ?it/s]
PRINTING all_mod_tracks[sim.EVENT_SEPARATOR]
[1000 1000 1000 ... 1002 1002 1002]
PRINTING i_batch-1
0
PRITING np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])
[1000 1001 1002]
PRITING len(np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])):
3
Simulating batches...: 1%|▏ | 1/105 [00:04<07:03, 4.07s/it]
PRINTING all_mod_tracks[sim.EVENT_SEPARATOR]
[1000 1000 1000 ... 1002 1002 1002]
PRINTING i_batch-1
1
PRITING np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])
[1000 1001 1002]
PRITING len(np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])):
3
Simulating batches...: 2%|▍ | 2/105 [00:05<04:18, 2.51s/it]
PRINTING all_mod_tracks[sim.EVENT_SEPARATOR]
[1000 1000 1000 ... 1002 1002 1002]
PRINTING i_batch-1
2
PRITING np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])
[1000 1001 1002]
PRITING len(np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])):
3
Simulating batches...: 3%|▌ | 3/105 [00:05<03:10, 1.87s/it]
PRINTING all_mod_tracks[sim.EVENT_SEPARATOR]
[1000 1000 1000 ... 1002 1002 1002]
PRINTING i_batch-1
3
PRITING np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])
[1000 1001 1002]
PRITING len(np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])):
3
Traceback (most recent call last):
File "/global/u1/a/abooth/dune/Production/MicroProdN1p1/2x2_sim/run-larnd-sim/larnd.venv/bin/simulate_pixels.py", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/global/u1/a/abooth/dune/Production/MicroProdN1p1/2x2_sim/run-larnd-sim/larnd-sim/cli/simulate_pixels.py", line 1217, in <module>
fire.Fire(run_simulation)
File "/global/u1/a/abooth/dune/Production/MicroProdN1p1/2x2_sim/run-larnd-sim/larnd.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/global/u1/a/abooth/dune/Production/MicroProdN1p1/2x2_sim/run-larnd-sim/larnd.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/global/u1/a/abooth/dune/Production/MicroProdN1p1/2x2_sim/run-larnd-sim/larnd.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/global/u1/a/abooth/dune/Production/MicroProdN1p1/2x2_sim/run-larnd-sim/larnd-sim/cli/simulate_pixels.py", line 819, in run_simulation
ievd = np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])[i_batch-1]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
IndexError: index 3 is out of bounds for axis 0 with size 3
real 1m7.293s
user 0m11.653s
sys 0m51.578s
My understanding here is that np.unique(all_mod_tracks[sim.EVENT_SEPARATOR]) will return an array of unique "spill numbers". So entry 0 will be the first spill number, entry 1 the second, etc. The indexing of this array therefore has nothing to do with batching unless you are doing one batch per spill.
I haven't tried @YifanC 's solution yet.
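A tiny example of that behaviour (the spill numbers mirror the ones in the log above):

import numpy as np

spill_ids = np.array([1002, 1000, 1000, 1001, 1002])  # per-track spill numbers, in any order
print(np.unique(spill_ids))                           # [1000 1001 1002]
# Entry 0 is the first spill, entry 1 the second, and so on: the index is an
# event index, not a batch index, unless there is exactly one batch per spill.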
If I try @YifanC's solution verbatim I get a similar out of bounds error.
My understanding is this:
all_mod_tracks[sim.EVENT_SEPARATOR] is a list of event IDs / spill IDs with a length equal to the number of tracks in all modules.
np.unique(all_mod_tracks[sim.EVENT_SEPARATOR]) is a list of event IDs / spill IDs with a length equal to the number of events or spills.
i_batch runs from 0 to the number of batches per event * the number of events, which equals the number of modules * event_batch_size * the number of events if and only if event_batch_size is equal to the number of TPCs per module.
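Putting numbers to that for the run shown above (assuming the 70-TPC ndlar geometry quoted earlier in the thread):

n_tpc            = 70                                 # ndlar TPC count mentioned above
event_batch_size = 2                                  # from the simulation_properties above
n_events         = 3                                  # --n_events 3
batches_per_event = n_tpc // event_batch_size         # 35
total_batches     = batches_per_event * n_events      # 105, matching "0/105" in the log
print(batches_per_event, total_batches)
# np.unique(all_mod_tracks[sim.EVENT_SEPARATOR]) has only n_events = 3 entries,
# so indexing it with i_batch - 1 falls over once i_batch reaches 4, exactly the
# "index 3 is out of bounds for axis 0 with size 3" in the traceback above.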
Yifan has highlighted that the fix here is two-fold. We need a piece of code that:
1. gets the event_id for the set of masked tracks that you are looking at (track_subset = tracks[batch_mask]) and sets ievd to it;
2. handles batches that contain no track segments, so that light can still be simulated.
There may be something more subtle that I am missing, but I don't really understand why we have to use the batch indexing at all just to get the current event_id when that relationship is complex. Why not just
ievd = track_subset["event_id"][0]
to do 1., as originally suggested? It also means we are not hardcoding any values of "2" anywhere in the code to bite us later.
To do 2., as Yifan pointed out, the original solution would not simulate light in the case of no tracks, but I think we could just remove the
else:
continue
I think the other proposed solution would just fall over in the case of no tracks? Although I haven't thought too much about it.
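As a minimal standalone illustration of those two points (the structured array here is made up; only the event_id field name comes from the discussion above):

import numpy as np

# stand-in for tracks[batch_mask]: a few segments all belonging to spill 1001
tracks = np.zeros(4, dtype=[("event_id", "i4"), ("dE", "f4")])
tracks["event_id"] = 1001

track_subset = tracks                   # a batch that does contain segments
print(track_subset["event_id"][0])      # 1001: point 1 works directly, no batch arithmetic

empty_subset = tracks[:0]               # a batch with no segments at all
try:
    print(empty_subset["event_id"][0])
except IndexError:
    print("no segments: ievd has to come from somewhere else")   # this is point 2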
Hi @alexbooth92, thanks for the follow-up. Yeah, in 2x2 the light detector spans the entire detector (all modules are triggered together), so even if there's no "signal" we would need to write some readout for it. The trigger scheme is not settled for ndlar yet. It may not be the same, but I would like to maintain the possibility to run the 2x2 trigger scheme. If we simply remove
else:
continue
then, when ievd is required in the following code (for example, here), it will raise errors.
Can you provide some details about how, using ievd = np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])[ (i_batch * tpc_batch_size) // n_tpc_batch], you still get an out-of-bounds error, if you have that information ready? Otherwise, I can try something on my end. This definitely is not a unique solution, but it may yield a relatively small disturbance to other parts of the code. I'd like to understand what the issue is here.
Re. the out-of-bounds error with your solution, it might just be an off-by-one, but I can't check right now because Perlmutter is down. I think I understand what you have tried to do and I think it makes sense 🤔. That said, we should not be hard-coding anything; factors of 2 in code scare me.
At the risk of sounding like a broken record 😅 I still don't really understand why we need this complex bit of logic just to get the event ID...
You are right. My proposed solution only works when n_tpc_batch % event_batch_size == 0, and it is more restrictive than what the batching supports. Looking at the code, I think it is okay to set event_batch_size to 4 while having 70 TPCs in total, except for the part we are talking about. On the other hand, the only difference is at the end of an n_tpc_batch cycle. For example, batch 16 is TPCs 64-67, batch 17 is TPCs 68-69, and batch 18 is TPCs 0-3. That's my understanding of the batching part. Therefore, we can do something like
batch_cycle = n_tpc_batch // tpc_batch_size + 1   # 18 in the ndlar case
ievd = np.unique(all_mod_tracks[sim.EVENT_SEPARATOR])[i_batch // batch_cycle]
It should cover the case even when n_tpc_batch % event_batch_size != 0, although it's not very pretty. That being said, I haven't tested my understanding nor what I proposed above. I might try later today.
Hey @alexbooth92, oh I think I know why you get an out-of-bounds error for the change I suggested. When you get into the sub-batching controlled by batch_size (number of segments), you can have multiple batches per event within a module. For the same reason, what I proposed earlier today would not work either.
Sorry, and I totally agree with you that we can get the event_id like ievd = track_subset["event_id"][0]. We just need to figure out something for the case where there are no tracks. Do you mind if I try to push something to this branch?
Sure please do!
Hi @alexbooth92, it ended up as a one-liner. I added the event_id to the batch output, and I think this should be a general solution. I tested with a 2x2 file for 3 events by setting event_batch_size to 1 (it can be any number), and the batching behaves as expected. Can you confirm that it works with an ndlar file? If so, we can merge it.
I think this is a nice solution Yifan! I can confirm I no longer see the errors that I described above with the full NDLAr. Requesting permission again to merge :) @krwood
A couple of things in here needed to be fixed before a second pass of larnd-sim with the full ND geometry, spotted while debugging MiniProdN1.2.
While testing with a new round of files I came across a bug, summarised in issue #240. This PR will close #240.