DUNE / larnd-sim

Simulation framework for a pixelated Liquid Argon TPC
Apache License 2.0

Save to file after each event #58

Closed soleti closed 2 years ago

soleti commented 2 years ago

This PR changes how we save the simulation output: it is now written to file after each event rather than once at the end of the full simulation. Fixes issue #57, at the cost of being slightly less efficient, since the results have to be copied from GPU memory after each event.
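
For readers unfamiliar with the pattern, per-event saving can be sketched with a resizable HDF5 dataset that grows by one event at a time. This is a minimal illustration, not the code in this PR: the `packets` dataset name, the two-field dtype, and the helpers `append_event`, `simulate_event`, and `run` are all assumptions made for the example, and the GPU-to-host copy is replaced by a plain NumPy array.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical packet layout; the real larnd-sim dtype is not shown in this thread.
PACKET_DTYPE = np.dtype([("event_id", "u4"), ("adc", "u1")])


def append_event(path, dataset_name, packets):
    """Append one event's packets to a resizable dataset, creating it if needed."""
    with h5py.File(path, "a") as f:
        if dataset_name not in f:
            f.create_dataset(
                dataset_name,
                shape=(0,),
                maxshape=(None,),  # unlimited along the first axis
                dtype=packets.dtype,
                chunks=True,
            )
        if len(packets) == 0:
            return  # empty event: nothing to write
        dset = f[dataset_name]
        start = dset.shape[0]
        dset.resize((start + len(packets),))
        dset[start:] = packets


def simulate_event(event_id, n_packets):
    """Stand-in for the simulation: returns host-side packets for one event."""
    packets = np.zeros(n_packets, dtype=PACKET_DTYPE)
    packets["event_id"] = event_id
    return packets


def run(path, packets_per_event):
    """Simulate each event and flush it to disk before moving to the next one."""
    for event_id, n_packets in enumerate(packets_per_event):
        append_event(path, "packets", simulate_event(event_id, n_packets))


# Three events, the second of which is empty.
path = os.path.join(tempfile.mkdtemp(), "events.h5")
run(path, packets_per_event=[3, 0, 2])
with h5py.File(path, "r") as f:
    event_ids = f["packets"]["event_id"].tolist()
```

Because the file is closed after every event, a crash partway through leaves the events written so far on disk, which is the trade-off against the extra per-event copy.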

soleti commented 2 years ago

@peter-madigan for some reason I can't add you as a reviewer, but I would appreciate it if you could take a quick look.

chenel commented 2 years ago

Unfortunately the sometimes-empty events of the official "ND-LAr+TMS" simulation seem to be causing a problem here:


  File "cli/simulate_pixels.py", line 417, in <module>
    fire.Fire(run_simulation)
  File "/usr/local/lib/python3.8/dist-packages/fire-0.4.0-py3.8.egg/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/dist-packages/fire-0.4.0-py3.8.egg/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/dist-packages/fire-0.4.0-py3.8.egg/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "cli/simulate_pixels.py", line 380, in run_simulation
    event_id_list_batch = np.concatenate(event_id_list, axis=0)
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate

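
The error above is what `np.concatenate` raises when it is handed an empty list, which is what an event with no batches would produce. A minimal sketch of one possible guard follows; the helper name `concat_event_ids` and the `uint32` default are assumptions for illustration, and this is not necessarily how the PR resolved it.

```python
import numpy as np


def concat_event_ids(event_id_list, dtype=np.uint32):
    """Concatenate per-batch event IDs, tolerating an event with no batches."""
    if len(event_id_list) == 0:
        # np.concatenate([]) raises ValueError, so return an empty array instead.
        return np.empty(0, dtype=dtype)
    return np.concatenate(event_id_list, axis=0)


empty = concat_event_ids([])
merged = concat_event_ids([np.array([1, 2]), np.array([3])])
```

Downstream code can then check the array's `size` and skip the export step for empty events.
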
soleti commented 2 years ago

Can you try now @chenel?

chenel commented 2 years ago

Progress! I now get through the first 262 events of my ~10K event sample. Unfortunately I'm running out of memory now on my ~11GB VRAM GPU. :(

I'm going to try on a machine with a better GPU (more VRAM), but I'm posting this here in case it's evidence that something else is wrong...

chenel commented 2 years ago

sad panda. about 30% through file (event 2634/8581):

Traceback (most recent call last):
  File "cli/simulate_pixels.py", line 418, in <module>
    fire.Fire(run_simulation)
  File "/usr/local/lib/python3.8/dist-packages/fire-0.4.0-py3.8.egg/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/dist-packages/fire-0.4.0-py3.8.egg/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/dist-packages/fire-0.4.0-py3.8.egg/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "cli/simulate_pixels.py", line 387, in run_simulation
    _, _, last_time = fee.export_to_hdf5(event_id_list_batch,
  File "/gpfs/slac/staas/fs1/g/neutrino/jwolcott/app/larnd-sim/larndsim/fee.py", line 207, in export_to_hdf5
    io_group = detector.MODULE_TO_IO_GROUPS[module_id][io_group-1]
KeyError: 0

I don't see any other output for this particular event.
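
One reading of the `KeyError: 0` is that a key of 0 reached the lookup without having an entry in the mapping. The sketch below shows a defensive lookup under that assumption. The mapping contents and the helper `lookup_io_group` are hypothetical; the real `detector.MODULE_TO_IO_GROUPS` is built from the detector configuration and is not shown in this thread.

```python
# Hypothetical mapping with 1-based module IDs, for illustration only.
MODULE_TO_IO_GROUPS = {1: [1, 2], 2: [3, 4]}


def lookup_io_group(module_id, io_group):
    """Return the mapped IO group, or None if the module ID is not in the mapping."""
    groups = MODULE_TO_IO_GROUPS.get(module_id)
    if groups is None:
        # An unassigned module ID (such as 0) is skipped rather than raising KeyError.
        return None
    return groups[io_group - 1]


missing = lookup_io_group(0, 1)
found = lookup_io_group(1, 2)
```

Returning `None` only hides the symptom at export time; the entries that carry an unassigned ID still need to be filtered where they are produced.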

peter-madigan commented 2 years ago

Sorry for the slow response - I don't have my computer with me this week, but I'll take a look as soon as I'm back.

soleti commented 2 years ago

@chenel can you send me the path of your input file? When it crashes, does the file contain the events simulated so far?

chenel commented 2 years ago

(For the record, the file was sent via Slack. There is an output file, which is generally healthy, but it's missing the tracks product. Apparently that's still being saved at the end.)

soleti commented 2 years ago

OK, there was a missing check in the pixel finding algorithm. It should work now; let me know if it doesn't.

chenel commented 2 years ago

I'll set a test running.

chenel commented 2 years ago

So close!

Simulating events...: 100%|███████████████| 8581/8581 [2:08:07<00:00,  1.12it/s]
Traceback (most recent call last):
  File "cli/simulate_pixels.py", line 413, in <module>
    fire.Fire(run_simulation)
  File "/usr/local/lib/python3.8/dist-packages/fire-0.4.0-py3.8.egg/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/dist-packages/fire-0.4.0-py3.8.egg/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/dist-packages/fire-0.4.0-py3.8.egg/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "cli/simulate_pixels.py", line 404, in run_simulation
    output_file['configs'].attrs['pixel_layout'] = pixel_layout
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/group.py", line 288, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
ValueError: Invalid location identifier (invalid location identifier)

Did I miss updating something somehow?

soleti commented 2 years ago

Oops, I forgot to open the file before writing a config, now it should work 🤞
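
For context, h5py reports an invalid location identifier when an object is accessed through a file handle that is no longer open. A minimal sketch of reopening the file before writing a config attribute follows. The helper `write_config_attr`, the use of a `configs` group, and the attribute value are assumptions for illustration, not the actual change in this PR.

```python
import os
import tempfile

import h5py


def write_config_attr(path, key, value):
    """Reopen the output file in append mode and store one config attribute."""
    with h5py.File(path, "a") as output_file:
        # require_group creates 'configs' if it does not exist yet.
        configs = output_file.require_group("configs")
        configs.attrs[key] = value


path = os.path.join(tempfile.mkdtemp(), "out.h5")
write_config_attr(path, "pixel_layout", "layout.yaml")  # placeholder value
with h5py.File(path, "r") as f:
    stored = f["configs"].attrs["pixel_layout"]
```

Using a `with` block for every write keeps the handle's lifetime explicit, so a write cannot silently target a file that was closed after the event loop.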

chenel commented 2 years ago

Victory at last! Finished successfully and the file seems to be healthy. 🎉 (I don't understand why there are 21616 packets with a packet_type of 7, given there are only 10K events in the edep-sim file; I thought these were supposed to be event boundaries only? But unless it's likely to be evidence that something went wrong in saving, we can move the discussion elsewhere.)

soleti commented 2 years ago

Those are trigger packets, not just event dividers, so you can have more than one per event. I'll merge this and investigate further later.
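
A quick way to sanity-check packet counts like the one quoted above is to tally the `packet_type` column. This sketch uses a made-up array rather than a real output file, and the helper `count_packet_types` is hypothetical; only the value 7 comes from the discussion.

```python
import numpy as np

TRIGGER_PACKET_TYPE = 7  # value quoted in the discussion above


def count_packet_types(packet_types):
    """Return a {packet_type: count} dict for an array of packet types."""
    values, counts = np.unique(packet_types, return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}


# Made-up packet types, for illustration only.
example_types = np.array([0, 7, 0, 7, 7, 4])
tally = count_packet_types(example_types)
n_triggers = tally.get(TRIGGER_PACKET_TYPE, 0)
```

Comparing `n_triggers` with the number of simulated events shows directly whether some events produced more than one trigger packet.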