cta-observatory / cta-lstchain

LST prototype testbench chain
https://cta-observatory.github.io/cta-lstchain/
BSD 3-Clause "New" or "Revised" License

error in `r0_to_dl1` should not write dl1 file #978

Open vuillaut opened 2 years ago

vuillaut commented 2 years ago

Here is an error happening with the new MC files:

lstchain_mc_r0_to_dl1 -f /fefs/aswg/workspace/georgios.voutsinas/AllSky/TrainingDataset/Protons/dec_931/sim_telarray/node_corsika_theta_31.589_az_122.714_/output_v1.4/simtel_corsika_theta_31.589_az_122.714_run192.simtel.gz -o . -c lstchain_conf.json 
Found duplicated column obs_id, skipping
Found duplicated column event_id, skipping
Traceback (most recent call last):
  File "/fefs/aswg/software/conda/envs/lstchain-v0.9.6/bin/lstchain_mc_r0_to_dl1", line 8, in <module>
    sys.exit(main())
  File "/fefs/aswg/software/conda/envs/lstchain-v0.9.6/lib/python3.8/site-packages/lstchain/scripts/lstchain_mc_r0_to_dl1.py", line 74, in main
    r0_to_dl1.r0_to_dl1(
  File "/fefs/aswg/software/conda/envs/lstchain-v0.9.6/lib/python3.8/site-packages/lstchain/reco/r0_to_dl1.py", line 412, in r0_to_dl1
    for i, event in enumerate(source):
  File "/fefs/aswg/software/conda/envs/lstchain-v0.9.6/lib/python3.8/site-packages/ctapipe/io/eventsource.py", line 278, in __iter__
    for event in self._generator():
  File "/fefs/aswg/software/conda/envs/lstchain-v0.9.6/lib/python3.8/site-packages/ctapipe/io/simteleventsource.py", line 376, in _generator
    yield from self._generate_events()
  File "/fefs/aswg/software/conda/envs/lstchain-v0.9.6/lib/python3.8/site-packages/ctapipe/io/simteleventsource.py", line 393, in _generate_events
    for counter, array_event in enumerate(self.file_):
  File "/fefs/aswg/software/conda/envs/lstchain-v0.9.6/lib/python3.8/site-packages/eventio/simtel/simtelfile.py", line 256, in iter_array_events
    self.next_low_level()
  File "/fefs/aswg/software/conda/envs/lstchain-v0.9.6/lib/python3.8/site-packages/eventio/simtel/simtelfile.py", line 153, in next_low_level
    self.current_mc_shower = o.parse()
  File "/fefs/aswg/software/conda/envs/lstchain-v0.9.6/lib/python3.8/site-packages/eventio/simtel/objects.py", line 1341, in parse
    mc['altitude'] = read_float(byte_stream)
  File "/fefs/aswg/software/conda/envs/lstchain-v0.9.6/lib/python3.8/site-packages/eventio/tools.py", line 28, in read_float
    return struct.unpack('<f', f.read(4))[0]
struct.error: unpack requires a buffer of 4 bytes

It results in a DL1 file with missing columns in the parameters table:

disp_dx
disp_dy
disp_norm
disp_angle
disp_sign
src_x
src_y
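A quick way to flag such files is to compare the column names of the parameters table against the list above. This is only a minimal sketch: the helper name `missing_columns` is hypothetical, and reading the column names out of the DL1 HDF5 file is left to whatever reader is at hand.

```python
# Columns reported missing in this issue; the helper itself is illustrative.
REQUIRED_COLUMNS = (
    "disp_dx", "disp_dy", "disp_norm", "disp_angle", "disp_sign",
    "src_x", "src_y",
)


def missing_columns(column_names, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent, in their listed order."""
    present = set(column_names)
    return [name for name in required if name not in present]
```

An empty result means all of the listed columns are present; it says nothing about whether the table is otherwise complete.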

The simtel file is probably corrupted in some way (@voutsi), but this happens with a lot of files!

However, I think the file should not be written at all when such an error occurs.

Voutsi commented 2 years ago

Hi @vuillaut, thanks for the feedback. This file is indeed corrupted: processing stopped at ~30k events (not clear to me why, but I will check). In this specific directory, out of 420 files, I see that 2 of them (the one you mentioned in your message and run 229) have this issue. Are these the only 2 runs that cause problems?

vuillaut commented 2 years ago

Hi @Voutsi, here is a list of faulty DL1 files I could retrieve; you can deduce the faulty simtel files from it:

/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/GammaDiffuse/dec_931/node_corsika_theta_37.428_az_114.435_/dl1_simtel_corsika_theta_37.428_az_114.435_run45.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/GammaDiffuse/dec_931/node_corsika_theta_26.36_az_133.808_/dl1_simtel_corsika_theta_26.36_az_133.808_run128.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_52.374_az_216.698_/dl1_simtel_corsika_theta_52.374_az_216.698_run287.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_52.374_az_197.973_/dl1_simtel_corsika_theta_52.374_az_197.973_run77.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_52.374_az_110.312_/dl1_simtel_corsika_theta_52.374_az_110.312_run215.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_43.197_az_230.005_/dl1_simtel_corsika_theta_43.197_az_230.005_run137.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_43.197_az_175.158_/dl1_simtel_corsika_theta_43.197_az_175.158_run224.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_60.528_az_146.4_/dl1_simtel_corsika_theta_60.528_az_146.4_run421.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_10.0_az_102.199_/dl1_simtel_corsika_theta_10.0_az_102.199_run407.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_43.197_az_262.712_/dl1_simtel_corsika_theta_43.197_az_262.712_run337.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_43.197_az_262.712_/dl1_simtel_corsika_theta_43.197_az_262.712_run373.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_14.984_az_355.158_/dl1_simtel_corsika_theta_14.984_az_355.158_run189.h5

/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_52.162_az_82.384_/dl1_simtel_corsika_theta_52.162_az_82.384_run25.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_37.661_az_270.641_/dl1_simtel_corsika_theta_37.661_az_270.641_run248.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_16.087_az_251.910_/dl1_simtel_corsika_theta_16.087_az_251.910_run176.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_66.446_az_284.017_/dl1_simtel_corsika_theta_66.446_az_284.017_run220.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_23.161_az_99.261_/dl1_simtel_corsika_theta_23.161_az_99.261_run212.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_37.661_az_89.359_/dl1_simtel_corsika_theta_37.661_az_89.359_run214.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_37.661_az_89.359_/dl1_simtel_corsika_theta_37.661_az_89.359_run505.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/Protons/dec_2276/node_theta_16.087_az_251.910_/dl1_simtel_corsika_theta_16.087_az_251.910_run961.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/Protons/dec_2276/node_theta_16.087_az_251.910_/dl1_simtel_corsika_theta_16.087_az_251.910_run461.h5

vuillaut commented 2 years ago

dec_2276/node_theta_30.390_az_266.360_/dl1_simtel_corsika_theta_30.390_az_266.360_run80.h5

maxnoe commented 2 years ago

This file is indeed corrupted: processing stopped at ~30k events (not clear to me why, but I will check).

Was the CORSIKA run successful? Do you check for the END OF RUN marker in the CORSIKA log? According to the CORSIKA developers, that's the only way to make sure the CORSIKA run was complete. CORSIKA does not have an exit code != 0 in case of errors.

See e.g.:

https://github.com/fact-project/mopro3/blob/c5c0e47c40002292b12119c32056c04564585b06/mopro/processing/run_corsika.py#L93-L100
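As a minimal sketch of that check (not the linked mopro3 code): the function below looks for the marker in the tail of a CORSIKA log file. The function name, the `tail_bytes` default, and the assumption that the marker appears literally as `END OF RUN` near the end of the log are illustrative.

```python
from pathlib import Path


def corsika_run_completed(log_path, marker="END OF RUN", tail_bytes=65536):
    """Return True if `marker` appears in the last `tail_bytes` of the log."""
    with Path(log_path).open("rb") as log:
        log.seek(0, 2)                       # jump to the end of the file
        size = log.tell()
        log.seek(max(0, size - tail_bytes))  # rewind by at most tail_bytes
        tail = log.read()
    return marker.encode("ascii") in tail
```

Reading only the tail keeps the check cheap for large logs; an empty or cut-off log returns `False`.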

Voutsi commented 2 years ago

Hi @maxnoe, yes, the CORSIKA files all finished successfully.

vuillaut commented 2 years ago

Do we have a simple way to check whether sim_telarray finished successfully?

maxnoe commented 2 years ago

dl1_simtel_corsika_theta_30.390_az_266.360_run80.h5

Why is there no particle in this filename? And wouldn't having leading zeros for the run be preferable?

Voutsi commented 2 years ago

Yes, a successful sim_telarray run should end with something like this: Finish data conversion ... Writing 10 histograms to output file. Closed output file. Finished. Mean pedestal noise = 4.42

Voutsi commented 2 years ago

However, not all the files in the list you sent are truncated. Some of them finished successfully.

vuillaut commented 2 years ago

However, not all the files in the list you sent are truncated. Some of them finished successfully.

Could you give me one example, please? Then I can re-run r0_to_dl1 and check where it fails.

Voutsi commented 2 years ago

However, not all the files in the list you sent are truncated. Some of them finished successfully.

Could you give me one example, please? Then I can re-run r0_to_dl1 and check where it fails.

For example: /fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_16.087_az_251.910_/dl1_simtel_corsika_theta_16.087_az_251.910_run176.h5

The corresponding simtel run finished successfully. (The file is /home/georgios.voutsinas/ws/AllSky/TrainingDataset/GammaDiffuse/dec_2276/sim_telarray/node_corsika_theta_16.087_az_251.910_/output_v1.4/simtel_corsika_theta_16.087_az_251.910_run176.simtel.gz)

Voutsi commented 2 years ago

Do we have a simple way to check whether sim_telarray finished successfully?

I will fish for truncated files and reproduce them.
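One possible way to fish for them, as a minimal sketch: read each `.simtel.gz` file through to the end and see whether the gzip stream terminates cleanly. The function name is hypothetical, and the check only catches truncation of the compressed stream itself; a file that was closed properly after sim_telarray stopped early would still pass.

```python
import gzip
import os
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file is non-empty and decompresses to its end."""
    if os.path.getsize(path) == 0:
        return False
    try:
        with gzip.open(path, "rb") as stream:
            # Discard the data; we only care whether reading raises.
            while stream.read(chunk_size):
                pass
    except (EOFError, zlib.error, gzip.BadGzipFile):
        # EOFError: stream ended before the end-of-stream marker.
        return False
    return True
```

This decompresses the whole file, so it costs about as much I/O as a full read of each run.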

maxnoe commented 2 years ago

@vuillaut From the call:

It seems you don't check lstchain's exit code in lstmcpipe. lstchain_mc_r0_to_dl1 should properly produce an exit code != 0 in case of error.

I don't agree that it should remove the output file in case of error as that might hinder debugging. That should be dealt with in lstmcpipe I think.
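A minimal sketch of what handling this on the pipeline side could look like (not lstmcpipe's actual code): run the command, check the exit code, and on failure rename the output rather than delete it, so it stays available for debugging but is not picked up downstream. The helper name and the `.failed` suffix are hypothetical.

```python
import subprocess
from pathlib import Path


def run_and_quarantine(command, output_file):
    """Run `command`; if it exits non-zero, rename the output to *.failed."""
    result = subprocess.run(command)
    output = Path(output_file)
    if result.returncode != 0 and output.exists():
        # Keep the partial file for debugging under a different name.
        output.replace(output.with_name(output.name + ".failed"))
    return result.returncode
```

Here `command` would be the full argument list, for example `["lstchain_mc_r0_to_dl1", "-f", input_file, "-o", output_dir, "-c", config]`, and `output_file` the DL1 path the caller expects to be written.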

Looking at the simtel file of one of the problematic dl1 runs, I don't see any problem so it might have been an intermittent disk / network problem.

Can you try rerunning the runs?

Also, it might be preferable to first copy the input files to a local scratch / tmp disk.
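A minimal sketch of that idea, assuming the node has a local scratch directory: copy the input into a temporary directory, process the copy, and clean up afterwards. The helper name `local_copy` and the `scratch_dir` parameter are hypothetical.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def local_copy(input_path, scratch_dir=None):
    """Yield the path of a temporary local copy of `input_path`.

    The copy lives in a fresh temporary directory under `scratch_dir`
    (the system default if None) and is removed when the block exits.
    """
    source = Path(input_path)
    with tempfile.TemporaryDirectory(dir=scratch_dir) as tmp:
        target = Path(tmp) / source.name
        shutil.copy2(source, target)
        yield target
```

Used as `with local_copy(simtel_file, scratch_dir="/tmp") as local_file: ...`, the processing step then reads `local_file` instead of the shared file system.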

morcuended commented 2 years ago

Also, it might be preferable to first copy the input files to a local scratch / tmp disk.

I tested this for processing observed data, and the copying takes a comparable time to that of the processing itself. Still, I would prefer to do it because we also suffer from I/O problems when the cluster is heavily used.

maxnoe commented 2 years ago

I tested this for processing observed data, and the copying takes a comparable time to that of the processing itself.

For data, this is less of a surprise, since zfits always reads 100 events en bloc (zfits tiles), so you don't gain much. But with sim_telarray files, you have to do many small reads (or maybe I can tweak the buffer size of the buffered reader a bit...). I will check.