Open vuillaut opened 2 years ago
Hi @vuillaut , thanks for the feedback. This file is indeed corrupted: processing stopped at ~30k events (it is not clear to me why, but I will check). In this specific directory, out of 420 files I see that 2 of them (the one you mentioned in the message and run 229) have this issue. Are these the only 2 runs that cause problems?
Hi @Voutsi , here is a list of faulty DL1 files I could retrieve; you can deduce the faulty simtel files from them:
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/GammaDiffuse/dec_931/node_corsika_theta_37.428_az_114.435_/dl1_simtel_corsika_theta_37.428_az_114.435_run45.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/GammaDiffuse/dec_931/node_corsika_theta_26.36_az_133.808_/dl1_simtel_corsika_theta_26.36_az_133.808_run128.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_52.374_az_216.698_/dl1_simtel_corsika_theta_52.374_az_216.698_run287.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_52.374_az_197.973_/dl1_simtel_corsika_theta_52.374_az_197.973_run77.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_52.374_az_110.312_/dl1_simtel_corsika_theta_52.374_az_110.312_run215.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_43.197_az_230.005_/dl1_simtel_corsika_theta_43.197_az_230.005_run137.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_43.197_az_175.158_/dl1_simtel_corsika_theta_43.197_az_175.158_run224.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_60.528_az_146.4_/dl1_simtel_corsika_theta_60.528_az_146.4_run421.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_10.0_az_102.199_/dl1_simtel_corsika_theta_10.0_az_102.199_run407.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_43.197_az_262.712_/dl1_simtel_corsika_theta_43.197_az_262.712_run337.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_43.197_az_262.712_/dl1_simtel_corsika_theta_43.197_az_262.712_run373.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_931_std/Crab/dec_931/node_theta_14.984_az_355.158_/dl1_simtel_corsika_theta_14.984_az_355.158_run189.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_52.162_az_82.384_/dl1_simtel_corsika_theta_52.162_az_82.384_run25.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_37.661_az_270.641_/dl1_simtel_corsika_theta_37.661_az_270.641_run248.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_16.087_az_251.910_/dl1_simtel_corsika_theta_16.087_az_251.910_run176.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_66.446_az_284.017_/dl1_simtel_corsika_theta_66.446_az_284.017_run220.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_23.161_az_99.261_/dl1_simtel_corsika_theta_23.161_az_99.261_run212.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_37.661_az_89.359_/dl1_simtel_corsika_theta_37.661_az_89.359_run214.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_37.661_az_89.359_/dl1_simtel_corsika_theta_37.661_az_89.359_run505.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/Protons/dec_2276/node_theta_16.087_az_251.910_/dl1_simtel_corsika_theta_16.087_az_251.910_run961.h5
/fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/Protons/dec_2276/node_theta_16.087_az_251.910_/dl1_simtel_corsika_theta_16.087_az_251.910_run461.h5
dec_2276/node_theta_30.390_az_266.360_/dl1_simtel_corsika_theta_30.390_az_266.360_run80.h5
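Given a list like the one above, a first cheap triage pass could flag files that are not even valid HDF5 containers. The sketch below is an assumption about how one might do it: it only checks the 8-byte HDF5 superblock signature, so it catches gross corruption but not files that were truncated mid-write yet still carry a valid header (those need an event-count check with pytables or h5py).

```python
from pathlib import Path

# The 8-byte HDF5 superblock signature; a file missing it cannot
# be a valid HDF5 file at all.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def looks_like_hdf5(path):
    """Cheap sanity check: does the file start with the HDF5 signature?"""
    with open(path, "rb") as f:
        return f.read(8) == HDF5_SIGNATURE

def find_suspect_dl1(root):
    """Yield dl1_*.h5 files under root that fail the signature check."""
    for p in Path(root).rglob("dl1_*.h5"):
        if not looks_like_hdf5(p):
            yield p
```

Running `find_suspect_dl1` over a production directory would only surface the worst cases; a truncated-but-well-formed file still requires opening the tables.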
> This file is indeed corrupted, processing stopped at ~30k evt (not clear to me why, but i will check).
Was the CORSIKA run successful? Do you check for the END OF RUN marker in the CORSIKA log? According to the CORSIKA developers, that's the only way to make sure the CORSIKA run was complete. CORSIKA does not have an exit code != 0 in case of errors.
See e.g.:
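A log check along the lines described above could be sketched like this (a minimal version, assuming the marker string appears verbatim in the log):

```python
def corsika_run_completed(log_path):
    """Return True if the CORSIKA log contains the 'END OF RUN' marker.

    CORSIKA exits with status 0 even on failure, so (as noted in the
    thread) scanning the log for this marker is the only reliable
    completeness check.
    """
    with open(log_path, errors="replace") as f:
        return any("END OF RUN" in line for line in f)
```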
Hi @maxnoe , yes, the CORSIKA runs all finished successfully.
Do we have a simple way to check whether sim_telarray finished successfully?
dl1_simtel_corsika_theta_30.390_az_266.360_run80.h5
Why is there no particle type in this filename? And wouldn't leading zeros for the run number be preferable?
Yes, it should end with something like this:

```
Finish data conversion ...
Writing 10 histograms to output file.
Closed output file.
Finished.
Mean pedestal noise = 4.42
```
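A cheap completeness check based on that log tail could look like the following sketch; the exact marker string is an assumption taken from the excerpt above, so it should be verified against real sim_telarray logs:

```python
def simtel_log_finished(log_path, marker="Finished.", tail_bytes=4096):
    """Heuristic: look for the final marker in the last few kB of a
    sim_telarray log. Reading only the tail keeps this fast on /fefs."""
    with open(log_path, "rb") as f:
        f.seek(0, 2)                       # jump to end of file
        f.seek(max(0, f.tell() - tail_bytes))
        tail = f.read().decode(errors="replace")
    return marker in tail
```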
However, not all the files in the list you sent are truncated. Some of them finished successfully.
> However not all the files in the list you send are truncated. Some of them are successfully finished.
Could you give me one example, please? So I can re-run r0_to_dl1 and check where it fails.
For example /fefs/aswg/data/mc/DL1/AllSky/allsky_dec_2276_std/GammaDiffuse/dec_2276/node_corsika_theta_16.087_az251.910/dl1_simtel_corsika_theta_16.087_az_251.910_run176.h5
The corresponding simtel run finished successfully. (The file is /home/georgios.voutsinas/ws/AllSky/TrainingDataset/GammaDiffuse/dec_2276/sim_telarray/node_corsika_theta_16.087_az251.910/output_v1.4/simtel_corsika_theta_16.087_az_251.910_run176.simtel.gz)
Do we have a simple way to check whether sim_telarray finished successfully?
I will fish for truncated files and reproduce them.
@vuillaut From the call:
It seems you don't check lstchain's exit code in lstmcpipe. lstchain_mc_r0_to_dl1 should properly produce a non-zero exit code in case of error.
I don't agree that it should remove the output file in case of error, as that might hinder debugging. I think that should be dealt with in lstmcpipe.
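The pattern being asked for, checking the stage's exit code instead of ignoring it, could be sketched like this; `run_stage` is a hypothetical wrapper, not an lstmcpipe function, and what to do on failure (fail the job, quarantine the output rather than delete it) stays a policy decision:

```python
import subprocess
import sys

def run_stage(cmd):
    """Run one pipeline stage (e.g. lstchain_mc_r0_to_dl1) as a
    subprocess and surface its exit code instead of discarding it."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # Report the failure; the caller decides whether to abort the
        # job or keep the partial output around for debugging.
        print(f"stage {cmd[0]} failed with exit code {result.returncode}",
              file=sys.stderr)
    return result.returncode
```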
Looking at the simtel file of one of the problematic DL1 runs, I don't see any problem, so it might have been an intermittent disk / network issue.
Can you try rerunning the runs?
Also, it might be preferable to first copy the input files to a local scratch / tmp disk.
> Also, it might be preferable to first copy the input files to a local scratch / tmp disk.
I tested this for processing observed data, and the copying takes a time comparable to the processing itself. Still, I would prefer to do it, because we also suffer from I/O problems when the cluster is heavily loaded.
> I tested this for processing observed data, and the copying takes a comparable time to that of the processing itself.
For observed data this is less of a surprise, since zfits always reads 100 events en bloc (zfits tiles), so you don't gain much. But with sim_telarray files you have to do many small reads (or maybe I can tweak the buffer size of the buffered reader a bit...). I will check.
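The copy-to-local-scratch step discussed above could be sketched as follows; this is a minimal version assuming `$TMPDIR` points at node-local disk (as is typical on SLURM worker nodes):

```python
import os
import shutil
import tempfile
from pathlib import Path

def stage_in(src, scratch=None):
    """Copy an input file to node-local scratch before processing.

    Uses $TMPDIR unless an explicit scratch directory is given, and
    returns the local path the job should read from instead of /fefs.
    """
    scratch = Path(scratch or os.environ.get("TMPDIR", tempfile.gettempdir()))
    local = scratch / Path(src).name
    shutil.copy2(src, local)   # one big sequential read instead of many small ones
    return local
```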
Here is an error happening with the new MC files:
It results in a DL1 file with missing columns in the parameters table:
The simtel file is probably corrupted in some way (@voutsi), but it happens with a lot of files!
However, I think the file should not be written at all when such an error occurs.
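Detecting the missing-columns symptom described above could be done with a small helper like this sketch; the column names listed are plausible Hillas parameters given only as an example, and the real expected set should be taken from a known-good DL1 file:

```python
# Example Hillas-parameter column names; an assumption, not the
# authoritative lstchain schema.
EXPECTED_PARAMS = ["intensity", "length", "width", "skewness", "kurtosis"]

def missing_columns(available, expected=EXPECTED_PARAMS):
    """Return the expected parameter columns absent from a table."""
    return sorted(set(expected) - set(available))
```

With pandas one could then pass in `pd.read_hdf(path, key=...).columns`, where the HDF5 key of the parameters table must be checked against an actual DL1 file.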