LDMX-Software / LDCS

Lightweight Distributed Computing System repo

Corrupted file for kaons #23

Closed by tvami 2 weeks ago

tvami commented 2 weeks ago

When running on the kaons I get a segfault (see below).

I think we should just delete this file:

 /fs/ddn/sdf/group/ldmx/data/mc24/v14-8gev/8.0GeV/v3.3.3_kaon-batch1/mc_v14-8gev-8.0GeV-1e-ecal_photonuclear_kaon_run1391_t1709367110.root 

@bryngemark can you please do that?

Error in <TFile::Init>: file /fs/ddn/sdf/group/ldmx/data/mc24/v14-8gev/8.0GeV/v3.3.3_kaon-batch1/mc_v14-8gev-8.0GeV-1e-ecal_photonuclear_kaon_run1391_t1709367110.root is truncated at 238026752 bytes: should be 239397538, trying to recover
Info in <TFile::Recover>: /fs/ddn/sdf/group/ldmx/data/mc24/v14-8gev/8.0GeV/v3.3.3_kaon-batch1/mc_v14-8gev-8.0GeV-1e-ecal_photonuclear_kaon_run1391_t1709367110.root, recovered key TTree:LDMX_Events at address 236590381
Warning in <TFile::Init>: successfully recovered 1 keys
Error in <TFile::ReadBuffer>: error reading all requested bytes from file /fs/ddn/sdf/group/ldmx/data/mc24/v14-8gev/8.0GeV/v3.3.3_kaon-batch1/mc_v14-8gev-8.0GeV-1e-ecal_photonuclear_kaon_run1391_t1709367110.root, got 0 of 8254
Warning in <TFile::GetRecordHeader>: /fs/ddn/sdf/group/ldmx/data/mc24/v14-8gev/8.0GeV/v3.3.3_kaon-batch1/mc_v14-8gev-8.0GeV-1e-ecal_photonuclear_kaon_run1391_t1709367110.root: failed to read the StreamerInfo data from disk.
Error in <TTreeReader::TTreeReader>: No TTree called LDMX_Run was found in the selected TDirectory.
Error in <TFile::ReadBuffer>: error reading all requested bytes from file /fs/ddn/sdf/group/ldmx/data/mc24/v14-8gev/8.0GeV/v3.3.3_kaon-batch1/mc_v14-8gev-8.0GeV-1e-ecal_photonuclear_kaon_run1391_t1709367110.root, got 0 of 8254
Warning in <TFile::GetRecordHeader>: /fs/ddn/sdf/group/ldmx/data/mc24/v14-8gev/8.0GeV/v3.3.3_kaon-batch1/mc_v14-8gev-8.0GeV-1e-ecal_photonuclear_kaon_run1391_t1709367110.root: failed to read the StreamerInfo data from disk.

 *** Break *** segmentation violation

===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00007f9bd1e793ea in __GI___wait4 (pid=933911, stat_loc=stat_loc@entry=0x7ffd69a4b1d8, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
#1  0x00007f9bd1e793ab in __GI___waitpid (pid=<optimized out>, stat_loc=stat_loc@entry=0x7ffd69a4b1d8, options=options@entry=0) at ./posix/waitpid.c:38
#2  0x00007f9bd1ddfbdb in do_system (line=<optimized out>) at ../sysdeps/posix/system.c:171
#3  0x00007f9bd2510334 in TUnixSystem::StackTrace() () from /usr/local/lib/libCore.so
#4  0x00007f9bd250d665 in TUnixSystem::DispatchSignals(ESignals) () from /usr/local/lib/libCore.so
#5  <signal handler called>
#6  0x00007f9bd10ef00e in TBranchElement::FindOnfileInfo(TClass*, TObjArray const&) const () from /usr/local/lib/libTree.so
#7  0x00007f9bd10f9fee in TBranchElement::InitInfo() () from /usr/local/lib/libTree.so
#8  0x00007f9bd10ead72 in TBranchElement::SetupAddressesImpl() () from /usr/local/lib/libTree.so
#9  0x00007f9bd10f4152 in TBranchElement::GetEntry(long long, int) () from /usr/local/lib/libTree.so
#10 0x00007f9bd1159c85 in TTree::GetEntry(long long, int) () from /usr/local/lib/libTree.so
#11 0x00007f9bd27fcbcc in framework::EventFile::nextEvent(bool) () from /sdf/home/t/tamasvami/CutBasedDM/ldmx-sw/install/lib/libFramework.so
#12 0x00007f9bd2816689 in framework::Process::run() () from /sdf/home/t/tamasvami/CutBasedDM/ldmx-sw/install/lib/libFramework.so
#13 0x00005652cb04f11d in main ()
===========================================================
bryngemark commented 2 weeks ago

Hi @tvami, I can delete it. Is this the only corrupted file in the kaon dataset?

tvami commented 2 weeks ago

Yes. I should rename the issue to drop the plural; I was expecting a longer list, but this seems to be the only one. Thanks for deleting it, @bryngemark!

tvami commented 2 weeks ago

I actually have a bigger picture question too: when LDCS is used and you have several files as input, but one of them is corrupted, how is that handled? Because for me I got this segfault and my other dozens of files were ignored since the final output file was bad.

tvami commented 2 weeks ago

File is deleted now

bryngemark commented 2 weeks ago

> I actually have a bigger picture question too: when LDCS is used and you have several files as input, but one of them is corrupted, how is that handled? Because for me I got this segfault and my other dozens of files were ignored since the final output file was bad.

This happens with LDCS too. Fixing it so that we can deal gracefully with corrupted files would help a lot; thanks for working on that.
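One possible way to keep a single bad input from sinking a whole job is to screen the input list before passing it on. The sketch below is not part of LDCS or ldmx-sw; the helper names `root_file_is_truncated` and `split_inputs` are hypothetical. It only catches the specific symptom in the log above (the file on disk is shorter than the end offset `fEND` recorded in the ROOT file header) and assumes the standard ROOT header layout: the 4-byte magic `root`, then big-endian `fVersion`, `fBEGIN`, and `fEND`, with `fEND` stored as 64 bits when `fVersion` is 1000000 or larger. Other kinds of corruption would pass this check.

```python
import os
import struct


def root_file_is_truncated(path):
    """Return True if the file is not a ROOT file or is shorter than its header's fEND."""
    with open(path, "rb") as f:
        header = f.read(20)
    # A ROOT file starts with the 4-byte magic "root".
    if len(header) < 20 or header[:4] != b"root":
        return True
    # fVersion and fBEGIN are big-endian 32-bit integers.
    version, _begin = struct.unpack(">ii", header[4:12])
    # Large files (fVersion >= 1000000) store fEND as 64 bits, others as 32 bits.
    if version >= 1000000:
        (end,) = struct.unpack(">q", header[12:20])
    else:
        (end,) = struct.unpack(">i", header[12:16])
    # Same comparison that produces the "is truncated at ... should be ..." message.
    return os.path.getsize(path) < end


def split_inputs(paths):
    """Split paths into (good, bad); unreadable or missing files count as bad."""
    good, bad = [], []
    for p in paths:
        try:
            truncated = root_file_is_truncated(p)
        except OSError:
            truncated = True
        (bad if truncated else good).append(p)
    return good, bad
```

A job wrapper could call `split_inputs` on its file list, run only on the `good` entries, and report the `bad` ones so they can be deleted or regenerated.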