Open mmusich opened 2 months ago
A new Issue was created by @mmusich.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign CommonTools/UtilAlgos
New categories assigned: reconstruction
@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
Since TFileService calls TFile::Open(), the I/O gets rerouted through our StorageFactory layer, which is also visible in the error message:
Warning in <TStorageFactoryFile::Write>: file root://eoscms.cern.ch//eos/cms/store/group/alca_trackeralign/AlignmentValidation/AlignmentValidation/2024_CDE_ReReco_mp3949_splitV_379525/SplitV/single/GT/compare2024/379525/SplitV.root not opened in write mode
With the StorageFactory, root:// URLs lead to our XrdAdaptor layer being used for the actual I/O. At a quick look, the XrdAdaptor code looks like it should be able to deal with writing files too, but I'd guess the writing part hasn't been tested much (since we use xrootd predominantly for reading data).
It may be worth noting here that writing to (CERN) EOS through the FUSE mount has an "interesting" behavior as well, see https://github.com/cms-sw/cmssw/issues/44369 (ROOT internally transforms the local-looking path into a root:// URL, while the StorageFactory layer continues to act as if the file were local).
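The routing described in this comment can be summarized in a small sketch. The function name `describe_io_route` and the returned labels are hypothetical and only restate what is said above; this is not CMSSW or ROOT code, and the `/eos/` prefix check is an assumption about how the FUSE mount appears in paths.

```python
from urllib.parse import urlsplit


def describe_io_route(file_name):
    """Return a rough label for which I/O path a TFile::Open() name would take.

    Hypothetical helper that only encodes the behavior described in the thread.
    """
    parts = urlsplit(file_name)
    if parts.scheme == "root":
        # Per the comment above: root:// URLs go through StorageFactory,
        # which hands the actual I/O to XrdAdaptor.
        return "xrootd via StorageFactory + XrdAdaptor"
    if parts.scheme in ("", "file"):
        if parts.path.startswith("/eos/"):
            # Assumption: the EOS FUSE mount is visible under /eos/.
            # See cms-sw/cmssw issue 44369 for the mismatch described above.
            return "EOS FUSE mount (ROOT may rewrite to root://)"
        return "local file"
    return "other protocol"
```

For the file in the warning, `describe_io_route("root://eoscms.cern.ch//eos/cms/store/x.root")` takes the first branch, while a file in `$PWD` takes the plain local branch.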
assign core
New categories assigned: core
@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
Would you be able to check whether adding
process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root")))
to the job configuration changes the behavior? (This prevents CMSSW from registering the StorageFactory + XrdAdaptor for the root protocol.)
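As a minimal sketch, the suggested line can be wrapped in a small helper so it is easy to switch on and off in a validation configuration. The helper name `disable_cmssw_root_adaptor` is hypothetical, and passing the `cms` module as an argument is only a convenience so the snippet can be imported outside a CMSSW environment; the service call itself is the one quoted above.

```python
def disable_cmssw_root_adaptor(process, cms):
    """Ask CMSSW not to register StorageFactory + XrdAdaptor for root:// URLs.

    `process` is the cms.Process of the job and `cms` is the module normally
    imported as `import FWCore.ParameterSet.Config as cms`.
    """
    process.add_(
        cms.Service("AdaptorConfig", native=cms.untracked.vstring("root"))
    )
    return process


# Usage inside a job configuration (requires a CMSSW environment;
# the process name "Validation" is a placeholder):
#
#   import FWCore.ParameterSet.Config as cms
#   process = cms.Process("Validation")
#   disable_cmssw_root_adaptor(process, cms)
```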
type root
@pcanal Could there be some error condition (or other assumption) in TStreamerInfo::ForceWriteInfo() that could lead it to segfault instead of reporting an error when writing to CERN EOS through xrootd?
I can see two possibilities. One is that the ROOT build being used does not have the code from https://github.com/root-project/root/pull/13842.
The other, more likely, is that writing to a file opened in read-only mode might not fail gracefully, i.e.
Warning in <TStorageFactoryFile::Write>: file root://eoscms.cern.ch//eos/....SplitV.root not opened in write mode
when/if the file was opened with "RECREATE" indicates that something 'bad' happened during the TFile::Open (and undefined behavior might be a consequence thereof).
One possibility is that the file is seen as non-writable (for example because of an issue with permissions) and that some part of the logic in or around TFile::Open is silently falling back to opening the file in read-only mode.
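One way to make such a silent fallback visible is to check the file right after opening it and before writing anything. The sketch below is duck-typed: it only assumes the object exposes `IsZombie()` and `IsWritable()`, as a ROOT `TFile` does. The helper name `ensure_writable` is hypothetical, and this is not what TFileService currently does.

```python
def ensure_writable(tfile, name="<unknown>"):
    """Raise instead of letting a later Write() hit a read-only or broken file.

    `tfile` is expected to behave like a ROOT TFile, for example the result of
    TFile.Open(name, "RECREATE") in PyROOT.
    """
    # A failed open yields a null (falsy) object in PyROOT, or a "zombie" file.
    if not tfile or tfile.IsZombie():
        raise OSError(f"could not open {name}")
    # A file that was silently opened read-only reports IsWritable() == False.
    if not tfile.IsWritable():
        raise OSError(f"{name} was opened, but not in write mode")
    return tfile


# Usage with PyROOT (requires ROOT; `url` is a placeholder):
#
#   import ROOT
#   f = ensure_writable(ROOT.TFile.Open(url, "RECREATE"), url)
```

Failing early like this turns the "not opened in write mode" situation into an explicit error at open time rather than undefined behavior during the write.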
Would you be able to check whether adding
process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root")))
to the job configuration changes the behavior? (This prevents CMSSW from registering the StorageFactory + XrdAdaptor for the root protocol.)
Indeed, adding this line to the configuration file prevents the segmentation fault. Thank you.
@mmusich is this issue solved? Thanks
@jfernan2
is this issue solved?
I am not sure. With the workaround at https://github.com/cms-sw/cmssw/issues/46024#issuecomment-2356593133 this particular instance of the problem is solved, though I can't say if that's a design feature or a bug.
I have a naive question concerning the expected behavior of TFileService when it's configured to (over-)write files on EOS. While trying to re-run some alignment-related jobs, @henriettepetersen reported a segmentation fault in SplitVertexResolution, stack trace below (a reproducer is available at /afs/cern.ch/work/h/hpeterse/public/splitV_seg_fault: copy the folder locally into any recent CMSSW release and run cmsRun validation_cfg.py config=validation.json).
The issue seems to be related to the fact that the file we're trying to write already exists with the same name at the same location. In particular, the segmentation fault originates here:
https://github.com/cms-sw/cmssw/blob/e54f434a789ce6ffaf824788a7b9bf79a50adf1a/Alignment/OfflineValidation/plugins/SplitVertexResolution.cc#L970
I can circumvent the issue by commenting out that line, but then when running I see the following warning:
What's somewhat puzzling to me is that when the output file location is local (e.g. $PWD), there is no issue whatsoever, even if the file already exists there. Also, I would have thought that due to this: https://github.com/cms-sw/cmssw/blob/5e10089258d0881a194af508a2e802093b3916be/CommonTools/UtilAlgos/src/TFileService.cc#L22
the file would have been overwritten anyway. Also, when trying to prepare a reproducer via a simple ROOT script:
I have found that with this I can overwrite the remote file as many times as I want. Am I missing something trivial?
Cc: @TomasKello