cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Question on how `TFileService` is supposed to interact with `eos` #46024

Open mmusich opened 2 months ago

mmusich commented 2 months ago

I have a naive question concerning the expected behavior of TFileService when it's configured to (over-)write files on eos. While trying to re-run some alignment-related jobs, @henriettepetersen reported a segmentation fault in SplitVertexResolution (stack trace below):

Thread 1 (Thread 0x7f0c5f6c8640 (LWP 947722) "cmsRun"):
#0  0x00007f0c5e9019ff in poll () from /lib64/libc.so.6
#1  0x00007f0c5a5bf09f in full_read.constprop () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f0c5a5744ec in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f0c5a574670 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f0c5ff8ba47 in TStreamerInfo::ForceWriteInfo(TFile*, bool) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el9_amd64_gcc12/lib/libRIO.so
#6  0x00007f0c607642ae in TTree::BuildStreamerInfo(TClass*, void*, bool) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el9_amd64_gcc12/lib/libTree.so
#7  0x00007f0c60775f72 in TTree::BronchExec(char const*, char const*, void*, bool, int, int) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el9_amd64_gcc12/lib/libTree.so
#8  0x00007f0bffa4acf5 in SplitVertexResolution::beginJob() () from /afs/cern.ch/cms/CAF/CMSALCA/ALCA_TRACKERALIGN/data/commonValidation/legacy_2024_releases/CMSSW_14_0_14/lib/el9_amd64_gcc12/pluginAlignmentOfflinevalidationPlugins.so
#9  0x00007f0c60c1d322 in edm::Worker::beginJob() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el9_amd64_gcc12/libFWCoreFramework.so
#10 0x00007f0c60c21a59 in edm::WorkerManager::beginJob(edm::ProductRegistry const&, edm::eventsetup::ESRecordsToProductResolverIndices const&, edm::ProcessBlockHelperBase const&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el9_amd64_gcc12/libFWCoreFramework.so
#11 0x00007f0c60b4079f in edm::EventProcessor::beginJob() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el9_amd64_gcc12/libFWCoreFramework.so
#12 0x000000000040746c in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#13 0x00007f0c5fd8096d in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/arena.cpp:688
#14 0x0000000000408ee2 in main::{lambda()#1}::operator()() const ()
#15 0x000000000040517c in main ()

(A reproducer is available at /afs/cern.ch/work/h/hpeterse/public/splitV_seg_fault: copy the folder locally into any recent CMSSW release area and then run cmsRun validation_cfg.py config=validation.json.)

The issue seems to be related to the fact that the file that we're trying to write already exists with the same name at the same location. In particular the segmentation fault originates here:

https://github.com/cms-sw/cmssw/blob/e54f434a789ce6ffaf824788a7b9bf79a50adf1a/Alignment/OfflineValidation/plugins/SplitVertexResolution.cc#L970

I can circumvent the issue by commenting out that line, but then when running I see the following warning:

Warning in <TStorageFactoryFile::Write>: file root://eoscms.cern.ch//eos/cms/store/group/alca_trackeralign/AlignmentValidation/AlignmentValidation/2024_CDE_ReReco_mp3949_splitV_379525/SplitV/single/GT/compare2024/379525/SplitV.root not opened in write mode

What's somewhat puzzling to me is that when the output file path is local (e.g. in $PWD), there is no issue whatsoever, even if the file already exists there. Also, I would have thought that due to this:

https://github.com/cms-sw/cmssw/blob/5e10089258d0881a194af508a2e802093b3916be/CommonTools/UtilAlgos/src/TFileService.cc#L22

the file would have been overwritten anyway. Also when trying to prepare a reproducer via a simple ROOT script:

#include "TFile.h"
#include "TTree.h"
#include <iostream>
#include "Alignment/OfflineValidation/src/pvTree.h"
#include "PhysicsTools/FWLite/interface/TFileService.h"
#include <vector>
#include <string>

int test_TTreeEOS() {
  // Define the file path to EOS (replace with your EOS path)
  const std::string eosFilePath = "/eos/cms/store/group/alca_trackeralign/musich/test.root";

  fwlite::TFileService outfile_ = fwlite::TFileService(eosFilePath);

  // Create a TTree and a branch  
  pvEvent event_;
  event_.pvs.clear();
  event_.nVtx = -1;

  TTree* tree_ = outfile_.make<TTree>("pvTree", "pvTree");
  tree_->Branch("event", &event_, 64000, 2);

  return 0;
}

I have found that with this I can overwrite the remote file as many times as I want. Am I missing something trivial?

Cc: @TomasKello

cmsbuild commented 2 months ago

cms-bot internal usage

cmsbuild commented 2 months ago

A new Issue was created by @mmusich.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 2 months ago

assign CommonTools/UtilAlgos

cmsbuild commented 2 months ago

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 2 months ago

Since

https://github.com/cms-sw/cmssw/blob/5e10089258d0881a194af508a2e802093b3916be/CommonTools/UtilAlgos/src/TFileService.cc#L22

calls TFile::Open(), the I/O gets rerouted through our StorageFactory layer, which is also visible in the error message

Warning in <TStorageFactoryFile::Write>: file root://eoscms.cern.ch//eos/cms/store/group/alca_trackeralign/AlignmentValidation/AlignmentValidation/2024_CDE_ReReco_mp3949_splitV_379525/SplitV/single/GT/compare2024/379525/SplitV.root not opened in write mode

With the StorageFactory, root:// URLs cause our XrdAdaptor layer to be used for the actual I/O. On a quick look, the XrdAdaptor code looks like it should be able to deal with writing files too, but I'd guess the writing part hasn't been tested much (since we use xrootd predominantly for reading data).

It may be worth noting here that writing to (CERN) EOS through the FUSE mount has an "interesting" behavior as well, see https://github.com/cms-sw/cmssw/issues/44369 (ROOT internally transforms the local-looking path into a root:// URL, while the StorageFactory layer continues to act as if the file were local).

makortel commented 2 months ago

assign core

cmsbuild commented 2 months ago

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 2 months ago

Would you be able to try if adding

process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root")))

to the job configuration would impact the behavior? (This prevents CMSSW from registering the StorageFactory + XrdAdaptor for the root protocol.)

makortel commented 2 months ago

type root

makortel commented 2 months ago

@pcanal Could there be some error condition (or other assumption) in TStreamerInfo::ForceWriteInfo() that could lead it to segfault instead of reporting an error when writing to CERN EOS through xrootd?

pcanal commented 2 months ago

I can see two possibilities. One is that the ROOT build being used does not have the code from https://github.com/root-project/root/pull/13842.

The other, more likely, is that writing to a file opened in read-only mode might not be failing gracefully, i.e.

Warning in <TStorageFactoryFile::Write>: file root://eoscms.cern.ch//eos/....SplitV.root not opened in write mode

when/if the file was opened with "RECREATE" indicates that something 'bad' happened during the TFile::Open (and undefined behavior might be a consequence thereof).

One possibility is that the file is seen/thought-of as non-writeable (for example issue with permissions) and that some part of the logic in or around TFile::Open is silently falling back to opening the file in read-only mode.
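One way to probe this hypothesis is to check the state of the file handle right after it is opened. The sketch below is a hypothetical helper (`check_writable` is not part of ROOT or CMSSW); it relies only on the `IsZombie()` and `IsWritable()` methods that ROOT's `TFile` provides, and is duck-typed so it can be applied to whatever handle is at hand.

```python
def check_writable(tfile, name):
    """Return tfile if it is usable for writing, otherwise raise OSError.

    `tfile` is expected to behave like a ROOT TFile handle: falsy when the
    open failed outright, and exposing IsZombie() and IsWritable().
    `name` is only used to build the error message.
    """
    if not tfile:
        raise OSError(f"{name}: open returned no file handle")
    if tfile.IsZombie():
        raise OSError(f"{name}: file handle is a zombie")
    if not tfile.IsWritable():
        # This is the silent read-only fallback suspected above.
        raise OSError(f"{name}: file is not opened in write mode")
    return tfile


# Possible usage in a PyROOT session (requires ROOT; `url` is a placeholder):
#   import ROOT
#   out = check_writable(ROOT.TFile.Open(url, "RECREATE"), url)
```

Note that a plain PyROOT session does not go through the CMSSW StorageFactory layer, so a check like this is only informative for the problem at hand when applied to the handle actually used inside the cmsRun job.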

mmusich commented 2 months ago

Would you be able to try if adding

process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root")))

to the job configuration would impact the behavior? (This prevents CMSSW from registering the StorageFactory + XrdAdaptor for the root protocol.)

Indeed, adding this line to the configuration file prevents the segmentation fault. Thank you.
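For reference, a minimal sketch of where the line sits in a job configuration. The process name and the output path are placeholders chosen for illustration, not the ones from the actual validation job; only the `AdaptorConfig` line is taken from the suggestion above.

```python
import FWCore.ParameterSet.Config as cms

# Hypothetical process name, for illustration only.
process = cms.Process("Demo")

# TFileService writing to EOS through a root:// URL.
# The path below is a placeholder; substitute the real output location.
process.TFileService = cms.Service(
    "TFileService",
    fileName=cms.string("root://eoscms.cern.ch//eos/cms/store/user/<username>/SplitV.root"),
)

# Workaround from this thread: let ROOT handle the "root" protocol natively,
# so that CMSSW does not register StorageFactory + XrdAdaptor for it.
process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root")))
```

This fragment only runs inside a CMSSW environment (via cmsRun) and omits the source, modules and paths that a real job needs.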

jfernan2 commented 2 weeks ago

@mmusich is this issue solved? Thanks

mmusich commented 2 weeks ago

@jfernan2

is this issue solved?

I am not sure. With the workaround at https://github.com/cms-sw/cmssw/issues/46024#issuecomment-2356593133 this particular instance of the problem is solved, though I can't say if that's a design feature or a bug.