cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.07k stars 4.28k forks source link

Stuck 12_2_X jobs in Tier0 #37637

Closed makortel closed 1 year ago

makortel commented 2 years ago

The JIRA ticket https://its.cern.ch/jira/browse/CMSTZ-976 reports also some 12_2_X jobs getting stuck. Initial investigation shows that it is xrdcp following a successful cmsRun that gets stuck. I'm opening a separate issue to #37572 since the cause is different (and here I'm a bit afraid if we couldn't do much).

makortel commented 2 years ago

assign core

makortel commented 2 years ago

The wmagentJob.log ends with

2022-04-20 16:14:30,990:INFO:StageOutImpl:Running the stage out...

and the stack trace of the xrdcp is

Thread 6 (Thread 0x2b3ede185700 (LWP 29199) "xrdcp"):
#0  0x00002b3edac69b3b in do_futex_wait.constprop () from target:/lib64/libpthread.so.0
#1  0x00002b3edac69bcf in __new_sem_wait_slow.constprop.0 () from target:/lib64/libpthread.so.0
#2  0x00002b3edac69c6b in sem_wait@@GLIBC_2.2.5 () from target:/lib64/libpthread.so.0
#3  0x00002b3ed90d2ba6 in XrdCl::JobManager::RunJobs() () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdCl.so.3
#4  0x00002b3ed90d2c79 in RunRunnerThread () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdCl.so.3
#5  0x00002b3edac63ea5 in start_thread () from target:/lib64/libpthread.so.0
#6  0x00002b3ed9c94b0d in clone () from target:/lib64/libc.so.6

Thread 5 (Thread 0x2b3eddf84700 (LWP 29198) "xrdcp"):
#0  0x00002b3edac69b3b in do_futex_wait.constprop () from target:/lib64/libpthread.so.0
#1  0x00002b3edac69bcf in __new_sem_wait_slow.constprop.0 () from target:/lib64/libpthread.so.0
#2  0x00002b3edac69c6b in sem_wait@@GLIBC_2.2.5 () from target:/lib64/libpthread.so.0
#3  0x00002b3ed90d2ba6 in XrdCl::JobManager::RunJobs() () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdCl.so.3
#4  0x00002b3ed90d2c79 in RunRunnerThread () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdCl.so.3
#5  0x00002b3edac63ea5 in start_thread () from target:/lib64/libpthread.so.0
#6  0x00002b3ed9c94b0d in clone () from target:/lib64/libc.so.6

Thread 4 (Thread 0x2b3eddd83700 (LWP 29197) "xrdcp"):
#0  0x00002b3edac69b3b in do_futex_wait.constprop () from target:/lib64/libpthread.so.0
#1  0x00002b3edac69bcf in __new_sem_wait_slow.constprop.0 () from target:/lib64/libpthread.so.0
#2  0x00002b3edac69c6b in sem_wait@@GLIBC_2.2.5 () from target:/lib64/libpthread.so.0
#3  0x00002b3ed90d2ba6 in XrdCl::JobManager::RunJobs() () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdCl.so.3
#4  0x00002b3ed90d2c79 in RunRunnerThread () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdCl.so.3
#5  0x00002b3edac63ea5 in start_thread () from target:/lib64/libpthread.so.0
#6  0x00002b3ed9c94b0d in clone () from target:/lib64/libc.so.6

Thread 3 (Thread 0x2b3eddb82700 (LWP 29196) "xrdcp"):
#0  0x00002b3ed9c5b9fd in nanosleep () from target:/lib64/libc.so.6
#1  0x00002b3ed95c98a8 in XrdSysTimer::Wait(int) () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdUtils.so.3
#2  0x00002b3ed9055765 in XrdCl::TaskManager::RunTasks() () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdCl.so.3
#3  0x00002b3ed90558e9 in RunRunnerThread () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdCl.so.3
#4  0x00002b3edac63ea5 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00002b3ed9c94b0d in clone () from target:/lib64/libc.so.6

Thread 2 (Thread 0x2b3edd981700 (LWP 29195) "xrdcp"):
#0  0x00002b3ed9c950e3 in epoll_wait () from target:/lib64/libc.so.6
#1  0x00002b3ed95c3c32 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdUtils.so.3
#2  0x00002b3ed95c047d in XrdSys::IOEvents::BootStrap::Start(void*) () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdUtils.so.3
#3  0x00002b3ed95c8f47 in XrdSysThread_Xeq () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdUtils.so.3
#4  0x00002b3edac63ea5 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00002b3ed9c94b0d in clone () from target:/lib64/libc.so.6

Thread 1 (Thread 0x2b3edd363a80 (LWP 29194) "xrdcp"):
#0  0x00002b3edac69b3b in do_futex_wait.constprop () from target:/lib64/libpthread.so.0
#1  0x00002b3edac69bcf in __new_sem_wait_slow.constprop.0 () from target:/lib64/libpthread.so.0
#2  0x00002b3edac69c6b in sem_wait@@GLIBC_2.2.5 () from target:/lib64/libpthread.so.0
#3  0x00002b3ed90ad176 in (anonymous namespace)::XRootDDestination::PutChunk(XrdCl::PageInfo&&) () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdCl.so.3
#4  0x00002b3ed90b79d8 in XrdCl::ClassicCopyJob::Run(XrdCl::CopyProgressHandler*) () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdCl.so.3
#5  0x00002b3ed909f450 in (anonymous namespace)::QueuedCopyJob::Run(void*) () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdCl.so.3
#6  0x00002b3ed90a11f0 in XrdCl::CopyProcess::Run(XrdCl::CopyProgressHandler*) () from target:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.61-1/el7-x86_64/usr/lib64/libXrdCl.so.3
#7  0x000000000040b45e in main ()

The process appears to be waiting on something.

cmsbuild commented 2 years ago

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 2 years ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

belforte commented 2 years ago

sounds like WMAgent should wrap the xrdcp call with some timeout ? Writing to remote storage may surely get stuck for reasons outside the job control.

germanfgv commented 2 years ago

@makortel I found another instance of the issue in 12_3_2, which includes the forkHandler fix, so it is likely due to this same issue. I put cmsRun and wmagentJob logs here:

/afs/cern.ch/user/c/cmst0/public/LongExecution/WorkerNodeLogs/29304.147

but I'm not sure how to get the xrdcp stack trace.

This affected a single job in a replay with around 30k jobs, so it is clear that this is a very rare issue.

makortel commented 2 years ago

Thanks @germanfgv. The cmsRun log ends with

Begin processing the 2270th record. Run 349840, Event 1314189019, LumiSection 833 on stream 1 at 29-Apr-2022 16:47:02.024 CEST
[2022-04-29 16:50:03.010170 +0200][Error  ][XRootD            ][  594] [st-048-bbcc9ba3.cern.ch:1095] Unable to get the response to request kXR_read (handle: 0x00000000, offset: 3051662952, size: 8388608)
[2022-04-29 16:50:17.820832 +0200][Error  ][File              ][  594] [0x537f80@root://eoscms.cern.ch:1094//eos/cms/tier0/store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/78ba1e2e-d1b6-42be-ad5c-da0eaad0b00f.root?eos.app=cmst0&xrdcl.requuid=298fa3c7-59a9-4c7a-a060-3170079f81f4] Fatal file state error. Message kXR_read (handle: 0x00000000, offset: 3051662952, size: 8388608) returned with [ERROR] Operation expired
%MSG-w XrdAdaptorInternal:  (NoModuleName) 29-Apr-2022 16:50:17 CEST pre-events
XrdRequestManager::handle(name='root://eoscms.cern.ch//eos/cms/tier0/store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/78ba1e2e-d1b6-42be-ad5c-da0eaad0b00f.root?eos.app=cmst0) failure when reading from st-048-bbcc9ba3.cern.ch:1095 (unknown site); failed with error '[ERROR] Operation expired' (errno=0, code=206).
%MSG
%MSG-w XrdAdaptorInternal:  (NoModuleName) 29-Apr-2022 16:50:17 CEST pre-events
Request failure when reading from st-048-bbcc9ba3.cern.ch:1095 (unknown site)
%MSG
%MSG-w XrdAdaptor:  (NoModuleName) 29-Apr-2022 16:50:17 CEST pre-events
Data is now served from  instead of previous cern.ch
%MSG
%MSG-w XrdAdaptorInternal:  (NoModuleName) 29-Apr-2022 16:50:17 CEST pre-events
Caught a CMSSW exception when running connection recovery.
%MSG
[2022-04-30 09:29:28.940534 +0200][Error  ][AsyncSock         ][  594] [st-048-cc819fd7.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-04-30 10:33:00.356438 +0200][Error  ][AsyncSock         ][  594] [eoscms.cern.ch:1094.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable

i.e. the cmsRun itself is stuck (symptoms looke like https://github.com/cms-sw/cmssw/issues/37572). If the stuck jobs in 12_3_2 (or later) occur again, I'd be interested to take a look (e.g. via attaching gdb to running process as we did before).

makortel commented 1 year ago

+core

Seems like this issue could be closed

cmsbuild commented 1 year ago

This issue is fully signed and ready to be closed.