SBNSoftware / icarus-production

The repository is intended to support ICARUS production activities
GNU General Public License v3.0

jobs transiently killed in the BNB nue campaign #21

Open mt82 opened 3 weeks ago

mt82 commented 3 weeks ago

I am getting a swath of jobs transiently killed in the BNB nue campaign and I'm not sure why (I got 0 fails on the test and also 0 fails in prod for the first two hours; see the Fifebatch History dashboard in Kibana, filtered on POMS_TASK_ID:1849462). It seems like art completed, but there's another issue?

Art has completed and will exit with status 0.
executable was killed: exiting 1
Tue Jun 11 04:17:17 UTC 2024 fife_wrap COMPLETED with exit status 1
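
For context on why a successful art run can still end with exit status 1: one plausible mechanism, consistent with the copy-back failure reported in the next comment, is that the wrapper folds the status of the output/log transfer into its final exit code. The sketch below is illustrative only, not fife_wrap's actual logic; the fcl name and destination URL are placeholders.

```bash
#!/bin/bash
# Illustrative only -- not fife_wrap's actual code. It shows how a wrapper
# can report exit status 1 even though the art payload exited 0: the final
# status also folds in the exit code of the output/log copy-back.

DEST_URL="https://fndcadoor.fnal.gov:2880/icarus/scratch/..."   # placeholder destination

lar -c job.fcl                              # art payload; fcl name is a placeholder
art_status=$?

www_cp.sh /srv/errors-*.log "$DEST_URL"     # copy logs back to dCache
copy_status=$?

# Non-zero if either the payload or the copy-back failed.
if [ "$art_status" -ne 0 ] || [ "$copy_status" -ne 0 ]; then
    exit 1
fi
exit 0
```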
mt82 commented 3 weeks ago

From @vitodb: checking the stderr of one of those jobs (55435449.0@jobsub03.fnal.gov), it has

event: [1718074169135] DEST   http_plugin   CLEANUP 1
event: [1718074169135] BOTH   http_plugin   TRANSFER:EXIT   ERROR: Copy failed (streamed). Last attempt: (Neon): Could not send request body: connection was closed by server (destination)
gfal-copy error: 6 (No such device or address) - TRANSFER ERROR: Copy failed (streamed). Last attempt: (Neon): Could not send request body: connection was closed by server (destination)
my_system: www_cp.sh  /srv/errors-fae9cd9c-c2e6-4212-a0e4-9469f4902013.log https://fndcadoor.fnal.gov:2880/icarus/scratch/users/icaruspro/dropbox/mc1/dropbox/mc1/logs/34/d0/errors-fae9cd9c-c2e6-4212-a0e4-9469f4902013.log
 -- error text: Copying 9224751376 bytes file:///srv/errors-fae9cd9c-c2e6-4212-a0e4-9469f4902013.log => https://fndcadoor.fnal.gov:2880/icarus/scratch/users/icaruspro/dropbox/mc1/dropbox/mc1/logs/34/d0/errors-fae9cd9c-c2e6-4212-a0e4-9469f4902013.log

It tried to copy back a log file of about 9 GB and the transfer failed; possibly this is what triggered the job failure. However, many jobs in the submission are failing with:

%MSG-s ArtException:  PostEndJob 11-Jun-2024 07:50:23 UTC ModuleEndJob
---- OtherArt BEGIN
  An exception was thrown while processing module GENIEGen/generator during beginJob
  ---- FatalRootError BEGIN
    Fatal Root Error: TFile::ReadBuffer
    error reading from file /cvmfs/sbn.osgstorage.org/pnfs/fnal.gov/usr/sbn/persistent/stash/physics/beam/GENIE/BNB/standard/v01_00/converted_beammc_icarus_0003.root Input/output error
    ROOT severity: 5000
  ---- FatalRootError END
---- OtherArt END
%MSG

Anyhow, if you don't need those huge logs, it would be better to disable their transfer to dCache.
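
One way to act on that suggestion is sketched below as a hypothetical guard, not part of the production scripts: check the log size before the copy-back and either skip the transfer or keep only the tail, so a multi-GB errors-*.log cannot fail the whole job. The 100 MB cap, the tail length, and the reuse of the www_cp.sh call and dCache path from the stderr above are all assumptions.

```bash
#!/bin/bash
# Hypothetical guard: refuse to ship an oversized error log to dCache as-is.
# Paths and destination mirror the job's stderr and are assumptions.

LOG=/srv/errors-fae9cd9c-c2e6-4212-a0e4-9469f4902013.log
DEST=https://fndcadoor.fnal.gov:2880/icarus/scratch/users/icaruspro/dropbox/mc1/dropbox/mc1/logs/34/d0/$(basename "$LOG")
MAX_BYTES=$((100 * 1024 * 1024))     # 100 MB cap, tune as needed

size=$(stat -c %s "$LOG")
if [ "$size" -gt "$MAX_BYTES" ]; then
    # Keep only the last 10000 lines so the copy stays small; alternatively,
    # skip the transfer entirely by exiting here.
    echo "log is ${size} bytes, truncating to the last 10000 lines before copy"
    tail -n 10000 "$LOG" > "${LOG}.tail" && mv "${LOG}.tail" "$LOG"
fi

www_cp.sh "$LOG" "$DEST"
```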

mt82 commented 3 weeks ago

From @francois-drielsma: OK, I looked at 10 of the stdout logs of failed jobs and they all have:

%MSG-s ArtException:  PostEndJob 11-Jun-2024 07:50:23 UTC ModuleEndJob
---- OtherArt BEGIN
  An exception was thrown while processing module GENIEGen/generator during beginJob
  ---- FatalRootError BEGIN
    Fatal Root Error: TFile::ReadBuffer
    error reading from file /cvmfs/sbn.osgstorage.org/pnfs/fnal.gov/usr/sbn/persistent/stash/physics/beam/GENIE/BNB/standard/v01_00/converted_beammc_icarus_0003.root Input/output error
    ROOT severity: 5000
  ---- FatalRootError END
---- OtherArt END
%MSG

so I think that's the core issue. If you look at Kibana (the same Fifebatch History dashboard, filtered on POMS_TASK_ID:1849462), there seems to be a correlation between the number of concurrent jobs and the failures. Could it be that converted_beammc_icarus_0003.root craps out when too many jobs are trying to access it? What is the remedy? Just tune down the maximum number of concurrent jobs? Has anyone seen this before? EDIT: There are 1001 of those beam files, so concurrent reads seem unlikely. Maybe it's related to a specific subset of these files, but then I don't understand the correlation with the number of running jobs. (edited)
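
One way to probe the "specific subset of files" hypothesis would be a spot-check like the sketch below (an assumption, not something run as part of this campaign): try to open a handful of the beam files from the same CVMFS stash path that appears in the error and see whether ROOT reports errors consistently for particular files or only intermittently.

```bash
#!/bin/bash
# Hypothetical spot-check: try opening a few of the converted_beammc files
# with ROOT and flag any that produce error messages. The base path is the
# one quoted in the FatalRootError; the file indices are arbitrary examples.
BASE=/cvmfs/sbn.osgstorage.org/pnfs/fnal.gov/usr/sbn/persistent/stash/physics/beam/GENIE/BNB/standard/v01_00

for i in 0001 0002 0003 0004 0005; do
    f="${BASE}/converted_beammc_icarus_${i}.root"
    # 'root -l -b -q <file>' opens the file non-interactively; TFile problems
    # are printed as "Error ..." messages on stdout/stderr.
    out=$(root -l -b -q "$f" 2>&1)
    if echo "$out" | grep -qi "error"; then
        echo "FAILED  $f"
    else
        echo "OK      $f"
    fi
done
```

Opening alone may not reproduce a TFile::ReadBuffer failure that only appears mid-read, so a deeper check would also loop over the trees in each file, but this at least separates persistently broken files from transient stash-cache load.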