Open · mt82 opened this issue 3 weeks ago
From @vitodb: Checking the stderr of one of those jobs (55435449.0@jobsub03.fnal.gov), it has:
event: [1718074169135] DEST http_plugin CLEANUP 1
event: [1718074169135] BOTH http_plugin TRANSFER:EXIT ERROR: Copy failed (streamed). Last attempt: (Neon): Could not send request body: connection was closed by server (destination)
gfal-copy error: 6 (No such device or address) - TRANSFER ERROR: Copy failed (streamed). Last attempt: (Neon): Could not send request body: connection was closed by server (destination)
my_system: www_cp.sh /srv/errors-fae9cd9c-c2e6-4212-a0e4-9469f4902013.log https://fndcadoor.fnal.gov:2880/icarus/scratch/users/icaruspro/dropbox/mc1/dropbox/mc1/logs/34/d0/errors-fae9cd9c-c2e6-4212-a0e4-9469f4902013.log
-- error text: Copying 9224751376 bytes file:///srv/errors-fae9cd9c-c2e6-4212-a0e4-9469f4902013.log => https://fndcadoor.fnal.gov:2880/icarus/scratch/users/icaruspro/dropbox/mc1/dropbox/mc1/logs/34/d0/errors-fae9cd9c-c2e6-4212-a0e4-9469f4902013.log
It tried to copy back a log file of about 9 GB and the transfer failed; possibly this is what triggered the job failure. However, many jobs in the submission are failing with:
%MSG-s ArtException: PostEndJob 11-Jun-2024 07:50:23 UTC ModuleEndJob
---- OtherArt BEGIN
An exception was thrown while processing module GENIEGen/generator during beginJob
---- FatalRootError BEGIN
Fatal Root Error: TFile::ReadBuffer
error reading from file /cvmfs/sbn.osgstorage.org/pnfs/fnal.gov/usr/sbn/persistent/stash/physics/beam/GENIE/BNB/standard/v01_00/converted_beammc_icarus_0003.root Input/output error
ROOT severity: 5000
---- FatalRootError END
---- OtherArt END
%MSG
Anyway, if you don't need those huge logs, it would be better to disable their transfer to dCache.
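As a rough sketch of that suggestion (the wrapper function, paths, and size cap below are assumptions for illustration, not part of the actual www_cp.sh), the job script could refuse to copy back oversized logs instead of letting a multi-GB gfal-copy fail late in the job:

```shell
#!/bin/sh
# Sketch: skip copying back logs above a size cap. The 100 MB threshold
# and the copy_log helper are hypothetical; adapt to the real wrapper.
MAX_LOG_BYTES=104857600   # 100 MB cap (assumption)

copy_log() {
    src="$1"; dest="$2"
    # stat -c is GNU, stat -f %z is the BSD fallback
    size=$(stat -c %s "$src" 2>/dev/null || stat -f %z "$src")
    if [ "$size" -gt "$MAX_LOG_BYTES" ]; then
        echo "skipping $src: $size bytes exceeds cap, not copying to dCache" >&2
        return 0
    fi
    gfal-copy "file://$src" "$dest"
}
```

A variant would be to `gzip` or `tail` the log down to the cap rather than dropping it entirely, which keeps the most recent (usually most relevant) lines.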
From @francois-drielsma: OK, I looked at 10 of the stdout logs of failed jobs, and they all have:
%MSG-s ArtException: PostEndJob 11-Jun-2024 07:50:23 UTC ModuleEndJob
---- OtherArt BEGIN
An exception was thrown while processing module GENIEGen/generator during beginJob
---- FatalRootError BEGIN
Fatal Root Error: TFile::ReadBuffer
error reading from file /cvmfs/sbn.osgstorage.org/pnfs/fnal.gov/usr/sbn/persistent/stash/physics/beam/GENIE/BNB/standard/v01_00/converted_beammc_icarus_0003.root Input/output error
ROOT severity: 5000
---- FatalRootError END
---- OtherArt END
%MSG
so I think that's the core issue. If you look at the Kibana "Fifebatch History" dashboard (filtered on POMS_TASK_ID:1849462; the original link is broken), there seems to be a correlation between the number of concurrent jobs and the failure rate. Could it be that converted_beammc_icarus_0003.root
craps out when too many jobs are trying to access it? What is the remedy? Just tune down the maximum number of concurrent jobs? Has anyone seen this before?
EDIT: There are 1001 of those beam files, so concurrent reads of a single file seem unlikely. Maybe it's related to a specific subset of these files, but then I don't understand the correlation with the number of jobs running.
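One way to test the specific-subset hypothesis would be to tally which beam file each failed job was reading. A minimal sketch (the log directory and the exact error-line format are assumptions based on the excerpt above):

```shell
#!/bin/sh
# Sketch: count failures per beam file by grepping the failed jobs' stdout
# logs for the "error reading from file" line. LOGDIR is hypothetical.
LOGDIR="${LOGDIR:-./job_logs}"

grep -h 'error reading from file' "$LOGDIR"/*.out 2>/dev/null \
    | sed 's/.*\(converted_beammc_icarus_[0-9]*\.root\).*/\1/' \
    | sort | uniq -c | sort -rn
```

If a handful of files dominate the counts, the problem is likely with those files (or their placement on the stash/CVMFS backend) rather than with concurrency.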
I am getting swaths of jobs transiently killed in the BNB nue campaign and I'm not sure why (I got 0 failures in the test, and also 0 failures in prod for the first two hours; see the Kibana "Fifebatch History" dashboard filtered on POMS_TASK_ID:1849462, as the original link is broken). It seems like art completed, but there's another issue?