Dr15Jones opened 2 weeks ago
A new Issue was created by @Dr15Jones.
@makortel, @sextonkennedy, @smuzaffar, @antoniovilela, @Dr15Jones, @rappoccio can you please review it and eventually sign/assign? Thanks.
The full log information can be found at https://cernbox.cern.ch/s/CQvjzkONaprof65; you should be able to find the WMTaskSpace dir in the vocms0253.cern.ch-283595-1 tarball.
assign root
assign core
type root
@pcanal Please take a look.
New categories assigned: core
@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
As a guess, I'd say the recursion comes from this line:
The file that appears to be opened is a premix file. As far as I can tell, the Embedded root source (used by the mixing module) does NOT make use of the lazy-download system, since that code does not call storage::StorageFactory::get()->stagein(...).
The recursion could also happen here, since TNetXNGFile can return a non-blank GetNewUrl.
Indeed, both places are missing a check that the new attempt is different from the current one...
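To illustrate the missing check, here is a minimal sketch of following "open failed, try this new URL" redirects with an explicit guard. It is not the actual ROOT or CMSSW code: `OpenAttempt`, `openFollowingRedirects`, `tryOpen` and `maxRedirects` are hypothetical names, and the recursive retry is swapped for an iterative loop with a set of already-tried URLs. The `newUrl` field only stands in for what a non-blank GetNewUrl would supply.

```cpp
#include <functional>
#include <optional>
#include <set>
#include <string>

// Hypothetical result of one open attempt: either success, or a URL the
// caller should try next (empty if there is none). newUrl plays the role
// of a non-blank GetNewUrl in this sketch.
struct OpenAttempt {
  bool ok;
  std::string newUrl;
};

// Tries to open `url`, following redirects in a loop instead of recursing.
// Gives up if the new URL is blank, equal to the current URL, or any URL
// already tried, or once more than maxRedirects redirects have been followed.
// Returns the URL that finally opened, or std::nullopt on failure.
std::optional<std::string> openFollowingRedirects(
    std::string url,
    const std::function<OpenAttempt(const std::string&)>& tryOpen,
    int maxRedirects = 10) {
  std::set<std::string> tried;
  for (int i = 0; i <= maxRedirects; ++i) {
    tried.insert(url);
    OpenAttempt attempt = tryOpen(url);
    if (attempt.ok) {
      return url;
    }
    // The guard discussed above: without it, a redirect back to the same
    // (or a previously tried) URL would retry forever.
    if (attempt.newUrl.empty() || tried.count(attempt.newUrl) != 0) {
      return std::nullopt;
    }
    url = attempt.newUrl;
  }
  return std::nullopt;  // redirect budget exhausted
}
```

The seen-set covers both the direct case (new URL equal to the current one) and longer cycles such as A → B → A, while the `maxRedirects` cap bounds a chain of ever-changing URLs; the loop keeps the call stack flat regardless of how many retries happen.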
Do we want to backport the fix, and if so, how far back? 10_6_X (where the issue occurred) might be too far back, but maybe 14_0_X would make sense?
A production job at RAL exceeded 70GB of memory and was killed by the site. A subsequent change to stop 'lazy download' appears to have avoided the problem.
Looking at the log file from one failed job, the traceback shows all threads but one waiting; the only active thread looks like:
Note that the debugger gave up after going 12 million frames deep into the stack (so the stack was likely even deeper).
This job was using CMSSW_10_6_17_patch1 and running