Open makortel opened 2 months ago
assign core
New categories assigned: core
@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
cms-bot internal usage
A new Issue was created by @makortel.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
Here are some examples of the present setenv()
calls
https://github.com/cms-sw/cmssw/blob/6e31ebe803aa3cd54ba56999036e36202a849dbc/PhysicsTools/TensorFlow/src/TensorFlow.cc#L90-L98
that is called from e.g.
https://github.com/cms-sw/cmssw/blob/6e31ebe803aa3cd54ba56999036e36202a849dbc/L1Trigger/L1CaloTrigger/plugins/L1NNCaloTauEmulator.cc#L720
https://github.com/cms-sw/cmssw/blob/6e31ebe803aa3cd54ba56999036e36202a849dbc/L1Trigger/L1CaloTrigger/plugins/L1NNCaloTauProducer.cc#L585
https://github.com/cms-sw/cmssw/blob/6e31ebe803aa3cd54ba56999036e36202a849dbc/RecoMuon/TrackerSeedGenerator/plugins/TSGForOIDNN.cc#L211
I tested that modifications to the os.environ
in the cfg.py
file are visible to the C++ side of cmsRun
.
One option would be to implement the environment variable setting purely in the python side. I think the solution would have to
Actually the next step should be a more detailed analysis of the existing use cases of setting the environment variables in CMSSW code.
We should also find a way to get an understanding if any 3rd party code calls setenv()
(in the parallel section of cmsRun
). Possible ways could be e.g. a gdb script or LD_PRELOADable library to override setenv()
.
It seems to me the most likely setenv()
call to cause problems is the one in tensorflow::setLogging()
https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/PhysicsTools/TensorFlow/src/TensorFlow.cc#L90-L98
Unsafe calls to tensorflow::setLogging()
:
initializeGlobalCache()
HGCalConcentratorAutoEncoderImpl
L1NNTauProducer::initializeGlobalCache()
PFECALSuperClusterProducer::initializeGlobalCache()
reco::SCProducerCache
constructor that calls DeepSCGraphEvaluation
constructor that calls the setLogging()
. * TSGForOIDNN
L2TauNNProducer::initializeGlobalCache()
L2TauNNProducerAlpaka::initializeGlobalCache()
endLuminosityBlock()
DTOccupancyTestML::dqmEndLuminosityBlock()
dqmEndLuminosityBlock()
. It would be more efficient to load the graph and create the session in the constructor.produce()
Possibly unsafe call:
testGraphLoadingCUDA::test()
setLogging()
before any inference calls.Acceptable calls (in single-threaded test code in PhysicsTools/TensorFlow
)
testConstSessionCUDA::test()
testGraphLoadingCUDA::test()
testHelloWorld::test()
testHelloWorldCUDA::test()
testMetaGraphLoadingCUDA::test()
testSessionCacheCUDA::test()
testVisibleDevicesCUDA::test()
Problems with this pattern
tensorflow::setLogging()
makes it very easy to add innocent-looking but unsafe codeTherefore, I would remove the tensorflow::setLogging()
, and set the TF_CPP_MIN_LOG_LEVEL=3
(or 2
if strongly motivated) in the tensorflow
scram tool file. @smuzaffar, @cms-sw/ml-l2
RivetAnalyzer::beginJob()
sets RIVET_REF_PATH
and RIVET_INFO_PATH
if they are not already set.
https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/GeneratorInterface/RivetInterface/plugins/RivetAnalyzer.cc#L53-L73
Would it be feasible to set RIVET_REF_PATH
and RIVET_INFO_PATH
in the rivet
scram tool file? @smuzaffar, @cms-sw/generators-l2
spu::Unzip
resets the TMPDIR
if it is longer than 50 characters
https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/GeneratorInterface/SherpaInterface/src/SherpackUtilities.cc#L154-L161
Seems that this code was added in https://github.com/cms-sw/cmssw/pull/21682 to work around a "feature" of OpenMPI 2.0 and 2.1 (more information in https://github.com/cms-sw/cmssw/pull/21419, https://hypernews.cern.ch/HyperNews/CMS/get/edmFramework/3807/1/1/1.html, https://www.open-mpi.org/faq/?category=osx#startup-errors-with-open-mpi-2.0.x).
Our OpenMPI version is now 4.1.6, maybe this workaround is no longer needed? @cms-sw/generators-l2
For future reference, the spu::Unzip
is called via
MPIService
constructor sets the OMPI_MCA_pmix_server_uri
from a configuration parameter, if such parameter exists.
https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/HeterogeneousCore/MPIServices/src/MPIService.cc#L46-L51
As Services are constructed in the serial part of cmsRun
, this setenv()
call is ok-ish.
SiStripConfigDb::usingDatabase()
resets TNS_ADMIN
https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/OnlineDB/SiStripConfigDb/src/SiStripConfigDb.cc#L260-L292
potentially modifying an existing value or a value from configuration.
The usingDatabase()
is called via
SiStripConfigDb::openDbConnection()
, but only if UsingDb
configuraton parameter is set to True
, and UsingDbCache
is set to FalseSiStripConfigDb
constructorSiStripConfigDb
is defined as a Service, and since Services are constructed in the serial part of cmsRun
, this setenv()
call is ok-ish.
main()
in OnlineDB/EcalCondDB/test/LaserSeqToDB.cpp
sets MESTORE
https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/OnlineDB/EcalCondDB/test/LaserSeqToDB.cpp#L1015
This program looks serial up to that point (unless ROOT would spawn threads), so is safe.
CastorShowerLibraryMaker::update()
has a commented out call
https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/SimG4CMS/ShowerLibraryProducer/plugins/CastorShowerLibraryMaker.cc#L686-L696
to set the CASTOR_SL_PG_MAXE
environment variable. As being commented out, it is "safe", but maybe this commented out code could be removed? @cms-sw/simulation-l2
assign ml, generators, simulation
New categories assigned: ml,generators,simulation
@bbilin,@civanch,@mdhildreth,@mkirsano,@menglu21,@lviliani,@kpedro88,@valsdav,@y19y19 you have been requested to review this Pull request/Issue and eventually sign? Thanks
Therefore, I would remove the
tensorflow::setLogging()
, and set theTF_CPP_MIN_LOG_LEVEL=3
(or2
if strongly motivated) in thetensorflow
scram tool file. @smuzaffar, @cms-sw/ml-l2
I made a PR removing tensorflow::setLogging()
https://github.com/cms-sw/cmssw/pull/46065
@smuzaffar Building on top of the cmsTraceExceptions
, I crafted the following script to trace the calls to setenv()
in the frameworks' parallel section (approximately)
#!/bin/bash
cat << EOF | gdb --args $@
set pagination off
set breakpoint pending on
# breakpoint between Service construction and first parallel section
break ScheduleItems::initMisc
run
# we don't need that breakpoint anymore
clear ScheduleItems::initMisc
# add additional breakpoint
break setenv
command
where
continue
end
continue
quit
EOF
Would it be feasible to run e.g. all runTheMatrix workflows for the default IB (on the default architecture) e.g. once a week?
Therefore, I would remove the tensorflow::setLogging(), and set the TF_CPP_MIN_LOG_LEVEL=3 (or 2 if strongly motivated) in the tensorflow scram tool file. @smuzaffar, @cms-sw/ml-l2
sounds reasonable to me @makortel . Any objections @cms-sw/ml-l2 ?
Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2, @mseidel
right, scram can set there variables properly and we should do it. I will open cmsdist PR
Would it be feasible to run e.g. all runTheMatrix workflows for the default IB (on the default architecture) e.g. once a week?
Sure we can do that. So how will it working ? Should we just use --command ' --prefix catch-setenv.sh'
option for runTheMatrix ?
Would it be feasible to run e.g. all runTheMatrix workflows for the default IB (on the default architecture) e.g. once a week?
Sure we can do that. So how will it working ? Should we just use
--command ' --prefix catch-setenv.sh'
option for runTheMatrix ?
Maybe, I didn't really look into details. Where should we place the script? cmsTraceExceptions
is outside of CMSSW, on the other hand this script would have mild dependence on the framework details to figure out an approximate point to start the tracing.
(in principle the script could also be extended to catch other unwanted functions, but maybe it's better to not overgeneralize)
@makortel , in previous releases of Geant4 a campaign was carried out to substitute setenv by std:setenv and for dramatic reduction of number of calls to std::setenv. So, Geant4 should not bring any problem of this kind. Concerning Castor shower library I will make PR soon.
Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2, @mseidel
these we can set via scram global runtime hooks and that should fix the existing/old releases too. as code if (!std::getenv("RIVET_REF_PATH")) { .... }
only calls setenv
if the env variable is not set so seting it via global runtime hooks will make sure that setenv
is not called.
@makortel , should we do it globally or do you want to fix it in 14.2.X for now?
Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2
these we can set via scram global runtime hooks and that should fix the existing/old releases too. as code
if (!std::getenv("RIVET_REF_PATH")) { .... }
only callssetenv
if the env variable is not set so seting it via global runtime hooks will make sure thatsetenv
is not called.@makortel , should we do it globally or do you want to fix it in 14.2.X for now?
I'm fine with either way.
@civanch
in previous releases of Geant4 a campaign was carried out to substitute setenv by std:setenv and for dramatic reduction of number of calls to std::setenv.
Just to clear up confusion, do you mean getenv()
or setenv()
? I'm asking because there is no std::setenv()
(it is not part of C++ or C standards, but POSIX).
So, Geant4 should not bring any problem of this kind. Concerning Castor shower library I will make PR soon.
Thanks!
Therefore, I would remove the tensorflow::setLogging(), and set the TF_CPP_MIN_LOG_LEVEL=3 (or 2 if strongly motivated) in the tensorflow scram tool file. @smuzaffar, @cms-sw/ml-l2
sounds reasonable to me @makortel . Any objections @cms-sw/ml-l2 ?
Assuming from https://github.com/cms-sw/cmssw/pull/46065#issuecomment-2363054393 that @cms-sw/ml-l2 is ok with the change, I opened a PR to cmsdist to set the environment variable https://github.com/cms-sw/cmsdist/pull/9418
please stop tagging me, I'm the wrong guy
Apologies!
Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2
these we can set via scram global runtime hooks and that should fix the existing/old releases too. as code
if (!std::getenv("RIVET_REF_PATH")) { .... }
only callssetenv
if the env variable is not set so seting it via global runtime hooks will make sure thatsetenv
is not called.@makortel , should we do it globally or do you want to fix it in 14.2.X for now?
Tagging the right person this time @mseidel42 ...
I have added https://github.com/cms-sw/cms-common/commit/0471f579747f77a5a723337eed1c55cdf95afa45 which should do what https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/GeneratorInterface/RivetInterface/plugins/RivetAnalyzer.cc#L53-L73 is doing i.e. if RIVET_REF_PATH
or RIVET_INFO_PATH
are not set then set these pointing to src/GeneratorInterface/RivetInterface/data
under LOCALTOP
and RELEASETOP
.
When deployed then this script will run for all cmssw releases and set RIVET_REF_PATH
or RIVET_INFO_PATH
when one will run cmsenv
@makortel , sorry, of course, getenv.
+simulation
We also have putenv
in few places https://github.com/search?q=repo%3Acms-sw%2Fcmssw+putenv+language%3AC%2B%2B&type=code
This issue spins off from https://github.com/cms-sw/cmssw/issues/44659 after the discovery that our code is calling
setenv()
in the parallel part ofcmsRun
, and concurrently tostd::getenv()
calls (https://github.com/cms-sw/cmssw/issues/44659#issuecomment-2349138317).We need to first analyze the existing cases to call
setenv()
, and remove those for which there is a better way.If there are any legitimate use cases left, we need to figure out a mechanism for setting environment variables during the serial part of
cmsRun
(Service constructors would satisfy the requirement, but we might want something else than moving all thesetenv()
code to new Services).Then we need to migrate the existing code calling
setenv()
.