cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.09k stars 4.33k forks source link

`setenv()` is not thread safe #46002

Open makortel opened 2 months ago

makortel commented 2 months ago

This issue spins off from https://github.com/cms-sw/cmssw/issues/44659 after the discovery that our code is calling setenv() in the parallel part of cmsRun, and concurrently to std::getenv() calls (https://github.com/cms-sw/cmssw/issues/44659#issuecomment-2349138317).

We need to first analyze the existing cases to call setenv(), and remove those for which there is a better way.

If there are any legitimate use cases left, we need to figure out a mechanism for setting environment variables during the serial part of cmsRun (Service constructors would satisfy the requirement, but we might want something else than moving all the setenv() code to new Services).

Then we need to migrate the existing code calling setenv().

makortel commented 2 months ago

assign core

cmsbuild commented 2 months ago

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 2 months ago

cms-bot internal usage

cmsbuild commented 2 months ago

A new Issue was created by @makortel.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 2 months ago

Here are some examples of the present setenv() calls https://github.com/cms-sw/cmssw/blob/6e31ebe803aa3cd54ba56999036e36202a849dbc/PhysicsTools/TensorFlow/src/TensorFlow.cc#L90-L98 that is called from e.g. https://github.com/cms-sw/cmssw/blob/6e31ebe803aa3cd54ba56999036e36202a849dbc/L1Trigger/L1CaloTrigger/plugins/L1NNCaloTauEmulator.cc#L720 https://github.com/cms-sw/cmssw/blob/6e31ebe803aa3cd54ba56999036e36202a849dbc/L1Trigger/L1CaloTrigger/plugins/L1NNCaloTauProducer.cc#L585 https://github.com/cms-sw/cmssw/blob/6e31ebe803aa3cd54ba56999036e36202a849dbc/RecoMuon/TrackerSeedGenerator/plugins/TSGForOIDNN.cc#L211


https://github.com/cms-sw/cmssw/blob/6e31ebe803aa3cd54ba56999036e36202a849dbc/GeneratorInterface/RivetInterface/plugins/RivetAnalyzer.cc#L62


https://github.com/cms-sw/cmssw/blob/6e31ebe803aa3cd54ba56999036e36202a849dbc/HeterogeneousCore/MPIServices/src/MPIService.cc#L50

makortel commented 2 months ago

I tested that modifications to the os.environ in the cfg.py file are visible to the C++ side of cmsRun.

One option would be to implement the environment variable setting purely in the python side. I think the solution would have to

makortel commented 2 months ago

Actually the next step should be a more detailed analysis of the existing use cases of setting the environment variables in CMSSW code.

We should also find a way to get an understanding if any 3rd party code calls setenv() (in the parallel section of cmsRun). Possible ways could be e.g. a gdb script or LD_PRELOADable library to override setenv().

makortel commented 2 months ago

It seems to me the most likely setenv() call to cause problems is the one in tensorflow::setLogging() https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/PhysicsTools/TensorFlow/src/TensorFlow.cc#L90-L98

Unsafe calls to tensorflow::setLogging():

Possibly unsafe call:

Acceptable calls (in single-threaded test code in PhysicsTools/TensorFlow)

Problems with this pattern

Therefore, I would remove the tensorflow::setLogging(), and set the TF_CPP_MIN_LOG_LEVEL=3 (or 2 if strongly motivated) in the tensorflow scram tool file. @smuzaffar, @cms-sw/ml-l2

makortel commented 2 months ago

RivetAnalyzer::beginJob() sets RIVET_REF_PATH and RIVET_INFO_PATH if they are not already set. https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/GeneratorInterface/RivetInterface/plugins/RivetAnalyzer.cc#L53-L73

Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2

makortel commented 2 months ago

spu::Unzip resets the TMPDIR if it is longer than 50 characters https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/GeneratorInterface/SherpaInterface/src/SherpackUtilities.cc#L154-L161 Seems that this code was added in https://github.com/cms-sw/cmssw/pull/21682 to work around a "feature" of OpenMPI 2.0 and 2.1 (more information in https://github.com/cms-sw/cmssw/pull/21419, https://hypernews.cern.ch/HyperNews/CMS/get/edmFramework/3807/1/1/1.html, https://www.open-mpi.org/faq/?category=osx#startup-errors-with-open-mpi-2.0.x).

Our OpenMPI version is now 4.1.6, maybe this workaround is no longer needed? @cms-sw/generators-l2

For future reference, the spu::Unzip is called via

makortel commented 2 months ago

MPIService constructor sets the OMPI_MCA_pmix_server_uri from a configuration parameter, if such parameter exists. https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/HeterogeneousCore/MPIServices/src/MPIService.cc#L46-L51

As Services are constructed in the serial part of cmsRun, this setenv() call is ok-ish.

makortel commented 2 months ago

SiStripConfigDb::usingDatabase() resets TNS_ADMIN https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/OnlineDB/SiStripConfigDb/src/SiStripConfigDb.cc#L260-L292 potentially modifying an existing value or a value from configuration.

The usingDatabase() is called via

SiStripConfigDb is defined as a Service, and since Services are constructed in the serial part of cmsRun, this setenv() call is ok-ish.

makortel commented 2 months ago

main() in OnlineDB/EcalCondDB/test/LaserSeqToDB.cpp sets MESTORE https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/OnlineDB/EcalCondDB/test/LaserSeqToDB.cpp#L1015 This program looks serial up to that point (unless ROOT would spawn threads), so is safe.

makortel commented 2 months ago

CastorShowerLibraryMaker::update() has a commented out call https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/SimG4CMS/ShowerLibraryProducer/plugins/CastorShowerLibraryMaker.cc#L686-L696 to set the CASTOR_SL_PG_MAXE environment variable. As being commented out, it is "safe", but maybe this commented out code could be removed? @cms-sw/simulation-l2

makortel commented 2 months ago

assign ml, generators, simulation

cmsbuild commented 2 months ago

New categories assigned: ml,generators,simulation

@bbilin,@civanch,@mdhildreth,@mkirsano,@menglu21,@lviliani,@kpedro88,@valsdav,@y19y19 you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 2 months ago

Therefore, I would remove the tensorflow::setLogging(), and set the TF_CPP_MIN_LOG_LEVEL=3 (or 2 if strongly motivated) in the tensorflow scram tool file. @smuzaffar, @cms-sw/ml-l2

I made a PR removing tensorflow::setLogging() https://github.com/cms-sw/cmssw/pull/46065

makortel commented 2 months ago

@smuzaffar Building on top of the cmsTraceExceptions, I crafted the following script to trace the calls to setenv() in the frameworks' parallel section (approximately)

#!/bin/bash

cat << EOF | gdb --args $@
set pagination off
set breakpoint pending on
# breakpoint between Service construction and first parallel section
break ScheduleItems::initMisc
run
# we don't need that breakpoint anymore
clear ScheduleItems::initMisc
# add additional breakpoint
break setenv
command
where
continue
end
continue
quit
EOF

Would it be feasible to run e.g. all runTheMatrix workflows for the default IB (on the default architecture) e.g. once a week?

smuzaffar commented 2 months ago

Therefore, I would remove the tensorflow::setLogging(), and set the TF_CPP_MIN_LOG_LEVEL=3 (or 2 if strongly motivated) in the tensorflow scram tool file. @smuzaffar, @cms-sw/ml-l2

sounds reasonable to me @makortel . Any objections @cms-sw/ml-l2 ?

smuzaffar commented 2 months ago

Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2, @mseidel

right, scram can set there variables properly and we should do it. I will open cmsdist PR

smuzaffar commented 2 months ago

Would it be feasible to run e.g. all runTheMatrix workflows for the default IB (on the default architecture) e.g. once a week?

Sure we can do that. So how will it working ? Should we just use --command ' --prefix catch-setenv.sh' option for runTheMatrix ?

makortel commented 2 months ago

Would it be feasible to run e.g. all runTheMatrix workflows for the default IB (on the default architecture) e.g. once a week?

Sure we can do that. So how will it working ? Should we just use --command ' --prefix catch-setenv.sh' option for runTheMatrix ?

Maybe, I didn't really look into details. Where should we place the script? cmsTraceExceptions is outside of CMSSW, on the other hand this script would have mild dependence on the framework details to figure out an approximate point to start the tracing.

(in principle the script could also be extended to catch other unwanted functions, but maybe it's better to not overgeneralize)

civanch commented 2 months ago

@makortel , in previous releases of Geant4 a campaign was carried out to substitute setenv by std:setenv and for dramatic reduction of number of calls to std::setenv. So, Geant4 should not bring any problem of this kind. Concerning Castor shower library I will make PR soon.

smuzaffar commented 2 months ago

Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2, @mseidel

these we can set via scram global runtime hooks and that should fix the existing/old releases too. as code if (!std::getenv("RIVET_REF_PATH")) { .... } only calls setenv if the env variable is not set so seting it via global runtime hooks will make sure that setenv is not called.

@makortel , should we do it globally or do you want to fix it in 14.2.X for now?

makortel commented 2 months ago

Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2

these we can set via scram global runtime hooks and that should fix the existing/old releases too. as code if (!std::getenv("RIVET_REF_PATH")) { .... } only calls setenv if the env variable is not set so seting it via global runtime hooks will make sure that setenv is not called.

@makortel , should we do it globally or do you want to fix it in 14.2.X for now?

I'm fine with either way.

makortel commented 2 months ago

@civanch

in previous releases of Geant4 a campaign was carried out to substitute setenv by std:setenv and for dramatic reduction of number of calls to std::setenv.

Just to clear up confusion, do you mean getenv() or setenv()? I'm asking because there is no std::setenv() (it is not part of C++ or C standards, but POSIX).

So, Geant4 should not bring any problem of this kind. Concerning Castor shower library I will make PR soon.

Thanks!

makortel commented 2 months ago

Therefore, I would remove the tensorflow::setLogging(), and set the TF_CPP_MIN_LOG_LEVEL=3 (or 2 if strongly motivated) in the tensorflow scram tool file. @smuzaffar, @cms-sw/ml-l2

sounds reasonable to me @makortel . Any objections @cms-sw/ml-l2 ?

Assuming from https://github.com/cms-sw/cmssw/pull/46065#issuecomment-2363054393 that @cms-sw/ml-l2 is ok with the change, I opened a PR to cmsdist to set the environment variable https://github.com/cms-sw/cmsdist/pull/9418

mseidel commented 2 months ago

please stop tagging me, I'm the wrong guy

makortel commented 2 months ago

Apologies!

makortel commented 2 months ago

Would it be feasible to set RIVET_REF_PATH and RIVET_INFO_PATH in the rivet scram tool file? @smuzaffar, @cms-sw/generators-l2

these we can set via scram global runtime hooks and that should fix the existing/old releases too. as code if (!std::getenv("RIVET_REF_PATH")) { .... } only calls setenv if the env variable is not set so seting it via global runtime hooks will make sure that setenv is not called.

@makortel , should we do it globally or do you want to fix it in 14.2.X for now?

Tagging the right person this time @mseidel42 ...

smuzaffar commented 2 months ago

I have added https://github.com/cms-sw/cms-common/commit/0471f579747f77a5a723337eed1c55cdf95afa45 which should do what https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/GeneratorInterface/RivetInterface/plugins/RivetAnalyzer.cc#L53-L73 is doing i.e. if RIVET_REF_PATH or RIVET_INFO_PATH are not set then set these pointing to src/GeneratorInterface/RivetInterface/data under LOCALTOP and RELEASETOP.

When deployed then this script will run for all cmssw releases and set RIVET_REF_PATH or RIVET_INFO_PATH when one will run cmsenv

civanch commented 2 months ago

@makortel , sorry, of course, getenv.

civanch commented 2 months ago

+simulation

smuzaffar commented 1 month ago

We also have putenv in few places https://github.com/search?q=repo%3Acms-sw%2Fcmssw+putenv+language%3AC%2B%2B&type=code