Open fwyzard opened 1 year ago
assign core
New categories assigned: core
@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @fwyzard Andrea Bocci.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
I wonder what happens with all the CMSSW libraries compiled with -Ofast
(after https://github.com/cms-sw/cmsdist/pull/8280 the ofast-flag
still contains -funsafe-math-optimizations
that IIUC controls this particular behavior). On the other hand, the documentation (gcc 11) explicitly notes for -funsafe-math-optimizations
When used at link time, it may include libraries or startup files that change the default FPU control word or other similar optimizations.
Do I understand correctly that since the ofast-flag
sets only CXXFLAGS
https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/scram-tools.file/tools/gcc/ofast-flag.xml#L1-L5
the -Ofast
is not used at any link step?
I see we use -ffast-math
in fastjet
https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/fastjet.spec#L24
but only in CXXFLAGS
https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/fastjet.spec#L43
and therefore not as part of the linking? (on the other hand, for xz
we set CFLAGS
that apparently propagates to linking?)
I see we use
-ffast-math
infastjet
https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/fastjet.spec#L24 but only inCXXFLAGS
https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/fastjet.spec#L43 and therefore not as part of the linking?
Yes, that's correct. The link commands are something like
libtool: link: g++ -fPIC -DPIC -shared -nostdlib /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/11/crtbeginS.o .libs/libfastjet_la-DnnPlane.o .libs/libfastjet_la-Dnn4piCylinder.o .libs/libfastjet_la-Dnn3piCylinder.o .libs/libfastjet_la-Dnn2piCylinder.o .libs/libfastjet_la-ClusterSequence.o .libs/libfastjet_la-PseudoJet.o .libs/libfastjet_la-Selector.o .libs/libfastjet_la-ClusterSequence_N2.o .libs/libfastjet_la-ClusterSequence_TiledN2.o .libs/libfastjet_la-ClusterSequence_Delaunay.o .libs/libfastjet_la-ClusterSequence_DumbN3.o .libs/libfastjet_la-ClusterSequence_CP2DChan.o .libs/libfastjet_la-ClosestPair2D.o .libs/libfastjet_la-MinHeap.o .libs/libfastjet_la-ClusterSequenceAreaBase.o .libs/libfastjet_la-ClusterSequenceActiveAreaExplicitGhosts.o .libs/libfastjet_la-ClusterSequenceArea.o .libs/libfastjet_la-GhostedAreaSpec.o .libs/libfastjet_la-ClusterSequenceActiveArea.o .libs/libfastjet_la-Voronoi.o .libs/libfastjet_la-ClusterSequenceVoronoiArea.o .libs/libfastjet_la-ClusterSequencePassiveArea.o .libs/libfastjet_la-ClusterSequence1GhostPassiveArea.o .libs/libfastjet_la-PseudoJetStructureBase.o .libs/libfastjet_la-ClusterSequenceStructure.o .libs/libfastjet_la-BasicRandom.o .libs/libfastjet_la-JetDefinition.o .libs/libfastjet_la-Error.o .libs/libfastjet_la-AreaDefinition.o .libs/libfastjet_la-RangeDefinition.o .libs/libfastjet_la-CompositeJetStructure.o .libs/libfastjet_la-FunctionOfPseudoJet.o .libs/libfastjet_la-LimitedWarning.o .libs/libfastjet_la-TilingExtent.o .libs/libfastjet_la-LazyTiling9.o .libs/libfastjet_la-LazyTiling9Alt.o .libs/libfastjet_la-LazyTiling25.o .libs/libfastjet_la-LazyTiling9SeparateGhosts.o .libs/libfastjet_la-RectangularGrid.o -L/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/11/../../.. -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/x86_64-linux-gnu/11/crtfastmath.o /usr/lib/gcc/x86_64-linux-gnu/11/crtendS.o /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/crtn.o -g -O3 -msse3 -Wl,-soname -Wl,libfastjet.so.0 -o .libs/libfastjet.so.0.0.0
(on the other hand, for
xz
we setCFLAGS
that apparently propagates to linking?)
Yes, CFLAGS
is used both for compilation and for linking.
Do I understand correctly that since the
ofast-flag
sets onlyCXXFLAGS
https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/scram-tools.file/tools/gcc/ofast-flag.xml#L1-L5 the-Ofast
is not used at any link step?
No, it does propagate to the link step as well:
>> Building shared library tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/libRecoVertexPrimaryVertexProducer.so
/data/cmssw/el8_amd64_gcc11/external/gcc/11.2.1-f9b9dfdd886f71cd63f5538223d8f161/bin/c++ -O2 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++1z -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -fuse-ld=bfd -msse3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-deprecated-copy -Wno-unused-parameter -Wunused -Wparentheses -Wno-deprecated -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -Ofast -fno-reciprocal-math -mrecip=none -Wno-error=unused-variable -DBOOST_DISABLE_ASSERTS -flto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr -shared -Wl,-E -Wl,-z,defs tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/DAClusterizerInZ.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/DAClusterizerInZT_vect.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/DAClusterizerInZ_vect.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/GapClusterizerInZ.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/PrimaryVertexProducerAlgorithm.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/PrimaryVertexSorter.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/TrackFilterForPVFinding.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/VertexHigherPtSquared.cc.o -o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/libRecoVertexPrimaryVertexProducer.so -Wl,-E -Wl,--hash-style=gnu -L/tmp/fwyzard/CMSSW_13_0_1/biglib/el8_amd64_gcc11 -L/tmp/fwyzard/CMSSW_13_0_1/lib/el8_amd64_gcc11 -L/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_1/biglib/el8_amd64_gcc11 -L/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_1/lib/el8_amd64_gcc11 -L/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_1/external/el8_amd64_gcc11/lib -L/tmp/fwyzard/CMSSW_13_0_1/static/el8_amd64_gcc11 -L/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_1/static/el8_amd64_gcc11 -lRecoVertexAdaptiveVertexFit -lRecoVertexKalmanVertexFit -lRecoVertexLinearizationPointFinders -lRecoVertexVertexTools -lRecoVertexVertexPrimitives -lTrackingToolsTransientTrack -lDataFormatsPatCandidates -lDataFormatsHLTReco -lDataFormatsBTauReco -lDataFormatsJetMatching -lDataFormatsL1TParticleFlow -lDataFormatsMETReco -lDataFormatsTauReco -lDataFormatsJetReco -lTrackingToolsGsfTools -lDataFormatsParticleFlowCandidate -lTrackingToolsPatternTools -lDataFormatsParticleFlowReco -lTrackingToolsTransientTrackingRecHit -lDataFormatsEgammaCandidates -lDataFormatsHcalIsolatedTrack -lDataFormatsL1TCorrelator -lDataFormatsL1TMuonPhase2 -lDataFormatsMuonReco -lTrackingToolsDetLayers -lDataFormatsL1TMuon -lDataFormatsRecoCandidate -lTrackingToolsGeomPropagators -lDataFormatsCSCDigi -lDataFormatsEgammaReco -lDataFormatsGsfTrackReco -lDataFormatsVertexReco -lSimDataFormatsTrackingAnalysis -lTrackingToolsTrajectoryState -lDataFormatsGEMDigi -lDataFormatsTrackReco -lDataFormatsGEMRecHit -lDataFormatsTrackCandidate -lDataFormatsTrackerRecHit2D -lTrackingToolsRecords -lDataFormatsCSCRecHit -lDataFormatsDTRecHit -lDataFormatsFTLRecHit -lDataFormatsTrajectorySeed -lRecoLocalTrackerRecords -lCalibTrackerRecords -lDataFormatsL1Trigger -lDataFormatsTrackingRecHit -lDataFormatsL1TrackTrigger -lMagneticFieldRecords -lCondFormatsDataRecord -lDataFormatsTrackerCommon -lGeometryCaloGeometry -lGeometryCommonTopologies -lCondFormatsAlignment -lDataFormatsBeamSpot -lDataFormatsCaloTowers -lDataFormatsEcalRecHit -lDataFormatsGeometryCommonDetAlgo -lDataFormatsHcalRecHit -lDataFormatsHepMCCandidate -lDataFormatsSiStripCluster -lGeometryRecords -lTrackingToolsAnalyticalJacobians -lCommonToolsStatistics -lCondFormatsAlignmentRecord -lDataFormatsCandidate -lDataFormatsDTDigi -lDataFormatsEcalDigi -lDataFormatsGeometrySurface -lDataFormatsHcalDigi -lDataFormatsL1GlobalCaloTrigger -lDataFormatsTrajectoryState -lMagneticFieldEngine -lDataFormatsCLHEP -lDataFormatsCaloRecHit -lDataFormatsEcalDetId -lDataFormatsForwardDetId -lDataFormatsGeometryVector -lDataFormatsHcalDetId -lDataFormatsL1CaloTrigger -lDataFormatsL1GlobalTrigger -lDataFormatsMuonDetId -lDataFormatsPhase2TrackerCluster -lDataFormatsSiPixelDetId -lDataFormatsSiStripDetId -lFWCoreFramework -lSimDataFormatsTrack -lSimDataFormatsVertex -lDataFormatsDetId -lDataFormatsFEDRawData -lDataFormatsL1GlobalMuonTrigger -lDataFormatsMath -lDataFormatsPhase2TrackerDigi -lDataFormatsScouting -lDataFormatsSiPixelCluster -lDataFormatsSiStripDigi -lFWCoreCommon -lFWCoreServiceRegistry -lSimDataFormatsGeneratorProducts -lDataFormatsCommon -lFWCoreParameterSet -lFWCoreMessageLogger -lDataFormatsProvenance -lFWCorePluginManager -lFWCoreReflection -lTrackingToolsTrajectoryParametrization -lCondFormatsSerialization -lFWCoreConcurrency -lFWCoreUtilities -lFWCoreVersion -lSimDataFormatsEncodedEventId -lPhysics -lHist -lMatrix -lGenVector -lMathMore -lTree -lNet -lThread -lMathCore -lRIO -lSmatrix -lboost_regex -lboost_serialization -lCore -lboost_thread -lboost_date_time -lCLHEP -lHepMCfio -lHepMC -lpcre -lbz2 -lgsl -luuid -ltbb -llzma -lz -lfmt -lHepMC3 -lHepMC3search -lcms-md5 -lopenblas -lcrypt -ldl -lrt -lstdc++fs -ltinyxml2
^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I did some experimentation with the fptest.py
script shown in the blog post https://moyix.blogspot.com/2022/09/someones-been-messing-with-my-subnormals.html
liblzma.so
indeed alters the FP behaviorfastjet
also alters the FP behaviorlibtbb.so
does not alter the FP behavior (mostly a sanity check)libRecoVertexPrimaryVertexProducer.so
alters the FP behaviorlibFWCoreFramework.so
and libDataFormatsProvenance.so
alter the FP behavior (likely via ROOT that depends on liblzma.so
)libFWCoreUtilities.so
does not alter the FP behaviorSo in practice (nearly?) every CMSSW application runs by treating subnormals as zero.
So, questions...
-ftz
to the NVCC flags ?So, questions...
Good ones...
- is this the indented behaviour ?
I'd say it is at least surprising behavior. What I read about it now, the claim seems to be that not flushing subnormals to zero can lead to a significant performance degradation on vectorized computations. I guess we should somehow test how much our application(s) would really be affected.
I also wonder how much of our code is susceptible for the subnormal flushing to zero (in either zeroing resulting in loss of precision, or being effectively required for converge).
Anyway, I think we should do the decisions consciously, and it would be better if we'd be able to control it.
- is there a simple way to check with an EDAnalyzer ?
Could something as simple as constructing a subnormal float and double and checking if it is nonzero work?
- should we enable the same behaviour also for e.g. CUDA, adding
-ftz
to the NVCC flags ?
It could be an interesting test, in part to see if it has any impact on performance, and also to see if it has any noticeable impact in physics. I suppose in principle it would be good to keep the CPU and GPU compilation setup similar.
@VinInn do you have any suggestions ?
I think that if we have even a single library compiled with Ofast
we should flush subnormal to zero in the main.
we can flush on GPUs as well.
I have indeed a trivial test to check if subnormal are (not) flushed to zero.
int one=1;
assert(0==*(const float *)(&one));
or similar (https://github.com/VinInn/ctest/blob/master/floatPrec/sub.cpp)
int one=1; assert(0==(const float )(&one));
thanks... I've had to add a volatile
to prevent the compiler from computing it a compile time :-)
bool check_for_denormals() {
unsigned int one = 1;
volatile float value;
std::memcpy((void*) &value, (void*) &one, sizeof(float));
return (value != 0.);
}
of course we shall not use rsqrtf
to avoid differences with CPU (we disabled on CPU as well as it differs between AMD and INTEL)
FWIW, here's some other good reference on why we shouldn't use fastmath in anything but the most tightly controlled project (i.e. when you can know every single platform you run on and every single piece of code linked with your code :))
cms-bot internal usage
As discussed yesterday (in Core SW meeting), I rebuilt lzma
with -O3
and then (using root benchmark [a] ) compare it with lzma
with -Ofast
. With -O3
root benchmark test is 0.75% slow (8.269s
vs 8.207s
avg of 5 runs). The generated f.root
file in both cases has same size.
I have no objection on building lzma
with -O3
[a] https://github.com/root-project/root/pull/6144#issuecomment-671236231
// write_lzma.cpp
#include <ROOT/RDataFrame.hxx>
#include <TStopwatch.h>
#include <iostream>
int main() {
ROOT::RDF::RSnapshotOptions opts;
opts.fCompressionAlgorithm = ROOT::kLZMA;
opts.fCompressionLevel = 6;
opts.fLazy = true;
auto trigger = ROOT::RDataFrame(50000000).Define("x", [] { return 42; }).Snapshot<int>("t", "f.root", {"x"}, opts);
TStopwatch s;
s.Start();
*trigger;
s.Stop();
std::cout << s.RealTime() << std::endl;
return 0;
}
A warning message that has been recently added to
numpy
uncovered that buildingliblzma.so
with-Ofast
alters the floating point behaviour of all executables and libraries that link with it, directly or indirectly.For an interesting write up, see https://moyix.blogspot.com/2022/09/someones-been-messing-with-my-subnormals.html .
For a related discussion regarding this behaviour, and a proposal to change it, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522 . The conclusion is that GCC 13 will no longer cause this to happen -- but that's quite far in the future for us.
As far as I can see,
liblzma.so
is the only external that we build with-Ofast
. I think that we should-Ofast
is the best choice, or whether-O2
or-O3
would perform better;-Ofast
, passLDFLAGS='-O3'
to disable the unsafe math at link time.