re-evaluate the optimisation flags for `liblzma.so`

fwyzard commented 1 year ago

A warning message that has been recently added to numpy uncovered that building liblzma.so with -Ofast alters the floating point behaviour of all executables and libraries that link with it, directly or indirectly.

For an interesting write up, see https://moyix.blogspot.com/2022/09/someones-been-messing-with-my-subnormals.html .

For a related discussion regarding this behaviour, and a proposal to change it, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522 . The conclusion is that GCC 13 will no longer cause this to happen -- but that's quite far in the future for us.

As far as I can see, liblzma.so is the only external that we build with -Ofast. I think that we should

re-evaluate if -Ofast is the best choice, or whether -O2 or -O3 would perform better;
if we decide to stay with -Ofast, pass LDFLAGS='-O3' to disable the unsafe math at link time.

fwyzard commented 1 year ago

assign core

cmsbuild commented 1 year ago

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 1 year ago

A new Issue was created by @fwyzard Andrea Bocci.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 year ago

I wonder what happens with all the CMSSW libraries compiled with -Ofast (after https://github.com/cms-sw/cmsdist/pull/8280 the ofast-flag still contains -funsafe-math-optimizations that IIUC controls this particular behavior). On the other hand, the documentation (gcc 11) explicitly notes for -funsafe-math-optimizations

When used at link time, it may include libraries or startup files that change the default FPU control word or other similar optimizations.

Do I understand correctly that since the ofast-flag sets only CXXFLAGS https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/scram-tools.file/tools/gcc/ofast-flag.xml#L1-L5 the -Ofast is not used at any link step?

makortel commented 1 year ago

I see we use -ffast-math in fastjet https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/fastjet.spec#L24 but only in CXXFLAGS https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/fastjet.spec#L43 and therefore not as part of the linking? (on the other hand, for xz we set CFLAGS that apparently propagates to linking?)

fwyzard commented 1 year ago

I see we use -ffast-math in fastjet https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/fastjet.spec#L24 but only in CXXFLAGS https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/fastjet.spec#L43 and therefore not as part of the linking?

Yes, that's correct. The link commands are something like

libtool: link: g++  -fPIC -DPIC -shared -nostdlib /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/11/crtbeginS.o  .libs/libfastjet_la-DnnPlane.o .libs/libfastjet_la-Dnn4piCylinder.o .libs/libfastjet_la-Dnn3piCylinder.o .libs/libfastjet_la-Dnn2piCylinder.o .libs/libfastjet_la-ClusterSequence.o .libs/libfastjet_la-PseudoJet.o .libs/libfastjet_la-Selector.o .libs/libfastjet_la-ClusterSequence_N2.o .libs/libfastjet_la-ClusterSequence_TiledN2.o .libs/libfastjet_la-ClusterSequence_Delaunay.o .libs/libfastjet_la-ClusterSequence_DumbN3.o .libs/libfastjet_la-ClusterSequence_CP2DChan.o .libs/libfastjet_la-ClosestPair2D.o .libs/libfastjet_la-MinHeap.o .libs/libfastjet_la-ClusterSequenceAreaBase.o .libs/libfastjet_la-ClusterSequenceActiveAreaExplicitGhosts.o .libs/libfastjet_la-ClusterSequenceArea.o .libs/libfastjet_la-GhostedAreaSpec.o .libs/libfastjet_la-ClusterSequenceActiveArea.o .libs/libfastjet_la-Voronoi.o .libs/libfastjet_la-ClusterSequenceVoronoiArea.o .libs/libfastjet_la-ClusterSequencePassiveArea.o .libs/libfastjet_la-ClusterSequence1GhostPassiveArea.o .libs/libfastjet_la-PseudoJetStructureBase.o .libs/libfastjet_la-ClusterSequenceStructure.o .libs/libfastjet_la-BasicRandom.o .libs/libfastjet_la-JetDefinition.o .libs/libfastjet_la-Error.o .libs/libfastjet_la-AreaDefinition.o .libs/libfastjet_la-RangeDefinition.o .libs/libfastjet_la-CompositeJetStructure.o .libs/libfastjet_la-FunctionOfPseudoJet.o .libs/libfastjet_la-LimitedWarning.o .libs/libfastjet_la-TilingExtent.o .libs/libfastjet_la-LazyTiling9.o .libs/libfastjet_la-LazyTiling9Alt.o .libs/libfastjet_la-LazyTiling25.o .libs/libfastjet_la-LazyTiling9SeparateGhosts.o .libs/libfastjet_la-RectangularGrid.o   -L/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/11/../../.. -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/x86_64-linux-gnu/11/crtfastmath.o /usr/lib/gcc/x86_64-linux-gnu/11/crtendS.o /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/crtn.o  -g -O3 -msse3   -Wl,-soname -Wl,libfastjet.so.0 -o .libs/libfastjet.so.0.0.0

(on the other hand, for xz we set CFLAGS that apparently propagates to linking?)

Yes, CFLAGS is used both for compilation and for linking.

fwyzard commented 1 year ago

Do I understand correctly that since the ofast-flag sets only CXXFLAGS https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/scram-tools.file/tools/gcc/ofast-flag.xml#L1-L5 the -Ofast is not used at any link step?

No, it does propagate to the link step as well:

>> Building shared library tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/libRecoVertexPrimaryVertexProducer.so
/data/cmssw/el8_amd64_gcc11/external/gcc/11.2.1-f9b9dfdd886f71cd63f5538223d8f161/bin/c++ -O2 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++1z -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -fuse-ld=bfd -msse3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-deprecated-copy -Wno-unused-parameter -Wunused -Wparentheses -Wno-deprecated -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -Ofast -fno-reciprocal-math -mrecip=none -Wno-error=unused-variable -DBOOST_DISABLE_ASSERTS -flto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr -shared -Wl,-E -Wl,-z,defs tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/DAClusterizerInZ.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/DAClusterizerInZT_vect.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/DAClusterizerInZ_vect.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/GapClusterizerInZ.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/PrimaryVertexProducerAlgorithm.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/PrimaryVertexSorter.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/TrackFilterForPVFinding.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/VertexHigherPtSquared.cc.o -o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/libRecoVertexPrimaryVertexProducer.so -Wl,-E -Wl,--hash-style=gnu -L/tmp/fwyzard/CMSSW_13_0_1/biglib/el8_amd64_gcc11 -L/tmp/fwyzard/CMSSW_13_0_1/lib/el8_amd64_gcc11 -L/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_1/biglib/el8_amd64_gcc11 -L/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_1/lib/el8_amd64_gcc11 -L/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_1/external/el8_amd64_gcc11/lib -L/tmp/fwyzard/CMSSW_13_0_1/static/el8_amd64_gcc11 -L/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_1/static/el8_amd64_gcc11 -lRecoVertexAdaptiveVertexFit -lRecoVertexKalmanVertexFit -lRecoVertexLinearizationPointFinders -lRecoVertexVertexTools -lRecoVertexVertexPrimitives -lTrackingToolsTransientTrack -lDataFormatsPatCandidates -lDataFormatsHLTReco -lDataFormatsBTauReco -lDataFormatsJetMatching -lDataFormatsL1TParticleFlow -lDataFormatsMETReco -lDataFormatsTauReco -lDataFormatsJetReco -lTrackingToolsGsfTools -lDataFormatsParticleFlowCandidate -lTrackingToolsPatternTools -lDataFormatsParticleFlowReco -lTrackingToolsTransientTrackingRecHit -lDataFormatsEgammaCandidates -lDataFormatsHcalIsolatedTrack -lDataFormatsL1TCorrelator -lDataFormatsL1TMuonPhase2 -lDataFormatsMuonReco -lTrackingToolsDetLayers -lDataFormatsL1TMuon -lDataFormatsRecoCandidate -lTrackingToolsGeomPropagators -lDataFormatsCSCDigi -lDataFormatsEgammaReco -lDataFormatsGsfTrackReco -lDataFormatsVertexReco -lSimDataFormatsTrackingAnalysis -lTrackingToolsTrajectoryState -lDataFormatsGEMDigi -lDataFormatsTrackReco -lDataFormatsGEMRecHit -lDataFormatsTrackCandidate -lDataFormatsTrackerRecHit2D -lTrackingToolsRecords -lDataFormatsCSCRecHit -lDataFormatsDTRecHit -lDataFormatsFTLRecHit -lDataFormatsTrajectorySeed -lRecoLocalTrackerRecords -lCalibTrackerRecords -lDataFormatsL1Trigger -lDataFormatsTrackingRecHit -lDataFormatsL1TrackTrigger -lMagneticFieldRecords -lCondFormatsDataRecord -lDataFormatsTrackerCommon -lGeometryCaloGeometry -lGeometryCommonTopologies -lCondFormatsAlignment -lDataFormatsBeamSpot -lDataFormatsCaloTowers -lDataFormatsEcalRecHit -lDataFormatsGeometryCommonDetAlgo -lDataFormatsHcalRecHit -lDataFormatsHepMCCandidate -lDataFormatsSiStripCluster -lGeometryRecords -lTrackingToolsAnalyticalJacobians -lCommonToolsStatistics -lCondFormatsAlignmentRecord -lDataFormatsCandidate -lDataFormatsDTDigi -lDataFormatsEcalDigi -lDataFormatsGeometrySurface -lDataFormatsHcalDigi -lDataFormatsL1GlobalCaloTrigger -lDataFormatsTrajectoryState -lMagneticFieldEngine -lDataFormatsCLHEP -lDataFormatsCaloRecHit -lDataFormatsEcalDetId -lDataFormatsForwardDetId -lDataFormatsGeometryVector -lDataFormatsHcalDetId -lDataFormatsL1CaloTrigger -lDataFormatsL1GlobalTrigger -lDataFormatsMuonDetId -lDataFormatsPhase2TrackerCluster -lDataFormatsSiPixelDetId -lDataFormatsSiStripDetId -lFWCoreFramework -lSimDataFormatsTrack -lSimDataFormatsVertex -lDataFormatsDetId -lDataFormatsFEDRawData -lDataFormatsL1GlobalMuonTrigger -lDataFormatsMath -lDataFormatsPhase2TrackerDigi -lDataFormatsScouting -lDataFormatsSiPixelCluster -lDataFormatsSiStripDigi -lFWCoreCommon -lFWCoreServiceRegistry -lSimDataFormatsGeneratorProducts -lDataFormatsCommon -lFWCoreParameterSet -lFWCoreMessageLogger -lDataFormatsProvenance -lFWCorePluginManager -lFWCoreReflection -lTrackingToolsTrajectoryParametrization -lCondFormatsSerialization -lFWCoreConcurrency -lFWCoreUtilities -lFWCoreVersion -lSimDataFormatsEncodedEventId -lPhysics -lHist -lMatrix -lGenVector -lMathMore -lTree -lNet -lThread -lMathCore -lRIO -lSmatrix -lboost_regex -lboost_serialization -lCore -lboost_thread -lboost_date_time -lCLHEP -lHepMCfio -lHepMC -lpcre -lbz2 -lgsl -luuid -ltbb -llzma -lz -lfmt -lHepMC3 -lHepMC3search -lcms-md5 -lopenblas -lcrypt -ldl -lrt -lstdc++fs -ltinyxml2
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ^^^^^^                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

makortel commented 1 year ago

I did some experimentation with the fptest.py script shown in the blog post https://moyix.blogspot.com/2022/09/someones-been-messing-with-my-subnormals.html

liblzma.so indeed alters the FP behavior
every library in fastjet also alters the FP behavior
libtbb.so does not alter the FP behavior (mostly a sanity check)
libRecoVertexPrimaryVertexProducer.so alters the FP behavior
already libFWCoreFramework.so and libDataFormatsProvenance.so alter the FP behavior (likely via ROOT that depends on liblzma.so)
libFWCoreUtilities.so does not alter the FP behavior

So in practice (nearly?) every CMSSW application runs by treating subnormals as zero.

fwyzard commented 1 year ago

So, questions...

is this the indented behaviour ?
is there a simple way to check with an EDAnalyzer ?
should we enable the same behaviour also for e.g. CUDA, adding -ftz to the NVCC flags ?

makortel commented 1 year ago

So, questions...

Good ones...

is this the indented behaviour ?

I'd say it is at least surprising behavior. What I read about it now, the claim seems to be that not flushing subnormals to zero can lead to a significant performance degradation on vectorized computations. I guess we should somehow test how much our application(s) would really be affected.

I also wonder how much of our code is susceptible for the subnormal flushing to zero (in either zeroing resulting in loss of precision, or being effectively required for converge).

Anyway, I think we should do the decisions consciously, and it would be better if we'd be able to control it.

is there a simple way to check with an EDAnalyzer ?

Could something as simple as constructing a subnormal float and double and checking if it is nonzero work?

should we enable the same behaviour also for e.g. CUDA, adding -ftz to the NVCC flags ?

It could be an interesting test, in part to see if it has any impact on performance, and also to see if it has any noticeable impact in physics. I suppose in principle it would be good to keep the CPU and GPU compilation setup similar.

fwyzard commented 1 year ago

@VinInn do you have any suggestions ?

VinInn commented 1 year ago

I think that if we have even a single library compiled with Ofast we should flush subnormal to zero in the main. we can flush on GPUs as well. I have indeed a trivial test to check if subnormal are (not) flushed to zero.

 int one=1;
  assert(0==*(const float *)(&one));

fwyzard commented 1 year ago

Just for info: https://developer.nvidia.com/blog/cuda-pro-tip-flush-denormals-confidence/ .

fwyzard commented 1 year ago

int one=1; assert(0==(const float )(&one));

thanks... I've had to add a volatile to prevent the compiler from computing it a compile time :-)

fwyzard commented 1 year ago

bool check_for_denormals() {
  unsigned int one = 1;
  volatile float value;
  std::memcpy((void*) &value, (void*) &one, sizeof(float));
  return (value != 0.);
}

VinInn commented 1 year ago

of course we shall not use rsqrtf to avoid differences with CPU (we disabled on CPU as well as it differs between AMD and INTEL)

Moelf commented 1 year ago

FWIW, here's some other good reference on why we shouldn't use fastmath in anything but the most tightly controlled project (i.e. when you can know every single platform you run on and every single piece of code linked with your code :))

gartung commented 1 week ago

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522

cmsbuild commented 1 week ago

cms-bot internal usage

smuzaffar commented 1 week ago

As discussed yesterday (in Core SW meeting), I rebuilt lzma with -O3 and then (using root benchmark [a] ) compare it with lzma with -Ofast. With -O3 root benchmark test is 0.75% slow (8.269s vs 8.207s avg of 5 runs). The generated f.root file in both cases has same size.

I have no objection on building lzma with -O3

[a] https://github.com/root-project/root/pull/6144#issuecomment-671236231

// write_lzma.cpp
#include <ROOT/RDataFrame.hxx>                                                                                          
#include <TStopwatch.h>                                                                                                 
#include <iostream>                                                                                                     

int main() {                                                                                                            
   ROOT::RDF::RSnapshotOptions opts;                                                                                    
   opts.fCompressionAlgorithm =  ROOT::kLZMA;                                                                           
   opts.fCompressionLevel = 6;                                                                                          
   opts.fLazy = true;                                                                                                   
   auto trigger = ROOT::RDataFrame(50000000).Define("x", [] { return 42; }).Snapshot<int>("t", "f.root", {"x"}, opts);  
   TStopwatch s;                                                                                                        
   s.Start();                                                                                                           
   *trigger;                                                                                                            
   s.Stop();                                                                                                            
   std::cout << s.RealTime() << std::endl;                                                                              
   return 0;                                                                                                            
}

cms-sw / cmssw

re-evaluate the optimisation flags for `liblzma.so` #41773