cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.07k stars 4.28k forks source link

re-evaluate the optimisation flags for `liblzma.so` #41773

Open fwyzard opened 1 year ago

fwyzard commented 1 year ago

A warning message that has been recently added to numpy uncovered that building liblzma.so with -Ofast alters the floating point behaviour of all executables and libraries that link with it, directly or indirectly.

For an interesting write up, see https://moyix.blogspot.com/2022/09/someones-been-messing-with-my-subnormals.html .

For a related discussion regarding this behaviour, and a proposal to change it, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522 . The conclusion is that GCC 13 will no longer cause this to happen -- but that's quite far in the future for us.

As far as I can see, liblzma.so is the only external that we build with -Ofast. I think that we should

fwyzard commented 1 year ago

assign core

cmsbuild commented 1 year ago

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 1 year ago

A new Issue was created by @fwyzard Andrea Bocci.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 year ago

I wonder what happens with all the CMSSW libraries compiled with -Ofast (after https://github.com/cms-sw/cmsdist/pull/8280 the ofast-flag still contains -funsafe-math-optimizations that IIUC controls this particular behavior). On the other hand, the documentation (gcc 11) explicitly notes for -funsafe-math-optimizations

When used at link time, it may include libraries or startup files that change the default FPU control word or other similar optimizations.

Do I understand correctly that since the ofast-flag sets only CXXFLAGS https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/scram-tools.file/tools/gcc/ofast-flag.xml#L1-L5 the -Ofast is not used at any link step?

makortel commented 1 year ago

I see we use -ffast-math in fastjet https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/fastjet.spec#L24 but only in CXXFLAGS https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/fastjet.spec#L43 and therefore not as part of the linking? (on the other hand, for xz we set CFLAGS that apparently propagates to linking?)

fwyzard commented 1 year ago

I see we use -ffast-math in fastjet https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/fastjet.spec#L24 but only in CXXFLAGS https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/fastjet.spec#L43 and therefore not as part of the linking?

Yes, that's correct. The link commands are something like

libtool: link: g++  -fPIC -DPIC -shared -nostdlib /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/11/crtbeginS.o  .libs/libfastjet_la-DnnPlane.o .libs/libfastjet_la-Dnn4piCylinder.o .libs/libfastjet_la-Dnn3piCylinder.o .libs/libfastjet_la-Dnn2piCylinder.o .libs/libfastjet_la-ClusterSequence.o .libs/libfastjet_la-PseudoJet.o .libs/libfastjet_la-Selector.o .libs/libfastjet_la-ClusterSequence_N2.o .libs/libfastjet_la-ClusterSequence_TiledN2.o .libs/libfastjet_la-ClusterSequence_Delaunay.o .libs/libfastjet_la-ClusterSequence_DumbN3.o .libs/libfastjet_la-ClusterSequence_CP2DChan.o .libs/libfastjet_la-ClosestPair2D.o .libs/libfastjet_la-MinHeap.o .libs/libfastjet_la-ClusterSequenceAreaBase.o .libs/libfastjet_la-ClusterSequenceActiveAreaExplicitGhosts.o .libs/libfastjet_la-ClusterSequenceArea.o .libs/libfastjet_la-GhostedAreaSpec.o .libs/libfastjet_la-ClusterSequenceActiveArea.o .libs/libfastjet_la-Voronoi.o .libs/libfastjet_la-ClusterSequenceVoronoiArea.o .libs/libfastjet_la-ClusterSequencePassiveArea.o .libs/libfastjet_la-ClusterSequence1GhostPassiveArea.o .libs/libfastjet_la-PseudoJetStructureBase.o .libs/libfastjet_la-ClusterSequenceStructure.o .libs/libfastjet_la-BasicRandom.o .libs/libfastjet_la-JetDefinition.o .libs/libfastjet_la-Error.o .libs/libfastjet_la-AreaDefinition.o .libs/libfastjet_la-RangeDefinition.o .libs/libfastjet_la-CompositeJetStructure.o .libs/libfastjet_la-FunctionOfPseudoJet.o .libs/libfastjet_la-LimitedWarning.o .libs/libfastjet_la-TilingExtent.o .libs/libfastjet_la-LazyTiling9.o .libs/libfastjet_la-LazyTiling9Alt.o .libs/libfastjet_la-LazyTiling25.o .libs/libfastjet_la-LazyTiling9SeparateGhosts.o .libs/libfastjet_la-RectangularGrid.o   -L/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/11/../../.. -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/x86_64-linux-gnu/11/crtfastmath.o /usr/lib/gcc/x86_64-linux-gnu/11/crtendS.o /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/crtn.o  -g -O3 -msse3   -Wl,-soname -Wl,libfastjet.so.0 -o .libs/libfastjet.so.0.0.0

(on the other hand, for xz we set CFLAGS that apparently propagates to linking?)

Yes, CFLAGS is used both for compilation and for linking.

fwyzard commented 1 year ago

Do I understand correctly that since the ofast-flag sets only CXXFLAGS https://github.com/cms-sw/cmsdist/blob/d596fb4fbdea8d13b4c7478b8926ac36b3e296ca/scram-tools.file/tools/gcc/ofast-flag.xml#L1-L5 the -Ofast is not used at any link step?

No, it does propagate to the link step as well:

>> Building shared library tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/libRecoVertexPrimaryVertexProducer.so
/data/cmssw/el8_amd64_gcc11/external/gcc/11.2.1-f9b9dfdd886f71cd63f5538223d8f161/bin/c++ -O2 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++1z -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -fuse-ld=bfd -msse3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-deprecated-copy -Wno-unused-parameter -Wunused -Wparentheses -Wno-deprecated -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -Ofast -fno-reciprocal-math -mrecip=none -Wno-error=unused-variable -DBOOST_DISABLE_ASSERTS -flto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr -shared -Wl,-E -Wl,-z,defs tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/DAClusterizerInZ.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/DAClusterizerInZT_vect.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/DAClusterizerInZ_vect.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/GapClusterizerInZ.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/PrimaryVertexProducerAlgorithm.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/PrimaryVertexSorter.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/TrackFilterForPVFinding.cc.o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/VertexHigherPtSquared.cc.o -o tmp/el8_amd64_gcc11/src/RecoVertex/PrimaryVertexProducer/src/RecoVertexPrimaryVertexProducer/libRecoVertexPrimaryVertexProducer.so -Wl,-E -Wl,--hash-style=gnu -L/tmp/fwyzard/CMSSW_13_0_1/biglib/el8_amd64_gcc11 -L/tmp/fwyzard/CMSSW_13_0_1/lib/el8_amd64_gcc11 -L/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_1/biglib/el8_amd64_gcc11 -L/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_1/lib/el8_amd64_gcc11 -L/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_1/external/el8_amd64_gcc11/lib -L/tmp/fwyzard/CMSSW_13_0_1/static/el8_amd64_gcc11 -L/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_1/static/el8_amd64_gcc11 -lRecoVertexAdaptiveVertexFit -lRecoVertexKalmanVertexFit -lRecoVertexLinearizationPointFinders -lRecoVertexVertexTools -lRecoVertexVertexPrimitives -lTrackingToolsTransientTrack -lDataFormatsPatCandidates -lDataFormatsHLTReco -lDataFormatsBTauReco -lDataFormatsJetMatching -lDataFormatsL1TParticleFlow -lDataFormatsMETReco -lDataFormatsTauReco -lDataFormatsJetReco -lTrackingToolsGsfTools -lDataFormatsParticleFlowCandidate -lTrackingToolsPatternTools -lDataFormatsParticleFlowReco -lTrackingToolsTransientTrackingRecHit -lDataFormatsEgammaCandidates -lDataFormatsHcalIsolatedTrack -lDataFormatsL1TCorrelator -lDataFormatsL1TMuonPhase2 -lDataFormatsMuonReco -lTrackingToolsDetLayers -lDataFormatsL1TMuon -lDataFormatsRecoCandidate -lTrackingToolsGeomPropagators -lDataFormatsCSCDigi -lDataFormatsEgammaReco -lDataFormatsGsfTrackReco -lDataFormatsVertexReco -lSimDataFormatsTrackingAnalysis -lTrackingToolsTrajectoryState -lDataFormatsGEMDigi -lDataFormatsTrackReco -lDataFormatsGEMRecHit -lDataFormatsTrackCandidate -lDataFormatsTrackerRecHit2D -lTrackingToolsRecords -lDataFormatsCSCRecHit -lDataFormatsDTRecHit -lDataFormatsFTLRecHit -lDataFormatsTrajectorySeed -lRecoLocalTrackerRecords -lCalibTrackerRecords -lDataFormatsL1Trigger -lDataFormatsTrackingRecHit -lDataFormatsL1TrackTrigger -lMagneticFieldRecords -lCondFormatsDataRecord -lDataFormatsTrackerCommon -lGeometryCaloGeometry -lGeometryCommonTopologies -lCondFormatsAlignment -lDataFormatsBeamSpot -lDataFormatsCaloTowers -lDataFormatsEcalRecHit -lDataFormatsGeometryCommonDetAlgo -lDataFormatsHcalRecHit -lDataFormatsHepMCCandidate -lDataFormatsSiStripCluster -lGeometryRecords -lTrackingToolsAnalyticalJacobians -lCommonToolsStatistics -lCondFormatsAlignmentRecord -lDataFormatsCandidate -lDataFormatsDTDigi -lDataFormatsEcalDigi -lDataFormatsGeometrySurface -lDataFormatsHcalDigi -lDataFormatsL1GlobalCaloTrigger -lDataFormatsTrajectoryState -lMagneticFieldEngine -lDataFormatsCLHEP -lDataFormatsCaloRecHit -lDataFormatsEcalDetId -lDataFormatsForwardDetId -lDataFormatsGeometryVector -lDataFormatsHcalDetId -lDataFormatsL1CaloTrigger -lDataFormatsL1GlobalTrigger -lDataFormatsMuonDetId -lDataFormatsPhase2TrackerCluster -lDataFormatsSiPixelDetId -lDataFormatsSiStripDetId -lFWCoreFramework -lSimDataFormatsTrack -lSimDataFormatsVertex -lDataFormatsDetId -lDataFormatsFEDRawData -lDataFormatsL1GlobalMuonTrigger -lDataFormatsMath -lDataFormatsPhase2TrackerDigi -lDataFormatsScouting -lDataFormatsSiPixelCluster -lDataFormatsSiStripDigi -lFWCoreCommon -lFWCoreServiceRegistry -lSimDataFormatsGeneratorProducts -lDataFormatsCommon -lFWCoreParameterSet -lFWCoreMessageLogger -lDataFormatsProvenance -lFWCorePluginManager -lFWCoreReflection -lTrackingToolsTrajectoryParametrization -lCondFormatsSerialization -lFWCoreConcurrency -lFWCoreUtilities -lFWCoreVersion -lSimDataFormatsEncodedEventId -lPhysics -lHist -lMatrix -lGenVector -lMathMore -lTree -lNet -lThread -lMathCore -lRIO -lSmatrix -lboost_regex -lboost_serialization -lCore -lboost_thread -lboost_date_time -lCLHEP -lHepMCfio -lHepMC -lpcre -lbz2 -lgsl -luuid -ltbb -llzma -lz -lfmt -lHepMC3 -lHepMC3search -lcms-md5 -lopenblas -lcrypt -ldl -lrt -lstdc++fs -ltinyxml2
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ^^^^^^                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
makortel commented 1 year ago

I did some experimentation with the fptest.py script shown in the blog post https://moyix.blogspot.com/2022/09/someones-been-messing-with-my-subnormals.html

So in practice (nearly?) every CMSSW application runs by treating subnormals as zero.

fwyzard commented 1 year ago

So, questions...

makortel commented 1 year ago

So, questions...

Good ones...

  • is this the indented behaviour ?

I'd say it is at least surprising behavior. What I read about it now, the claim seems to be that not flushing subnormals to zero can lead to a significant performance degradation on vectorized computations. I guess we should somehow test how much our application(s) would really be affected.

I also wonder how much of our code is susceptible for the subnormal flushing to zero (in either zeroing resulting in loss of precision, or being effectively required for converge).

Anyway, I think we should do the decisions consciously, and it would be better if we'd be able to control it.

  • is there a simple way to check with an EDAnalyzer ?

Could something as simple as constructing a subnormal float and double and checking if it is nonzero work?

  • should we enable the same behaviour also for e.g. CUDA, adding -ftz to the NVCC flags ?

It could be an interesting test, in part to see if it has any impact on performance, and also to see if it has any noticeable impact in physics. I suppose in principle it would be good to keep the CPU and GPU compilation setup similar.

fwyzard commented 1 year ago

@VinInn do you have any suggestions ?

VinInn commented 1 year ago

I think that if we have even a single library compiled with Ofast we should flush subnormal to zero in the main. we can flush on GPUs as well. I have indeed a trivial test to check if subnormal are (not) flushed to zero.

 int one=1;
  assert(0==*(const float *)(&one));

or similar (https://github.com/VinInn/ctest/blob/master/floatPrec/sub.cpp)

fwyzard commented 1 year ago

Just for info: https://developer.nvidia.com/blog/cuda-pro-tip-flush-denormals-confidence/ .

fwyzard commented 1 year ago

int one=1; assert(0==(const float )(&one));

thanks... I've had to add a volatile to prevent the compiler from computing it a compile time :-)

fwyzard commented 1 year ago
bool check_for_denormals() {
  unsigned int one = 1;
  volatile float value;
  std::memcpy((void*) &value, (void*) &one, sizeof(float));
  return (value != 0.);
}
VinInn commented 1 year ago

of course we shall not use rsqrtf to avoid differences with CPU (we disabled on CPU as well as it differs between AMD and INTEL)

Moelf commented 1 year ago

FWIW, here's some other good reference on why we shouldn't use fastmath in anything but the most tightly controlled project (i.e. when you can know every single platform you run on and every single piece of code linked with your code :))

gartung commented 1 week ago

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522

cmsbuild commented 1 week ago

cms-bot internal usage

smuzaffar commented 1 week ago

As discussed yesterday (in Core SW meeting), I rebuilt lzma with -O3 and then (using root benchmark [a] ) compare it with lzma with -Ofast. With -O3 root benchmark test is 0.75% slow (8.269s vs 8.207s avg of 5 runs). The generated f.root file in both cases has same size.

I have no objection on building lzma with -O3

[a] https://github.com/root-project/root/pull/6144#issuecomment-671236231

// write_lzma.cpp
#include <ROOT/RDataFrame.hxx>                                                                                          
#include <TStopwatch.h>                                                                                                 
#include <iostream>                                                                                                     

int main() {                                                                                                            
   ROOT::RDF::RSnapshotOptions opts;                                                                                    
   opts.fCompressionAlgorithm =  ROOT::kLZMA;                                                                           
   opts.fCompressionLevel = 6;                                                                                          
   opts.fLazy = true;                                                                                                   
   auto trigger = ROOT::RDataFrame(50000000).Define("x", [] { return 42; }).Snapshot<int>("t", "f.root", {"x"}, opts);  
   TStopwatch s;                                                                                                        
   s.Start();                                                                                                           
   *trigger;                                                                                                            
   s.Stop();                                                                                                            
   std::cout << s.RealTime() << std::endl;                                                                              
   return 0;                                                                                                            
}