TuringLang / ParetoSmooth.jl

An implementation of PSIS algorithms in Julia.
http://turinglang.org/ParetoSmooth.jl/
MIT License

Remove all uses of Tullio.jl #72

Closed chriselrod closed 1 year ago

chriselrod commented 1 year ago

The motivation for this is that package load time grows superlinearly with the number of packages / amount of code loaded. For this reason, large packages with many dependencies can reach truly extreme load times. Not all packages display this problem, but Tullio does. For example, @time_imports shows:

  14424.8 ms  Tullio 0.30% compilation time
     56.5 ms  ParetoSmooth 53.39% compilation time (15% recompilation)

That is for a large package that uses ParetoSmooth. We would like to continue using ParetoSmooth, but without taking this load-time hit from Tullio.

Before, the overall load time was:

273.378616 seconds (343.75 M allocations: 20.100 GiB, 48.79% gc time, 10.88% compilation time: 23% of which was recompilation)

After:

262.427614 seconds (330.26 M allocations: 19.477 GiB, 47.88% gc time, 12.00% compilation time: 22% of which was recompilation)
chriselrod commented 1 year ago
└ @ TestEnv ~/.julia/packages/TestEnv/5v3ra/src/activate_set.jl:62
Basic Arrays: Error During Test at /home/chriselrod/.julia/dev/ParetoSmooth/test/tests/BasicTests.jl:6
  Got exception outside of a @test
  ArgumentError: No file exists at given path: data/Example_Log_Likelihood_Array.RData
  Stacktrace:
    [1] checkpath_load
      @ ~/.julia/packages/FileIO/aP78L/src/loadsave.jl:167 [inlined]
    [2] load(::String; options::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ FileIO ~/.julia/packages/FileIO/aP78L/src/loadsave.jl:110
    [3] load(::String)
      @ FileIO ~/.julia/packages/FileIO/aP78L/src/loadsave.jl:109
    [4] macro expansion
      @ ~/.julia/dev/ParetoSmooth/test/tests/BasicTests.jl:8 [inlined]
    [5] macro expansion
      @ ~/Documents/languages/juliamaster/usr/share/julia/stdlib/v1.10/Test/src/Test.jl:1504 [inlined]
    [6] top-level scope
      @ ~/.julia/dev/ParetoSmooth/test/tests/BasicTests.jl:8
    [7] include(fname::String)
      @ Base.MainInclude ./client.jl:478
    [8] macro expansion
      @ ~/.julia/dev/ParetoSmooth/test/runtests.jl:7 [inlined]
    [9] macro expansion
      @ ~/Documents/languages/juliamaster/usr/share/julia/stdlib/v1.10/Test/src/Test.jl:1504 [inlined]
   [10] top-level scope
      @ ~/.julia/dev/ParetoSmooth/test/runtests.jl:7
   [11] include(fname::String)
      @ Base.MainInclude ./client.jl:478
   [12] top-level scope
      @ none:1
   [13] eval
      @ ./boot.jl:370 [inlined]
   [14] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:280
   [15] _start()
      @ Base ./client.jl:541
WARNING: method definition for OneHotArray at /home/chriselrod/.julia/packages/MicroCollections/yJPLe/src/onehot.jl:79 declares type variable N but does not use it.
ERROR: LoadError: UndefVarError: `TypedSlot` not defined
Stacktrace:
 [1] getproperty(x::Module, f::Symbol)
   @ Base ./Base.jl:31
 [2] top-level scope
   @ ~/.julia/packages/Libtask/YVNtg/src/tapedfunction.jl:288
 [3] include(mod::Module, _path::String)
   @ Base ./Base.jl:467
 [4] include(x::String)
   @ Libtask ~/.julia/packages/Libtask/YVNtg/src/Libtask.jl:1
 [5] top-level scope
   @ ~/.julia/packages/Libtask/YVNtg/src/Libtask.jl:10
 [6] include
   @ ./Base.jl:467 [inlined]
 [7] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::String)
   @ Base ./loading.jl:2035
 [8] top-level scope
   @ stdin:2
in expression starting at /home/chriselrod/.julia/packages/Libtask/YVNtg/src/tapedfunction.jl:288
in expression starting at /home/chriselrod/.julia/packages/Libtask/YVNtg/src/Libtask.jl:1
in expression starting at stdin:2
ERROR: LoadError: Failed to precompile Libtask [6f1fad26-d15e-5dc8-ae53-837a1d7b8c9f] to "/home/chriselrod/.julia/compiled/v1.10/Libtask/jl_WIhhLB".
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
    @ Base ./loading.jl:2285
  [3] compilecache
    @ ./loading.jl:2152 [inlined]
  [4] _require(pkg::Base.PkgId, env::String)
    @ Base ./loading.jl:1796
  [5] _require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:1654
  [6] macro expansion
    @ ./loading.jl:1640 [inlined]
  [7] macro expansion
    @ ./lock.jl:267 [inlined]
  [8] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1603
  [9] include
    @ ./Base.jl:467 [inlined]
 [10] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::String)
    @ Base ./loading.jl:2035
 [11] top-level scope
    @ stdin:2
in expression starting at /home/chriselrod/.julia/packages/Turing/b6CoW/src/Turing.jl:1
in expression starting at stdin:2
ParetoSmooth.jl: Error During Test at /home/chriselrod/.julia/dev/ParetoSmooth/test/runtests.jl:5
  Got exception outside of a @test
  LoadError: Failed to precompile Turing [fce5fe82-541a-59a6-adf8-730c64b5f9a0] to "/home/chriselrod/.julia/compiled/v1.10/Turing/jl_s6eyBB".
  Stacktrace:
    [1] error(s::String)
      @ Base ./error.jl:35
    [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
      @ Base ./loading.jl:2285
    [3] compilecache
      @ ./loading.jl:2152 [inlined]
    [4] _require(pkg::Base.PkgId, env::String)
      @ Base ./loading.jl:1796
    [5] _require_prelocked(uuidkey::Base.PkgId, env::String)
      @ Base ./loading.jl:1654
    [6] macro expansion
      @ ./loading.jl:1640 [inlined]
    [7] macro expansion
      @ ./lock.jl:267 [inlined]
    [8] require(into::Module, mod::Symbol)
      @ Base ./loading.jl:1603
    [9] include(fname::String)
      @ Base.MainInclude ./client.jl:478
   [10] macro expansion
      @ ~/.julia/dev/ParetoSmooth/test/runtests.jl:8 [inlined]
   [11] macro expansion
      @ ~/Documents/languages/juliamaster/usr/share/julia/stdlib/v1.10/Test/src/Test.jl:1504 [inlined]
   [12] top-level scope
      @ ~/.julia/dev/ParetoSmooth/test/runtests.jl:7
   [13] include(fname::String)
      @ Base.MainInclude ./client.jl:478
   [14] top-level scope
      @ none:1
   [15] eval
      @ ./boot.jl:370 [inlined]
   [16] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:280
   [17] _start()
      @ Base ./client.jl:541
  in expression starting at /home/chriselrod/.julia/dev/ParetoSmooth/test/tests/TuringTests.jl:1
Test Summary:   | Error  Total     Time
ParetoSmooth.jl |     2      2  1m33.8s
  Basic Arrays  |     1      1     0.9s
ERROR: LoadError: Some tests did not pass: 0 passed, 0 failed, 2 errored, 0 broken.
in expression starting at /home/chriselrod/.julia/dev/ParetoSmooth/test/runtests.jl:5

I'm still getting failures locally, but the .RData error looks a little odd.

ParadaCarleton commented 1 year ago

I'm still getting failures locally, but the .RData error looks a little odd.

I think this means you're in the wrong path/directory. The CI checks should be more helpful.
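
(For reference, one way to make that load independent of the working directory is to resolve the path relative to the test file. This is only a sketch, assuming the data folder sits next to the test file; the actual layout of the test suite may differ.)

using FileIO  # the .RData file is read through FileIO (RData.jl provides the backend)

# Resolve the file relative to this test file rather than the current
# working directory, so the test finds it regardless of where it is run from.
const LOG_LIK_PATH = joinpath(@__DIR__, "data", "Example_Log_Likelihood_Array.RData")
log_lik_array = load(LOG_LIK_PATH)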

ParadaCarleton commented 1 year ago

The motivation for this is that package load time is super linear as a function of number of packages/amount of code loaded. For this reason, large packages with many dependencies can reach truly extreme load times.

Can you clarify what you mean/under what conditions this holds?

chriselrod commented 1 year ago

The motivation for this is that package load time is super linear as a function of number of packages/amount of code loaded. For this reason, large packages with many dependencies can reach truly extreme load times.

Can you clarify what you mean/under what conditions this holds?

You can try it here: https://github.com/JuliaComputing/OmniPackage.jl

julia> @time @time_imports using OmniPackage
[ Info: Precompiling OmniPackage [992a9bcd-042f-40bc-a011-a3ea62fdd9b8]
WARNING: Method definition adapt_storage(Type{var"#s1"} where var"#s1"<:(StaticArraysCore.SArray{S, T, N, L} where L where N where T), Array{T, N} where N where T) where {S} in module ArrayInterfaceStaticArraysCore at /home/chriselrod/.julia/packages/ArrayInterfaceStaticArraysCore/4vyfn/src/ArrayInterfaceStaticArraysCore.jl:31 overwritten in module AdaptStaticArraysExt at /home/chriselrod/.julia/packages/Adapt/xviDc/ext/AdaptStaticArraysExt.jl:6.
  ** incremental compilation may be fatally broken for this module **

WARNING: Method definition solve(SciMLBase.IntegralProblem{isinplace, P, F, L, U, K} where K where U where L where F where P where isinplace) in module Integrals at /home/chriselrod/.julia/packages/Integrals/9qNWp/src/Integrals.jl:111 overwritten at /home/chriselrod/.julia/packages/Integrals/9qNWp/src/Integrals.jl:60.
  ** incremental compilation may be fatally broken for this module **

      0.3 ms  Compat
      2.5 ms  ConstructionBase
     39.9 ms  InitialValues
      0.5 ms  Requires
      0.3 ms  DataValueInterfaces
      1.9 ms  DataAPI
      0.3 ms  IteratorInterfaceExtensions
      0.3 ms  TableTraits
     25.5 ms  Tables
     89.2 ms  MacroTools
      1.7 ms  ZygoteRules
      5.1 ms  StaticArraysCore
     37.0 ms  Setfield
     36.8 ms  BangBang
      6.2 ms  ProgressMeter
      1.7 ms  ConsoleProgressMonitor
      4.5 ms  LoggingExtras
      6.1 ms  ProgressLogging
    120.4 ms  DataStructures
      0.6 ms  SortingAlgorithms
     18.6 ms  Missings
      2.5 ms  DocStringExtensions
      9.2 ms  IrrationalConstants
      0.5 ms  LogExpFunctions
      0.4 ms  StatsAPI
     41.9 ms  StatsBase
      8.9 ms  LeftChildRightSiblingTrees
      1.1 ms  TerminalLoggers
      0.5 ms  Adapt
      0.6 ms  ArgCheck
     72.3 ms  Baselet
      0.3 ms  CompositionsBase
      0.3 ms  DefineSingletons
     18.5 ms  MicroCollections
     20.2 ms  SplittablesBase
     74.6 ms  Transducers
      9.8 ms  AbstractMCMC
      0.4 ms  Reexport
      0.4 ms  OpenLibm_jll
      0.8 ms  OpenSpecFun_jll
     28.5 ms  SpecialFunctions
      0.9 ms  Rmath_jll
      1.1 ms  Rmath
      0.4 ms  NaNMath
      2.8 ms  Calculus
     27.6 ms  DualNumbers
      1.4 ms  HypergeometricFunctions
      8.0 ms  StatsFuns
      0.4 ms  UnPack
     46.6 ms  AdvancedHMC
    308.7 ms  CSTParser
     32.1 ms  PooledArrays
    157.4 ms  SentinelArrays
     13.0 ms  InlineStrings
     33.6 ms  WeakRefStrings
      7.5 ms  TranscodingStreams
      0.4 ms  Zlib_jll
      1.7 ms  CodecZlib
     27.5 ms  FilePathsBase
      2.4 ms  WorkerUtilities
    776.2 ms  CSV
      0.4 ms  SuiteSparse
      5.0 ms  ArrayInterface
      0.6 ms  ArrayInterface → ArrayInterfaceStaticArraysCoreExt
     27.5 ms  FunctionWrappers
      0.4 ms  MuladdMacro
      0.7 ms  Parameters
      0.2 ms  IfElse
     85.2 ms  Static
     12.2 ms  StaticArrayInterface
      3.2 ms  ManualMemory
     11.7 ms  ThreadingUtilities
      0.5 ms  SIMDTypes
      9.6 ms  LayoutPointers
      8.2 ms  CloseOpenIntervals
    407.3 ms  StrideArraysCore
      0.4 ms  BitTwiddlingConvenienceFunctions
      1.2 ms  CpuId
     74.9 ms  CPUSummary 17.02% compilation time
     12.7 ms  PolyesterWeave
      2.2 ms  Polyester
      0.7 ms  FastBroadcast
    136.7 ms  ChainRulesCore
      0.8 ms  LogExpFunctions → LogExpFunctionsChainRulesCoreExt
      1.1 ms  SpecialFunctions → SpecialFunctionsChainRulesCoreExt
      0.4 ms  StatsFuns → StatsFunsChainRulesCoreExt
     82.1 ms  RecipesBase
      6.1 ms  ArrayInterfaceCore
      1.5 ms  SymbolicIndexingInterface
      0.7 ms  ArrayInterfaceStaticArraysCore
    289.5 ms  FillArrays
      4.7 ms  GPUArraysCore
      0.5 ms  ArrayInterface → ArrayInterfaceGPUArraysCoreExt
     52.2 ms  RecursiveArrayTools
      1.2 ms  DiffRules
      5.2 ms  DiffResults
   1863.2 ms  StaticArrays
      1.2 ms  ConstructionBase → StaticArraysExt
      0.4 ms  Adapt → AdaptStaticArraysExt
      2.0 ms  StaticArrayInterface → StaticArrayInterfaceStaticArraysExt
      0.7 ms  CommonSubexpressions
    186.6 ms  ForwardDiff
      0.7 ms  EnumX
      3.7 ms  PreallocationTools
      1.3 ms  FunctionWrappersWrappers
      0.3 ms  CommonSolve
      0.4 ms  ExprTools
      1.6 ms  RuntimeGeneratedFunctions
      0.3 ms  Tricks
     68.3 ms  Lazy
    273.5 ms  SciMLOperators
    363.6 ms  SciMLBase
     17.0 ms  DiffEqBase
      1.6 ms  LaTeXStrings
      1.2 ms  Formatting
    643.8 ms  Latexify 41.09% compilation time (52% recompilation)
     22.5 ms  RandomNumbers
      0.5 ms  TreeViews
      0.4 ms  PoissonRandom
      8.3 ms  SimpleTraits
     11.3 ms  ArnoldiMethod
      1.2 ms  Inflate
     82.4 ms  Graphs
     53.5 ms  JumpProcesses
     59.9 ms  LabelledArrays
   2580.5 ms  Unitful
      0.8 ms  DiffEqBase → DiffEqBaseUnitfulExt
     15.2 ms  Distances
     12.1 ms  FiniteDiff
     15.9 ms  NLSolversBase
     14.7 ms  LineSearches
      6.6 ms  NLsolve
     11.7 ms  DiffEqCallbacks
     13.5 ms  Combinatorics
     53.2 ms  PDMats
      9.1 ms  QuadGK
      3.9 ms  InverseFunctions
      0.4 ms  LogExpFunctions → LogExpFunctionsInverseFunctionsExt
      0.3 ms  StatsFuns → StatsFunsInverseFunctionsExt
      4.1 ms  DensityInterface
   1096.1 ms  Distributions
      0.5 ms  DiffEqBase → DiffEqBaseDistributionsExt
      0.5 ms  TermInterface
    527.1 ms  TimerOutputs
      0.6 ms  AutoHashEquals
     11.0 ms  Referenceables
     27.8 ms  ThreadsX
     58.8 ms  Metatheory
   2861.9 ms  MutableArithmetics
    625.8 ms  MultivariatePolynomials
     76.7 ms  DynamicPolynomials
      5.3 ms  Bijections
    832.1 ms  SymbolicUtils
      2.3 ms  Glob
      3.9 ms  URIs
   3924.8 ms  CommonMark
   2362.7 ms  JuliaFormatter
     95.2 ms  IntervalSets
      0.9 ms  ConstructionBase → IntervalSetsExt
      8.6 ms  CompositeTypes
    378.8 ms  DomainSets
    123.4 ms  RandomExtensions
     13.9 ms  GroupsCore
    991.0 ms  AbstractAlgebra
      0.5 ms  IntegerMathUtils
     39.7 ms  Primes
     34.6 ms  Groebner
     48.9 ms  LambertW
   3560.9 ms  Symbolics
    510.5 ms  SimpleNonlinearSolve
    260.5 ms  ModelingToolkit
    103.2 ms  Catalyst
    722.4 ms  FixedPointNumbers
    396.8 ms  ColorTypes 6.82% compilation time (100% recompilation)
   2557.6 ms  Colors
     15.2 ms  InvertedIndices
    626.6 ms  StringManipulation
   1148.3 ms  PrettyTables
  10152.4 ms  DataFrames
      3.3 ms  PositiveFactorizations
     76.5 ms  Optim
   1649.0 ms  MLStyle
      0.5 ms  Memoize
      0.4 ms  Underscores
     51.5 ms  LeastSquaresOptim
     31.8 ms  RegularizationTools
      9.9 ms  DataInterpolations
      0.6 ms  DataInterpolations → DataInterpolationsChainRulesCoreExt
      0.8 ms  DataInterpolations → DataInterpolationsSymbolicsExt
     44.3 ms  ShowCases
      4.9 ms  ContextVariablesX
      0.4 ms  FLoopsBase
      6.8 ms  PrettyPrint
      0.9 ms  NameResolution
     18.3 ms  JuliaVariables
     21.6 ms  FLoops
    286.3 ms  Accessors
      0.6 ms  Accessors → IntervalSetsExt
      0.5 ms  Accessors → StaticArraysExt
   2855.4 ms  FoldsThreads 2348.19% compilation time
     89.2 ms  NNlib 56.46% compilation time
      0.7 ms  SimpleNonlinearSolve → SimpleBatchedNonlinearSolveExt
      3.3 ms  DelimitedFiles
     28.2 ms  MLUtils
     28.5 ms  DataDrivenDiffEq
     58.2 ms  DistributedArrays
      1.0 ms  EllipsisNotation
     22.7 ms  GenericSchur
  20306.5 ms  ExponentialUtilities
     20.0 ms  AbstractFFTs
   1847.1 ms  FFTW_jll 99.91% compilation time (100% recompilation)
   1242.4 ms  FFTW
     17.8 ms  Functors
     34.1 ms  Optimisers
      0.5 ms  RealDot
    116.1 ms  StructArrays
      0.6 ms  Accessors → StructArraysExt
    935.1 ms  ChainRules
     81.9 ms  IRTools
   5347.6 ms  Zygote 79.25% compilation time (29% recompilation)
      0.6 ms  DiffEqBase → DiffEqBaseZygoteExt
     15.4 ms  CEnum
    866.5 ms  LLVMExtra_jll 99.41% compilation time (98% recompilation)
    277.3 ms  LLVM
   6695.3 ms  GPUCompiler 0.87% compilation time (88% recompilation)
   1546.7 ms  GPUArrays
     23.7 ms  BFloat16s
     26.6 ms  CUDA_Driver_jll 95.75% compilation time
      0.7 ms  CUDA_Runtime_jll
     31.7 ms  Random123
  13550.8 ms  CUDA 2.12% compilation time (100% recompilation)
      0.6 ms  ArrayInterface → ArrayInterfaceCUDAExt
   1082.4 ms  OneHotArrays
      0.6 ms  CUDNN_jll
    445.1 ms  cuDNN 83.79% compilation time (100% recompilation)
    366.9 ms  NNlibCUDA
    156.1 ms  Flux
      9.7 ms  ShiftedArrays
     62.8 ms  StatsModels
     25.9 ms  GLM
     26.1 ms  Sobol
      1.9 ms  LatinHypercubeSampling
     76.2 ms  LatticeRules
     37.3 ms  QuasiMonteCarlo
     27.8 ms  WoodburyMatrices
     75.1 ms  Ratios 30.42% compilation time
      0.6 ms  AxisAlgorithms
   1943.3 ms  OffsetArrays
      1.6 ms  StaticArrayInterface → StaticArrayInterfaceOffsetArraysExt
    456.7 ms  Interpolations 5.64% compilation time
     14.6 ms  KernelDensity
      0.5 ms  Trapz
     27.2 ms  GlobalSensitivity
   2116.3 ms  Roots
     39.2 ms  HypothesisTests
      0.6 ms  MonteCarloIntegration
     13.9 ms  HCubature
     15.3 ms  Integrals
      0.8 ms  Integrals → IntegralsForwardDiffExt
      0.9 ms  Integrals → IntegralsZygoteExt
     65.4 ms  HostCPUFeatures
   4043.4 ms  VectorizationBase
     18.0 ms  SLEEFPirates
   3715.7 ms  LoopVectorization
      0.6 ms  LoopVectorization → SpecialFunctionsExt
     10.5 ms  LoopVectorization → ForwardDiffExt
     10.3 ms  TriangularSolve
   1550.5 ms  RecursiveFactorization
    111.1 ms  IterativeSolvers
     22.5 ms  KLU
     26.5 ms  Sparspak
     36.8 ms  FastLapackInterface
    106.5 ms  Krylov
    271.3 ms  KrylovKit
   3797.6 ms  LinearSolve
     11.4 ms  Loess
     12.8 ms  Contour
   1824.8 ms  SIMD
     10.6 ms  ScanByte
     27.4 ms  Automa
      1.1 ms  Bzip2_jll
      0.8 ms  FreeType2_jll
     30.2 ms  FreeType
      0.7 ms  TensorCore
   2698.7 ms  ColorVectorSpace 0.38% compilation time (100% recompilation)
    208.2 ms  IterTools
     13.2 ms  Extents
      1.9 ms  GeoInterface
      0.9 ms  EarCut_jll
   7786.0 ms  GeometryBasics
   2351.3 ms  FreeTypeAbstraction
     63.1 ms  UnicodeFun
      0.6 ms  Scratch
      1.0 ms  RelocatableFolders
    130.1 ms  MathTeXEngine
      1.2 ms  FriBidi_jll
      0.9 ms  Libiconv_jll
      0.8 ms  Libffi_jll
      1.2 ms  XML2_jll
      0.9 ms  Gettext_jll
      0.3 ms  PCRE2_jll
      0.9 ms  Libmount_jll
      2.4 ms  Glib_jll
      0.7 ms  Pixman_jll
      0.7 ms  libpng_jll
      0.6 ms  Libuuid_jll
      0.7 ms  Expat_jll
      7.9 ms  Fontconfig_jll 84.67% compilation time
      1.0 ms  Libgpg_error_jll
      1.0 ms  Libgcrypt_jll
      1.5 ms  XSLT_jll
      0.8 ms  Xorg_libXau_jll
      0.7 ms  Xorg_libXdmcp_jll
      0.5 ms  Xorg_libpthread_stubs_jll
      5.4 ms  Xorg_libxcb_jll
      0.7 ms  Xorg_xtrans_jll
      1.4 ms  Xorg_libX11_jll
      1.1 ms  Xorg_libXext_jll
      1.0 ms  Xorg_libXrender_jll
      0.7 ms  LZO_jll
      3.3 ms  Cairo_jll
      0.9 ms  Graphite2_jll
      2.0 ms  HarfBuzz_jll
      1.4 ms  libass_jll
      0.8 ms  libfdk_aac_jll
      0.8 ms  LAME_jll
      0.6 ms  Ogg_jll
      1.1 ms  libvorbis_jll
      0.8 ms  libaom_jll
      0.7 ms  x264_jll
      0.7 ms  x265_jll
      1.1 ms  OpenSSL_jll
      0.6 ms  Opus_jll
      5.5 ms  FFMPEG_jll
      0.6 ms  FFMPEG
    320.6 ms  Observables
     30.9 ms  ColorSchemes
   3144.9 ms  PlotUtils
      0.7 ms  ColorBrewer
     15.6 ms  Packing
      0.7 ms  SignedDistanceFields
     41.9 ms  MakieCore
      1.1 ms  isoband_jll
      0.5 ms  Isoband
      1.1 ms  PolygonOps
   1725.5 ms  GridLayoutBase
  10900.5 ms  FileIO
     11.1 ms  IndirectArrays
      5.5 ms  LazyModules
      1.0 ms  ImageIO
     12.7 ms  TriplotBase
      1.4 ms  Qhull_jll
      0.7 ms  QhullMiniWrapper_jll
      0.3 ms  MiniQhull
      5.1 ms  TupleTools
     13.2 ms  StableHashTraits
      0.6 ms  Showoff
      0.8 ms  Match
     20.1 ms  Animations
 106419.1 ms  Makie
     11.0 ms  RangeArrays
     89.5 ms  AxisArrays
      0.8 ms  ScientificTypesBase
      0.7 ms  StatisticalTraits
      5.7 ms  MLJModelInterface
     11.6 ms  MCMCDiagnosticTools
      0.5 ms  NaturalSort
     25.2 ms  MCMCChains
   3006.9 ms  CategoricalArrays 80.00% compilation time (96% recompilation)
     16.2 ms  ScientificTypes
     32.8 ms  ComputationalResources
     77.9 ms  LossFunctions
    201.4 ms  MarchingCubes
  48412.2 ms  UnicodePlots
     22.3 ms  UnicodePlots → FreeTypeExt
      3.8 ms  UnicodePlots → UnitfulExt
     50.6 ms  CategoricalDistributions
   5186.8 ms  MLJBase
     19.0 ms  MLJEnsembles
     21.9 ms  MLJTuning
     44.8 ms  PrettyPrinting
   2158.2 ms  MLJModels
      0.9 ms  MbedTLS_jll
     22.7 ms  MbedTLS
     18.5 ms  BitFlags
     70.3 ms  OpenSSL
      1.0 ms  SimpleBufferStream
    209.8 ms  HTTP
     35.4 ms  ARFFFiles
    314.9 ms  OpenML 1.91% compilation time
     10.6 ms  EarlyStopping
     57.2 ms  IterationControl
    132.0 ms  MLJIteration
     23.5 ms  MLJ
     11.5 ms  BSON
   3294.3 ms  Metalhead
     57.1 ms  MLJFlux
    123.7 ms  SparseMatricesCSR
     54.5 ms  StructTypes
   6630.1 ms  JSON3
  10586.6 ms  Highlights
      9.7 ms  MyterialColors
  63325.5 ms  Term
    257.4 ms  XGBoost_jll 98.45% compilation time (100% recompilation)
     22.8 ms  XGBoost
      9.1 ms  MLJXGBoostInterface
      1.5 ms  VertexSafeGraphs
    226.8 ms  SparseDiffTools 72.76% compilation time (82% recompilation)
   4462.9 ms  NonlinearSolve
      0.9 ms  FastClosures
      6.5 ms  SciMLNLSolve
  36089.4 ms  OrdinaryDiffEq
     19.5 ms  PSIS
      3.2 ms  CovarianceEstimation
    472.4 ms  NamedDims
     10.4 ms  LazyStack
   3184.5 ms  AxisKeys
      0.8 ms  Accessors → AxisKeysExt
  26312.4 ms  Tullio 0.34% compilation time
    140.8 ms  ParetoSmooth 43.67% compilation time (15% recompilation)
      1.3 ms  ProvenanceBase
      8.8 ms  QuantileRegressions
      1.8 ms  ReadStat_jll
    482.7 ms  ReadStatTables
      1.0 ms  RecursiveTupleMath
    244.6 ms  VectorizedRNG
    170.4 ms  SimpleChains
     40.5 ms  Survival
      7.2 ms  ChangesOfVariables
      0.7 ms  LogExpFunctions → LogExpFunctionsChangesOfVariablesExt
      6.5 ms  ChangesOfVariables → ChangesOfVariablesChainRulesCoreExt
     66.8 ms  TransformVariables
      1.6 ms  OmniPackage
741.528610 seconds (313.92 M allocations: 15.696 GiB, 31.93% gc time, 10.89% compilation time: 12% of which was recompilation)

The circumstance is adding packages. The more packages you add, the worse it gets. Note that Makie.jl took >100 seconds! Term.jl took more than a minute. Tullio took 26 seconds.

This is on Julia master, but the problem exists at least since Julia 1.8. It is, however, much worse on Julia master, which has seriously regressed since Julia 1.9-beta.

This is a subset of the dependencies of a real package I use. This is just loading, without actually compiling the code that gets run later... It's a miracle I still have hair.

ParadaCarleton commented 1 year ago

The circumstance is adding packages. The more packages you add, the worse it gets. This is on Julia master, but the problem exists at least since Julia 1.8.

Right, but what was the change in 1.8? Is this just about regular TTFX, a problem in Tullio, or a Julia bug?

ParadaCarleton commented 1 year ago

Yep, no problems with removing Tullio. I'd still like to know what the problem is, so I can avoid it with other packages in the future.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 96.15%; project coverage change: +1.90% :tada:

Comparison is base (b1d603e) 82.59% compared to head (80f0b94) 84.50%.

Additional details and impacted files:

@@            Coverage Diff             @@
##             main      #72      +/-   ##
==========================================
+ Coverage   82.59%   84.50%   +1.90%
==========================================
  Files          11       11
  Lines         339      400      +61
==========================================
+ Hits          280      338      +58
- Misses         59       62       +3

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/GPD.jl | 92.85% <90.90%> (-3.30%) | :arrow_down: |
| src/LeaveOneOut.jl | 88.50% <97.14%> (+4.02%) | :arrow_up: |
| src/ESS.jl | 81.81% <100.00%> (+6.81%) | :arrow_up: |
| src/ImportanceSampling.jl | 80.00% <100.00%> (+0.65%) | :arrow_up: |
| src/NaiveLPD.jl | 80.95% <100.00%> (+7.61%) | :arrow_up: |
| src/ParetoSmooth.jl | 83.33% <100.00%> (+3.33%) | :arrow_up: |


chriselrod commented 1 year ago

Not a main contributor to this repo, but I support this PR. Based on the changes and the tests, Tullio does not seem to be needed, so removing it is a reasonable way to reduce dependencies and an easy fix for the latency Tullio introduces. In at least one case this PR also reduces allocations.

Tullio will multithread by default, which results in allocations and adds a little overhead. At large array sizes, multithreading will of course boost performance. To compensate for that, I tried to @inline the exp calls, which would be the dominant cost in most of this code. This allows them to SIMD.

julia> @inline exp_inline(x) = @inline exp(x)
exp_inline (generic function with 1 method)

julia> function mymap!(f, y, x)
           for i = eachindex(y,x)
               y[i] = f(x[i])
           end
       end
mymap! (generic function with 1 method)

julia> x = rand(512); y = similar(x);

julia> @btime mymap!(exp, $y, $x)
  2.448 μs (0 allocations: 0 bytes)

julia> @btime mymap!(exp_inline, $y, $x)
  423.693 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.10.0-DEV.648
Commit 6412a56223* (2023-02-22 04:07 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 36 on 36 virtual cores

julia> using LoopVectorization

julia> @btime vmap!(exp, $y, $x); # we could still get faster
  178.629 ns (0 allocations: 0 bytes)

Compared to multithreading, SIMD is "free" (we may want to use other threads for parallel processing), so IMO this is an improvement. Of course, doing both is always an option: we could add Threads.@threads to the outermost loops of the @tullio calls that didn't disable threading.
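
To make that concrete, here is a rough sketch of what one replaced kernel could look like with an explicit loop, the inlined exp, and an optional outer Threads.@threads. The function name, array shape (data points × draws × chains, say), and loop structure are hypothetical, not the actual ParetoSmooth code:

# Hypothetical kernel: exponentiate a 3-D array of log-weights in place.
@inline exp_inline(x) = @inline exp(x)

function exp_weights!(weights::AbstractArray{<:Real,3}, log_weights::AbstractArray{<:Real,3})
    # Thread only the outermost loop; the innermost loop stays a plain,
    # SIMD-able loop thanks to the inlined exp.
    Threads.@threads for k in axes(log_weights, 3)
        for j in axes(log_weights, 2)
            @inbounds @simd for i in axes(log_weights, 1)
                weights[i, j, k] = exp_inline(log_weights[i, j, k])
            end
        end
    end
    return weights
end

With a single Julia thread, the @threads macro should add only a small scheduling overhead, so the SIMD win is preserved either way.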

ParadaCarleton commented 1 year ago

Both is good. I'd definitely like to let users multithread.

chriselrod commented 1 year ago

Both is good. I'd definitely like to let users multithread.

Any good benchmarks?

ParadaCarleton commented 1 year ago

Both is good. I'd definitely like to let users multithread.

Any good benchmarks?

I don't have benchmarks, but for sufficiently large inputs multithreading is going to be faster. (Anything where it isn't is probably fast enough not to matter.)
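
(As a rough illustration of that crossover, reusing mymap! and exp_inline from the snippet above — the array sizes here are arbitrary, not ParetoSmooth benchmarks:)

using BenchmarkTools

# Threaded variant of the earlier serial mymap!.
function tmap!(f, y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] = f(x[i])
    end
    return y
end

# Small array: task-spawning overhead dominates, so serial should win.
x = rand(512); y = similar(x);
@btime mymap!(exp_inline, $y, $x);
@btime tmap!(exp_inline, $y, $x);

# Much larger array: the threaded version should pull ahead.
x = rand(2^22); y = similar(x);
@btime mymap!(exp_inline, $y, $x);
@btime tmap!(exp_inline, $y, $x);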

chriselrod commented 1 year ago

Both is good. I'd definitely like to let users multithread.

Any good benchmarks?

I don't have benchmarks, but for sufficiently large inputs multithreading is going to be faster. (Anything where it isn't, is probably fast enough not to matter.)

Do you want to see that change before we merge this and make a release? The current code should provide a performance advantage in most cases.

chriselrod commented 1 year ago

I don't know what representative benchmarks are here, but psis_loo, for example, does not benefit from multithreading currently. With 36 threads:

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         psis_loo(model(data), chain, source=:other)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               2.02e+09   35.4%  #  3.6 cycles per ns
└ task-clock               5.53e+08   35.4%  # 553.3 ms
┌ instructions             3.73e+09   68.8%  #  1.9 insns per cycle
│ branch-instructions      6.12e+08   68.8%  # 16.4% of insns
└ branch-misses            2.06e+06   68.8%  #  0.3% of branch insns
┌ L1-dcache-load-misses    3.97e+07   31.2%  #  3.6% of dcache loads
│ L1-dcache-loads          1.09e+09   31.2%
│ cache-misses             1.48e+07   31.2%  # 48.9% of cache refs
└ cache-references         3.04e+07   31.2%
                 aggregated from 36 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @benchmark psis_loo(model($data), $chain, source=:other)
BenchmarkTools.Trial: 13 samples with 1 evaluation.
 Range (min … max):  387.010 ms … 419.359 ms  ┊ GC (min … max): 9.56% … 10.50%
 Time  (median):     401.450 ms               ┊ GC (median):    9.23%
 Time  (mean ± σ):   402.801 ms ±  10.588 ms  ┊ GC (mean ± σ):  9.36% ±  0.73%

  ▁  ▁  ▁     ▁           ▁  █         ▁  ▁     █       ▁     ▁  
  █▁▁█▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁█ ▁
  387 ms           Histogram: frequency by time          419 ms <

 Memory estimate: 515.01 MiB, allocs estimate: 8724664.

With 1 thread:

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
                psis_loo(model(data), chain, source=:other)
              end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               1.65e+09   33.3%  #  3.6 cycles per ns
└ task-clock               4.61e+08   33.3%  # 461.5 ms
┌ instructions             3.78e+09   66.7%  #  2.3 insns per cycle
│ branch-instructions      6.31e+08   66.7%  # 16.7% of insns
└ branch-misses            2.24e+06   66.7%  #  0.4% of branch insns
┌ L1-dcache-load-misses    3.52e+07   33.3%  #  3.5% of dcache loads
│ L1-dcache-loads          1.01e+09   33.3%
│ cache-misses             1.42e+07   33.3%  # 65.4% of cache refs
└ cache-references         2.17e+07   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @benchmark psis_loo(model($data), $chain, source=:other)
BenchmarkTools.Trial: 12 samples with 1 evaluation.
 Range (min … max):  410.551 ms … 431.658 ms  ┊ GC (min … max): 8.07% … 8.15%
 Time  (median):     419.260 ms               ┊ GC (median):    7.53%
 Time  (mean ± σ):   419.278 ms ±   6.346 ms  ┊ GC (mean ± σ):  7.62% ± 0.35%

  █         ▁   ▁     ▁   ▁▁▁   ▁       ▁          ▁          ▁  
  █▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁█▁▁▁███▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█ ▁
  411 ms           Histogram: frequency by time          432 ms <

 Memory estimate: 504.46 MiB, allocs estimate: 8278898.
chriselrod commented 1 year ago

I looked a little at the code. There is a lot of low-hanging fruit that should be taken care of before seriously considering multithreading.

Descending into the code (with Cthulhu), I almost immediately see

pointwise_log_likelihoods(model::DynamicPPL.Model, chains::Chains) in ParetoSmooth at /home/chriselrod/.julia/dev/ParetoSmooth/src/TuringHelpers.jl:23
Body::Any
1 ─ %1  = invoke MCMCChains.Chains(chains::Chains{Float64, AxisArrays.AxisArray{Float64, 3, Array{Float64, 3}, Tuple{AxisArrays.Axis{:iter, StepRange{Int64, Int64}}, AxisArrays.Axis{:var, Vector{Symbol}}, AxisArrays.Axis{:chain, UnitRange{Int64}}}}, Missing, NamedTuple{(:parameters, :internals), Tuple{Vector{Symbol}, Vector{Symbol}}}, NamedTuple{(:start_time, :stop_time), Tuple{Vector{Float64}, Vector{Float64}}}}, (:parameters,)::Tuple{Symbol})::Any
│   %2  = DynamicPPL.pointwise_loglikelihoods::Core.Const(DynamicPPL.pointwise_loglikelihoods)
│   %3  = (%2)(model, %1)::Dict
│   %4  = ParetoSmooth.first(%3)::Pair
│   %5  = Base.getfield(%4, :second)::Any
│   %6  = ParetoSmooth.size(%5)::Any
│   %7  = ParetoSmooth.keys(%3)::Base.KeySet
│   %8  = ParetoSmooth.collect(%7)::Vector
│   %9  = (Base.var"#sort##kw"())($(QuoteNode((by = ParetoSmooth.var"#ind_from_string#62"(),))), ParetoSmooth.sort, %8)::Any
│   %10 = ParetoSmooth.:(var"#61#63")::Core.Const(ParetoSmooth.var"#61#63")
│   %11 = Core.typeof(%6)::DataType
│   %12 = Core.typeof(%3)::Type{Dict{_A, _B}} where {_A, _B}
│   %13 = Core.apply_type(%10, %11, %12)::Type{ParetoSmooth.var"#61#63"{_A, _B}} where {_A, _B}
│   %14 = %new(%13, %6, %3)::ParetoSmooth.var"#61#63"
│   %15 = Base.Generator(%14, %9)::Base.Generator
│   %16 = Base.collect(%15)::Any
│   %17 = ParetoSmooth.reduce(ParetoSmooth.vcat, %16)::Any
└──       return %17
Select a call to descend into or ↩ to ascend. [q]uit. [b]ookmark.
Toggles: [o]ptimize, [w]arn, [h]ide type-stable statements, [d]ebuginfo, [r]emarks, [e]ffects, [i]nlining costs, [t]ype annotations, [s]yntax highlight for Source/LLVM/Native.
Show: [S]ource code, [A]ST, [T]yped code, [L]LVM IR, [N]ative code
Actions: [E]dit source code, [R]evise and redisplay
Advanced: dump [P]arams cache.
   %1 = invoke Chains(::Chains{…},::Tuple{…})::Any
 • %3 = call → pointwise_loglikelihoods(::Model{…},::Any)::Dict
   %4 = call first(::Dict)::Pair
   %7 = call keys(::Dict)::Base.KeySet
   %8 = call collect(::Base.KeySet)::Vector
   %9 = call sort##kw(::NamedTuple{(:by,), Tuple{ParetoSmooth.var"#ind_from_string#62"}},::typeof(sort),::Vector)::Any
   %15 = call Generator(::ParetoSmooth.var"#61#63",::Any)::Base.Generator

Hopefully the internal core is better, but it's worth noting that it still multithreads to lightly use all 36 available threads with only minor improvement in runtime vs single threaded. If you're concerned about performance, there is a lot you can work on here before resorting to multiple threads. And if you do want to use multiple threads, I would recommend an approach different from Tullio's, hopefully more granular, so you can see larger gains.
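
For instance, the Anys above could be contained with a function barrier: convert the abstractly typed values coming out of the Dict into a concretely typed array in a small wrapper, and let a separate kernel specialize on it. This is a sketch of the pattern only — the names, the Float64 assumption, and the simple lexicographic sort are hypothetical, not the actual TuringHelpers code:

# Unstable outer function: pointwise_loglikelihoods returns a Dict with
# abstractly typed values, so inference gives up here. That is fine, as long
# as the expensive work happens behind the barrier below.
function gather_loglikes(ll_dict::Dict)
    ks = sort!(collect(keys(ll_dict)))
    mats = [Matrix{Float64}(ll_dict[k]) for k in ks]  # concrete element type
    return _stack_loglikes(mats)  # function barrier: kernel specializes on Vector{Matrix{Float64}}
end

# Stable kernel: everything in here infers concretely.
_stack_loglikes(mats::Vector{Matrix{Float64}}) = reduce(vcat, mats)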

ParadaCarleton commented 1 year ago

Hopefully the internal core is better, but it's worth noting that it still multithreads to lightly use all 36 available threads with only minor improvement in runtime vs single threaded.

It should be; I benchmarked the internals to find bottlenecks and just focused on that. Some of the code for array wrangling is pretty ugly/inefficient because this was my first serious experience programming in Julia.

chriselrod commented 1 year ago

Hopefully the internal core is better, but it's worth noting that it still multithreads to lightly use all 36 available threads with only minor improvement in runtime vs single threaded.

It should be; I benchmarked the internals to find bottlenecks and just focused on that. Some of the code for array wrangling is pretty ugly/inefficient because this was my first serious experience programming in Julia.

The number of allocations looks extremely high, so I am worried that there is an inefficiency in some hot part of the code.
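
(One way to pin that down is the standard-library allocation profiler available since Julia 1.8 — a sketch only, with an arbitrary sample rate:)

using Profile

# Run once first so compilation-time allocations don't pollute the profile.
psis_loo(model(data), chain, source=:other)

Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=0.01 psis_loo(model(data), chain, source=:other)

results = Profile.Allocs.fetch()
# results.allocs holds allocation records (type, size, stack trace);
# PProf.jl can render them as a browsable flame graph via PProf.Allocs.pprof(results).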