HSF / PyHEP.dev-workshops

PyHEP Developer workshops
https://indico.cern.ch/e/PyHEP2023.dev
BSD 3-Clause "New" or "Revised" License

Fitting tools, combined fits, partial wave analysis, and machine learning #5

Closed · jpivarski closed this 10 months ago

jpivarski commented 1 year ago

Extracting statistical results from data with systematics, correlations, etc. at large scale.

maxgalli commented 1 year ago

Interested in this topic! I'm mostly interested in investigating interoperability between Combine (the tool still used in CMS to perform combinations and fits in pretty much every analysis) and modern packages (pyhf, cabinetry, zfit, etc.).

rkansal47 commented 1 year ago

Also interested in discussing Python-based alternatives to Combine!

lgray commented 1 year ago

jaxfit @nsmith-

vgvassilev commented 1 year ago

I am interested in adopting in Combine the work we did on AD in RooFit.

cc: @grimmmyshini, @guitargeek, @sudo-panda, @davidlange6

pfackeldey commented 1 year ago

I'd like to discuss with experts of the different fitting libraries the benefits and potential drawbacks of using JAX PyTrees. I made a small (proof-of-concept for now) package (dilax), which implements binned likelihood fits with pure JAX and PyTrees. This enables vectorisation on many (new?) levels, such as multiple simultaneous fits for a likelihood profile on a GPU, etc. In addition, everything becomes differentiable by construction. After a discussion with @phinate, he started a GitHub discussion in pyhf: https://github.com/scikit-hep/pyhf/discussions/2196, where all the concepts are written out in greater detail.
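A minimal sketch of the kind of vectorization being described, assuming a toy two-parameter Poisson counting model; the names below are illustrative and are not the dilax API:

```python
import jax
import jax.numpy as jnp

# Toy two-parameter Poisson counting model (illustrative, not dilax).
signal = jnp.array([5.0, 10.0, 3.0])
background = jnp.array([50.0, 40.0, 30.0])

def nll(params, data):
    """Poisson NLL; `params` is a PyTree (here simply a dict of scalars)."""
    expected = params["mu"] * signal + params["theta"] * background
    return -jnp.sum(data * jnp.log(expected) - expected)

def fit(params0, data, lr=1e-2, steps=1000):
    """Plain gradient descent, written with lax.scan so the whole fit stays jit-able."""
    grad_fn = jax.grad(nll)
    def step(params, _):
        grads = grad_fn(params, data)
        return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads), None
    params, _ = jax.lax.scan(step, params0, None, length=steps)
    return params

# One vmap turns the *entire fit* into a batch of simultaneous fits, which is
# what makes likelihood profiles and toys cheap on a GPU.
toys = jnp.array([[57.0, 52.0, 31.0], [60.0, 48.0, 35.0], [55.0, 50.0, 30.0]])
batched_fit = jax.jit(jax.vmap(fit, in_axes=(None, 0)))
results = batched_fit({"mu": jnp.array(1.0), "theta": jnp.array(1.0)}, toys)
```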


ccochato commented 1 year ago

+1

redeboer commented 1 year ago

I'm interested in this topic, particularly fitting tools and amplitude analysis / PWA!

Some thoughts:

[^1]: Google seems to be trying something along those lines with sympy2jax. And (sorry to mention this at PyHEP), I have the impression that Julia could be used for such a workflow as well, with Symbolics.jl and JuliaDiff.

redeboer commented 1 year ago

**Suggestion for a new topical issue**

Another idea that is relevant to amplitude analysis, but that we may want to discuss more generally: last month, the PDG announced that they now offer an API. Their Python API is still under development, so I feel that we as a PyHEP community should get involved in its development. Perhaps @eduardo-rodrigues has thoughts on this? Is it worth creating a topical issue on this?

alexander-held commented 1 year ago

I'm also interested in this. Another aspect of this topic is orchestration of model construction / metadata handling, which ties in with earlier steps in an analysis workflow (and #4). Regarding AD: I'm also curious to learn more about how much of the functionality is exposed to users (i.e. can I easily take arbitrary derivatives myself, or are current implementations limited to internally providing derivatives with respect to parameters to the minimizer?).
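For the derivative question, a sketch of what "arbitrary derivatives" can look like once the NLL is a pure function in an AD framework; this is a toy Gaussian model, not tied to any particular fitting library:

```python
import jax
import jax.numpy as jnp

# Toy Gaussian NLL as a pure function of (parameters, data).
def nll(pars, data):
    mu, log_sigma = pars                     # log-parametrized width keeps sigma > 0
    z = (data - mu) / jnp.exp(log_sigma)
    return jnp.sum(0.5 * z**2 + log_sigma)

data = jnp.array([1.2, 0.8, 1.1, 0.9])
pars = jnp.array([1.0, 0.0])

grad = jax.grad(nll)(pars, data)             # gradient wrt parameters
hess = jax.hessian(nll)(pars, data)          # Hessian, e.g. for a covariance estimate
d_wrt_data = jax.grad(nll, argnums=1)(pars, data)  # derivative wrt the data itself
```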

@redeboer: probably best to open a new issue as this might get lost here and is not related to the thread. Presumably interesting e.g. for scikit-hep/particle.

eduardo-rodrigues commented 1 year ago

Hi folks. Thanks for the ping. I'm aware of the new PDG API and in fact in touch with Juerg, the director :-). I do need to find time to have a proper look and comment... But it is not forgotten, and it is indeed relevant how Particle sits/evolves vis-à-vis the new pdg package.

redeboer commented 1 year ago

> @redeboer: probably best to open a new issue as this might get lost here and is not related to the thread. Presumably interesting e.g. for scikit-hep/particle.

✅ --> https://github.com/scikit-hep/particle/issues/513

alexander-held commented 1 year ago

General question: what do we mean by ML under the heading of "fitting"?

One thing that fits into that box is simulation-based inference à la e.g. MadMiner or various anomaly detection methods.

phinate commented 1 year ago

Can't attend in person sadly, but would love to be involved in any discussions here if possible (timezones permitting)!

mdsokoloff commented 1 year ago

Hi All: I've been working on GooFit (https://github.com/GooFit/GooFit) for a decade now. One of its primary goals is doing time-dependent amplitude analyses with large data sets (hundreds of thousands to millions of events). While all the underlying code is C++, the package has Python bindings for most methods. In addition, the (Python) DecayLanguage package that lives in Scikit-HEP (https://github.com/scikit-hep/decaylanguage) produces CUDA code for GooFit from AmpGen decay descriptor files (https://github.com/GooFit/AmpGen).

GooFit sounds like RooFit, and its user interface mimics that of RooFit in many ways. It runs on NVIDIA GPUs, under OpenMP on x86 servers, and on single CPUs (the last is useful for debugging).

While GooFit has been used primarily for amplitude analyses, it can also be used effectively for coverage tests, fitting simple one-dimensional functions, etc.

I am very interested in using AD within GooFit. From preliminary discussions with experts, GooFit's architecture should allow us to use/adapt Clad (https://compiler-research.org/clad/) in a fairly straightforward way.

At the end of the day, we would like to make most of the functionality of GooFit available through Python interfaces that do not require users to develop new C++ code. It will be very interesting to see what a possible user community wants to do.

nsmith- commented 1 year ago

I'm very interested in a jax-based statistical inference package, towards both binned and un-binned fits.

In my experience attempting a jax port of the CMS Higgs combination, I found that the many un-vectorized parameters we have become a debilitating JIT-compilation bottleneck in jax. But this situation may have changed since I last checked in 2021.
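A sketch of the tracing issue being described, using hypothetical toy models rather than the actual Combine port: a Python loop over N scalar parameters traces O(N) ops, so jit compile time grows with N, while packing the parameters into one array keeps the trace size, and hence compile time, flat:

```python
import jax
import jax.numpy as jnp

n_params, n_bins = 2000, 50
response = jax.random.uniform(jax.random.PRNGKey(0), (n_bins, n_params))
data = 100.0 * jnp.ones(n_bins)

# Anti-pattern: one traced op per scalar parameter, so the trace (and jit
# compile time) grows with n_params.
def nll_scalars(params_list):
    expected = jnp.zeros(n_bins)
    for i, p in enumerate(params_list):
        expected = expected + p * response[:, i]
    return -jnp.sum(data * jnp.log(expected) - expected)

# Vectorized form: a single matmul regardless of n_params.
def nll_vector(params):
    expected = response @ params
    return -jnp.sum(data * jnp.log(expected) - expected)

grad_fast = jax.jit(jax.grad(nll_vector))
grad_fast(jnp.ones(n_params))  # compiles once, in roughly constant time
```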

phinate commented 1 year ago

> I'm very interested in a jax-based statistical inference package, towards both binned and un-binned fits.

@nsmith- Is this better-scoped as a statistical modelling package, where one would find the appropriate abstraction that fits both binned/unbinned paradigms? Inference would just be extra layers on minimization, which I've already abstracted in relaxed for the most common cases encountered in pyhf (profile likelihood-based testing) -- the only important API requirement is the existence of the .logpdf method. (Upper limits are a small extension over that with a root-finder).
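A sketch of that duck-typed contract, with hypothetical names throughout (this is not the relaxed API): any object exposing a .logpdf method can be pushed through profile-likelihood machinery built on jaxopt, which is the abstraction being described:

```python
import jax.numpy as jnp
import jaxopt

class GaussianModel:
    """Anything with a .logpdf(pars, data) method satisfies the contract."""
    def logpdf(self, pars, data):
        mu, log_sigma = pars                  # log-parametrized width keeps sigma > 0
        z = (data - mu) / jnp.exp(log_sigma)
        return jnp.sum(-0.5 * z**2 - log_sigma)

def profiled_logpdf(model, poi, data, nuis_init=jnp.array([0.0])):
    """Profile the nuisance parameters out at a fixed parameter of interest."""
    def objective(nuis):
        return -model.logpdf(jnp.concatenate([jnp.array([poi]), nuis]), data)
    fit = jaxopt.GradientDescent(fun=objective, maxiter=500).run(nuis_init)
    return -objective(fit.params)

data = jnp.array([1.2, 0.8, 1.1, 0.9])
# Building block for a profile-likelihood-ratio test statistic at mu = 1.0:
print(profiled_logpdf(GaussianModel(), 1.0, data))
```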

nsmith- commented 1 year ago

@phinate yes! I guess your relaxed is an implementation of https://github.com/scikit-hep/pyhf/issues/608 ?

phinate commented 1 year ago

> @phinate yes! I guess your relaxed is an implementation of https://github.com/scikit-hep/pyhf/issues/608 ?

Oh, I suppose so, in a not-well-tested kind of way :) Just asymptotic calcs though, and it probably needs a quick going-through to truly be agnostic to the model representation, but it is just a thin wrapper around jaxopt with HEP-like quantities/semantics!

would be happy to build this out more to support whatever model abstraction we can come up with!

JMolinaHN commented 1 year ago

Hi everyone, I wanted to bring up a key point concerning Amplitude Analysis: the integration of the Probability Density Function (PDF). The speed of convergence hinges significantly on this aspect, and it's why parallel processing becomes crucial, particularly for processing large datasets with intricate integrals. Tools like GooFit have been invaluable in this regard, standing out as some of the best available solutions for this type of processing.

However, given the advancements in today's computational capabilities, I believe it might be beneficial to explore alternative approaches. For instance, we could consider precomputing the integrals and devising an efficient method for accessing these values as necessary. Another potential strategy could be experimenting with a Chi-squared (Chi2) fit with reduced granularity. While this is typically quite fast, it does reintroduce the challenge of integration.
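For the precomputed-integral idea, a sketch under the common assumption that the amplitude is linear in the couplings, A(x; c) = Σᵢ cᵢ fᵢ(x): the integrals over the fixed basis terms can be cached once on a phase-space MC sample, making each normalization evaluation independent of the sample size (all values below are toy data):

```python
import numpy as np

# Toy stand-in for basis amplitudes f_i evaluated on a phase-space MC sample.
rng = np.random.default_rng(0)
n_waves, n_mc = 3, 100_000
f = rng.normal(size=(n_waves, n_mc)) + 1j * rng.normal(size=(n_waves, n_mc))

# Precompute once: I_ij = (1/N) * sum_k f_i(x_k) * conj(f_j(x_k)).
# The expensive sum over MC events never reappears inside the fit loop.
norm_integrals = (f @ f.conj().T) / n_mc

def normalization(c):
    """Integral of |A|^2 as a cheap bilinear form in the couplings c."""
    return np.real(c @ norm_integrals @ c.conj())

# Each evaluation inside the minimizer now costs O(n_waves^2), not O(n_mc).
c = np.array([1.0 + 0.0j, 0.5 - 0.2j, 0.0 + 0.1j])
print(normalization(c))
```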

Beyond these technical aspects, there's another issue I've been considering: the generalization and user-level accessibility of fitting tools. It often feels like we lack a consistent standard across fitting tools. For instance, finding a tool that effectively handles both B and D decays can be challenging. Similarly, analyzing decays of more than three bodies can become complex, often requiring custom or adapted code that can be hard to decipher.

We need to address the readability of these codes and work towards creating user-level code that interfaces with the base code. Again, I bring up GooFit as an example - it does a great job of shielding the user from the intricacies of CUDA code to perform an analysis. Despite this, I find that there's room for improvement in the user experience, and I believe it would be fruitful for us to discuss these issues during the workshop.

redeboer commented 1 year ago

> We need to address the readability of these codes and work towards creating user-level code that interfaces with the base code. Again, I bring up GooFit as an example - it does a great job of shielding the user from the intricacies of CUDA code to perform an analysis. Despite this, I find that there's room for improvement in the user experience, and I believe it would be fruitful for us to discuss these issues during the workshop.

I fully agree!

Would it be an idea to organise a dedicated session on amplitude analysis (UX and documentation specifically)? If so, who would be interested? @JMolinaHN @mdsokoloff @jonas-eschle?

JMolinaHN commented 1 year ago

@redeboer of course a discussion on amplitude analysis would be more than interesting! (In view of the latest results, I think we need it.) From my point of view, I refuse to think that a likelihood analysis can't be done for decays like Dpipipi or DKpipi. We all know those decays are challenging because of the pipi (in general, pp) system, but in some sense our tools should be adequate for (sensitive to) problems like that.

mattbellis commented 1 year ago

+1

ianna commented 1 year ago

+1

nikoladze commented 1 year ago

+1

jonas-eschle commented 1 year ago

> I'm very interested in a jax-based statistical inference package, towards both binned and un-binned fits.

> @nsmith- Is this better-scoped as a statistical modelling package, where one would find the appropriate abstraction that fits both binned/unbinned paradigms? Inference would just be extra layers on minimization, which I've already abstracted in relaxed for the most common cases encountered in pyhf (profile likelihood-based testing) -- the only important API requirement is the existence of the .logpdf method. (Upper limits are a small extension over that with a root-finder).

This is basically what zfit already solves: it combines binned and unbinned (and mixed) fits. I think it's crucially more than relaxed, which (afaiu) allows using histogram templates as an unbinned PDF, but there is more to it: analytic shapes, numerical integration & sampling methods, arbitrary correlations, etc.
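A rough sketch of what such a mixed binned+unbinned fit with a shared parameter looks like, written from memory of the zfit API; treat the binned-API names (with_binning, to_binned, BinnedNLL) as assumptions to verify against the zfit docs:

```python
import numpy as np
import zfit

obs = zfit.Space("x", limits=(-5, 5))
mu = zfit.Parameter("mu", 0.0, -1, 1)
sigma = zfit.Parameter("sigma", 1.0, 0.1, 5)
gauss = zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs)

# Unbinned channel: event-level data against the analytic shape.
data_unb = zfit.Data.from_numpy(obs=obs, array=np.random.normal(0.2, 1.0, 5_000))
nll_unb = zfit.loss.UnbinnedNLL(model=gauss, data=data_unb)

# Binned channel: the same PDF and data converted to a 50-bin histogram.
obs_b = obs.with_binning(50)
data_b = zfit.Data.from_numpy(
    obs=obs, array=np.random.normal(0.2, 1.0, 50_000)
).to_binned(obs_b)
nll_b = zfit.loss.BinnedNLL(model=gauss.to_binned(obs_b), data=data_b)

# Simultaneous fit: losses add, and correlations flow through shared parameters.
result = zfit.minimize.Minuit().minimize(nll_unb + nll_b)
```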

I also agree with the others; the three main topics that I see are:

nsmith- commented 1 year ago

> zfit already solves, it combines binned and unbinned (and mixed) fits

In this regard, zfit and RooFit are alone at the moment. What I would like to understand is how their representations of mixed binned-unbinned data compare/contrast.

As an aside, Combine also can produce unbinned "pseudo-Asimov" datasets to take advantage of asymptotic methods. Is this something done elsewhere? (I am just ignorant here)

> TensorFlow is partially more powerful than JAX

Curious about this!

alexander-held commented 1 year ago

> As an aside, Combine also can produce unbinned "pseudo-Asimov" datasets to take advantage of asymptotic methods.

@nsmith- I'm curious to learn more about this. Is this in the docs?

nsmith- commented 1 year ago

There is a brief discussion here: http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#asimov-datasets

matthewfeickert commented 1 year ago

This topic seems perhaps too broad, and while I expect that during the week it will split out organically across different areas, the areas where I think I'm most likely to spend time discussing are: