HSF / PyHEP.dev-workshops

PyHEP Developer workshops
https://indico.cern.ch/e/PyHEP2023.dev
BSD 3-Clause "New" or "Revised" License

Fitting tools, combined fits, partial wave analysis, and machine learning #5

Closed · jpivarski closed this 10 months ago

jpivarski commented 1 year ago

Extracting statistical results from data with systematics, correlations, etc. at large scale.

maxgalli commented 1 year ago

Interested in this topic! I'm mostly interested in investigating interoperability between Combine (the tool still used in CMS to perform combinations and fits in pretty much every analysis) and modern packages (pyhf, cabinetry, zfit, etc.).

rkansal47 commented 1 year ago

Also interested in discussing Python-based alternatives to Combine!

lgray commented 1 year ago

jaxfit @nsmith-

vgvassilev commented 1 year ago

I am interested in adopting in Combine the work we did on AD in RooFit.

cc: @grimmmyshini, @guitargeek, @sudo-panda, @davidlange6

pfackeldey commented 1 year ago

I'd like to discuss with experts of the different fitting libraries the benefits and potential drawbacks of using JAX PyTrees. I made a small (proof-of-concept for now) package (dilax), which implements binned likelihood fits with pure JAX and PyTrees. This enables vectorisation on many (new?) levels, such as multiple simultaneous fits for a likelihood profile on a GPU, etc. In addition, everything becomes differentiable by construction. After a discussion with @phinate, he started a GitHub discussion in pyhf: https://github.com/scikit-hep/pyhf/discussions/2196, where all the concepts are written out in greater detail.
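A minimal sketch of the kind of vectorization being described, assuming a toy two-parameter Poisson counting model; the names below are illustrative and are not the dilax API:

```python
import jax
import jax.numpy as jnp

# Toy two-parameter Poisson counting model (illustrative, not dilax).
signal = jnp.array([5.0, 10.0, 3.0])
background = jnp.array([50.0, 40.0, 30.0])

def nll(params, data):
    """Poisson NLL; `params` is a PyTree (here simply a dict of scalars)."""
    expected = params["mu"] * signal + params["theta"] * background
    return -jnp.sum(data * jnp.log(expected) - expected)

def fit(params0, data, lr=1e-2, steps=1000):
    """Plain gradient descent, written with lax.scan so the whole fit stays jit-able."""
    grad_fn = jax.grad(nll)
    def step(params, _):
        grads = grad_fn(params, data)
        return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads), None
    params, _ = jax.lax.scan(step, params0, None, length=steps)
    return params

# One vmap turns the *entire fit* into a batch of simultaneous fits, which is
# what makes likelihood profiles and toys cheap on a GPU.
toys = jnp.array([[57.0, 52.0, 31.0], [60.0, 48.0, 35.0], [55.0, 50.0, 30.0]])
batched_fit = jax.jit(jax.vmap(fit, in_axes=(None, 0)))
results = batched_fit({"mu": jnp.array(1.0), "theta": jnp.array(1.0)}, toys)
```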


ccochato commented 1 year ago

+1

redeboer commented 1 year ago

I'm interested in this topic, particularly fitting tools and amplitude analysis / PWA!

Some thoughts:

[^1]: Google seems to be trying something along those lines with sympy2jax. And (sorry to mention this at PyHEP), I have the impression that Julia could be used for such a workflow as well, with Symbolics.jl and JuliaDiff.

redeboer commented 1 year ago

**Suggestion for a new topical issue**

Another idea that is relevant to amplitude analysis, but that we may want to discuss more generally: last month, the PDG announced that they now offer an API. Their Python API is still under development, so I feel that we as a PyHEP community should get involved in its development. Perhaps @eduardo-rodrigues has thoughts on this? Is it worth creating a topical issue on this?

alexander-held commented 1 year ago

I'm also interested in this. Another aspect of this topic is orchestration of model construction / metadata handling, which ties in with earlier steps in an analysis workflow (and #4). Regarding AD: I'm also curious to learn more about how much of the functionality is exposed to users (i.e. can I easily take arbitrary derivatives myself, or are current implementations limited to internally providing derivatives with respect to parameters to the minimizer?).
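For the derivative question, a sketch of what "arbitrary derivatives" can look like once the NLL is a pure function in an AD framework; this is a toy Gaussian model, not tied to any particular fitting library:

```python
import jax
import jax.numpy as jnp

# Toy Gaussian NLL as a pure function of (parameters, data).
def nll(pars, data):
    mu, log_sigma = pars                     # log-parametrized width keeps sigma > 0
    z = (data - mu) / jnp.exp(log_sigma)
    return jnp.sum(0.5 * z**2 + log_sigma)

data = jnp.array([1.2, 0.8, 1.1, 0.9])
pars = jnp.array([1.0, 0.0])

grad = jax.grad(nll)(pars, data)             # gradient wrt parameters
hess = jax.hessian(nll)(pars, data)          # Hessian, e.g. for a covariance estimate
d_wrt_data = jax.grad(nll, argnums=1)(pars, data)  # derivative wrt the data itself
```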

@redeboer: probably best to open a new issue as this might get lost here and is not related to the thread. Presumably interesting e.g. for scikit-hep/particle.

eduardo-rodrigues commented 1 year ago

Hi folks. Thanks for the ping. I'm aware of the new PDG API and in fact in touch with Juerg, the director :-). I do need to find time to have a proper look and comment... But it is not forgotten, and it is indeed relevant how Particle sits/evolves vis-à-vis the new pdg package.

redeboer commented 1 year ago

> @redeboer: probably best to open a new issue as this might get lost here and is not related to the thread. Presumably interesting e.g. for scikit-hep/particle.

✅ --> https://github.com/scikit-hep/particle/issues/513

alexander-held commented 1 year ago

General question: what do we mean by ML under the heading of "fitting"?

One thing that fits into that box is simulation-based inference à la e.g. MadMiner or various anomaly detection methods.

phinate commented 1 year ago

Can't attend in person sadly, but would love to be involved in any discussions here if possible (timezones permitting)!

mdsokoloff commented 1 year ago

Hi All: I've been working on GooFit (https://github.com/GooFit/GooFit) for a decade now. One of its primary goals is doing time-dependent amplitude analyses with large data sets (hundreds of thousands to millions of events). While all the underlying code is C++, the package has Python bindings for most methods. In addition, the (Python) DecayLanguage package that lives in Scikit-HEP (https://github.com/scikit-hep/decaylanguage) produces CUDA code for GooFit from AmpGen decay descriptor files (https://github.com/GooFit/AmpGen).

GooFit sounds like RooFit, and its user interface mimics that of RooFit in many ways. It runs on NVIDIA GPUs, under OpenMP on x86 servers, and on single CPUs (the last is useful for debugging).

While GooFit has been used primarily for amplitude analyses, it can also be used effectively for coverage tests, fitting simple one-dimensional functions, etc.

I am very interested in using AD within GooFit. From preliminary discussions with experts, GooFit's architecture should allow us to use/adapt Clad (https://compiler-research.org/clad/) in a fairly straightforward way.

At the end of the day, we would like to make most of the functionality of GooFit available through Python interfaces that do not require users to develop new C++ code. It will be very interesting to see what a possible user community wants to do.

nsmith- commented 1 year ago

I'm very interested in a jax-based statistical inference package, towards both binned and un-binned fits.

In my experience attempting a jax port of the CMS Higgs combination, I found that the many un-vectorized parameters we have become a debilitating JIT-compilation bottleneck in jax. But this situation may have changed since I last checked in 2021.
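A sketch of the tracing issue being described, using hypothetical toy models rather than the actual Combine port: a Python loop over N scalar parameters traces O(N) ops, so jit compile time grows with N, while packing the parameters into one array keeps the trace size, and hence compile time, flat:

```python
import jax
import jax.numpy as jnp

n_params, n_bins = 2000, 50
response = jax.random.uniform(jax.random.PRNGKey(0), (n_bins, n_params))
data = 100.0 * jnp.ones(n_bins)

# Anti-pattern: one traced op per scalar parameter, so the trace (and jit
# compile time) grows with n_params.
def nll_scalars(params_list):
    expected = jnp.zeros(n_bins)
    for i, p in enumerate(params_list):
        expected = expected + p * response[:, i]
    return -jnp.sum(data * jnp.log(expected) - expected)

# Vectorized form: a single matmul regardless of n_params.
def nll_vector(params):
    expected = response @ params
    return -jnp.sum(data * jnp.log(expected) - expected)

grad_fast = jax.jit(jax.grad(nll_vector))
grad_fast(jnp.ones(n_params))  # compiles once, in roughly constant time
```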

phinate commented 1 year ago

> I'm very interested in a jax-based statistical inference package, towards both binned and un-binned fits.

@nsmith- Is this better-scoped as a statistical modelling package, where one would find the appropriate abstraction that fits both binned/unbinned paradigms? Inference would just be extra layers on minimization, which I've already abstracted in relaxed for the most common cases encountered in pyhf (profile likelihood-based testing) -- the only important API requirement is the existence of the .logpdf method. (Upper limits are a small extension over that with a root-finder).
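A sketch of that duck-typed contract, with hypothetical names throughout (this is not the relaxed API): any object exposing a .logpdf method can be pushed through profile-likelihood machinery built on jaxopt, which is the abstraction being described:

```python
import jax.numpy as jnp
import jaxopt

class GaussianModel:
    """Anything with a .logpdf(pars, data) method satisfies the contract."""
    def logpdf(self, pars, data):
        mu, log_sigma = pars                  # log-parametrized width keeps sigma > 0
        z = (data - mu) / jnp.exp(log_sigma)
        return jnp.sum(-0.5 * z**2 - log_sigma)

def profiled_logpdf(model, poi, data, nuis_init=jnp.array([0.0])):
    """Profile the nuisance parameters out at a fixed parameter of interest."""
    def objective(nuis):
        return -model.logpdf(jnp.concatenate([jnp.array([poi]), nuis]), data)
    fit = jaxopt.GradientDescent(fun=objective, maxiter=500).run(nuis_init)
    return -objective(fit.params)

data = jnp.array([1.2, 0.8, 1.1, 0.9])
# Building block for a profile-likelihood-ratio test statistic at mu = 1.0:
print(profiled_logpdf(GaussianModel(), 1.0, data))
```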

nsmith- commented 1 year ago

@phinate yes! I guess your relaxed is an implementation of https://github.com/scikit-hep/pyhf/issues/608 ?

phinate commented 1 year ago

> @phinate yes! I guess your relaxed is an implementation of https://github.com/scikit-hep/pyhf/issues/608 ?

Oh, I suppose so, in a not-well-tested kind of way :) Just asymptotic calcs though, and it probably needs a quick going-through to truly be agnostic to the model representation, but it is just a thin wrapper around jaxopt with HEP-like quantities/semantics!

would be happy to build this out more to support whatever model abstraction we can come up with!

JMolinaHN commented 1 year ago

Hi everyone, I wanted to bring up a key point concerning Amplitude Analysis: the integration of the Probability Density Function (PDF). The speed of convergence hinges significantly on this aspect, and it's why parallel processing becomes crucial, particularly for processing large datasets with intricate integrals. Tools like GooFit have been invaluable in this regard, standing out as some of the best available solutions for this type of processing.

However, given the advancements in today's computational capabilities, I believe it might be beneficial to explore alternative approaches. For instance, we could consider precomputing the integrals and devising an efficient method for accessing these values as necessary. Another potential strategy could be experimenting with a Chi-squared (Chi2) fit with reduced granularity. While this is typically quite fast, it does reintroduce the challenge of integration.
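For the precomputed-integral idea, a sketch under the common assumption that the amplitude is linear in the couplings, A(x; c) = Σᵢ cᵢ fᵢ(x): the integrals over the fixed basis terms can be cached once on a phase-space MC sample, making each normalization evaluation independent of the sample size (all values below are toy data):

```python
import numpy as np

# Toy stand-in for basis amplitudes f_i evaluated on a phase-space MC sample.
rng = np.random.default_rng(0)
n_waves, n_mc = 3, 100_000
f = rng.normal(size=(n_waves, n_mc)) + 1j * rng.normal(size=(n_waves, n_mc))

# Precompute once: I_ij = (1/N) * sum_k f_i(x_k) * conj(f_j(x_k)).
# The expensive sum over MC events never reappears inside the fit loop.
norm_integrals = (f @ f.conj().T) / n_mc

def normalization(c):
    """Integral of |A|^2 as a cheap bilinear form in the couplings c."""
    return np.real(c @ norm_integrals @ c.conj())

# Each evaluation inside the minimizer now costs O(n_waves^2), not O(n_mc).
c = np.array([1.0 + 0.0j, 0.5 - 0.2j, 0.0 + 0.1j])
print(normalization(c))
```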

Beyond these technical aspects, there's another issue I've been considering: the generalization and user-level accessibility of fitting tools. It often feels like we lack a consistent standard across fitting tools. For instance, finding a tool that effectively handles both B and D decays can be challenging. Similarly, analyzing decays of more than three bodies can become complex, often requiring custom or adapted code that can be hard to decipher.

We need to address the readability of these codes and work towards creating user-level code that interfaces with the base code. Again, I bring up GooFit as an example - it does a great job of shielding the user from the intricacies of CUDA code to perform an analysis. Despite this, I find that there's room for improvement in the user experience, and I believe it would be fruitful for us to discuss these issues during the workshop.

redeboer commented 1 year ago

> We need to address the readability of these codes and work towards creating user-level code that interfaces with the base code. Again, I bring up GooFit as an example - it does a great job of shielding the user from the intricacies of CUDA code to perform an analysis. Despite this, I find that there's room for improvement in the user experience, and I believe it would be fruitful for us to discuss these issues during the workshop.

I fully agree!

Would it be an idea to organise a dedicated session on amplitude analysis (UX and documentation specifically)? If so, who would be interested? @JMolinaHN @mdsokoloff @jonas-eschle?

JMolinaHN commented 1 year ago

@redeboer of course a discussion on amplitude analysis would be more than interesting! (In view of the latest results, I think we need it.) From my point of view, I refuse to think that a likelihood analysis can't be done for decays like Dpipipi or DKpipi. We all know those decays are challenging because of the pipi (in general, pp) system, but in some sense our tools should be adequate for (sensitive to) problems like that.

mattbellis commented 1 year ago

+1

ianna commented 1 year ago

+1

nikoladze commented 1 year ago

+1

jonas-eschle commented 1 year ago

> I'm very interested in a jax-based statistical inference package, towards both binned and un-binned fits.

> @nsmith- Is this better-scoped as a statistical modelling package, where one would find the appropriate abstraction that fits both binned/unbinned paradigms? Inference would just be extra layers on minimization, which I've already abstracted in relaxed for the most common cases encountered in pyhf (profile likelihood-based testing) -- the only important API requirement is the existence of the .logpdf method. (Upper limits are a small extension over that with a root-finder).

This is basically what zfit already solves: it combines binned and unbinned (and mixed) fits. I think it's crucially more than relaxed, which (afaiu) allows using histogram templates as an unbinned PDF, but there is more to it: analytic shapes, numerical integration & sampling methods, arbitrary correlations, etc.
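A rough sketch of what such a mixed binned+unbinned fit with a shared parameter looks like, written from memory of the zfit API; treat the binned-API names (with_binning, to_binned, BinnedNLL) as assumptions to verify against the zfit docs:

```python
import numpy as np
import zfit

obs = zfit.Space("x", limits=(-5, 5))
mu = zfit.Parameter("mu", 0.0, -1, 1)
sigma = zfit.Parameter("sigma", 1.0, 0.1, 5)
gauss = zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs)

# Unbinned channel: event-level data against the analytic shape.
data_unb = zfit.Data.from_numpy(obs=obs, array=np.random.normal(0.2, 1.0, 5_000))
nll_unb = zfit.loss.UnbinnedNLL(model=gauss, data=data_unb)

# Binned channel: the same PDF and data converted to a 50-bin histogram.
obs_b = obs.with_binning(50)
data_b = zfit.Data.from_numpy(
    obs=obs, array=np.random.normal(0.2, 1.0, 50_000)
).to_binned(obs_b)
nll_b = zfit.loss.BinnedNLL(model=gauss.to_binned(obs_b), data=data_b)

# Simultaneous fit: losses add, and correlations flow through shared parameters.
result = zfit.minimize.Minuit().minimize(nll_unb + nll_b)
```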

I also agree with the others; the three main topics that I see are:

nsmith- commented 1 year ago

> zfit already solves, it combines binned and unbinned (and mixed) fits

In this regard, zfit and RooFit are alone at the moment. What I would like to understand is how their representations of mixed binned-unbinned data compare/contrast.

As an aside, Combine also can produce unbinned "pseudo-Asimov" datasets to take advantage of asymptotic methods. Is this something done elsewhere? (I am just ignorant here)

> TensorFlow is partially more powerful than JAX

Curious about this!

alexander-held commented 1 year ago

> As an aside, Combine also can produce unbinned "pseudo-Asimov" datasets to take advantage of asymptotic methods.

@nsmith- I'm curious to learn more about this. Is this in the docs?

nsmith- commented 1 year ago

There is a brief discussion here: http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#asimov-datasets

matthewfeickert commented 1 year ago

This topic seems perhaps too broad, and while I expect that during the week it will split out organically across different areas, the areas where I think I'm most likely to spend time discussing are: