flav-io / flavio

A Python package for flavour physics phenomenology in the Standard model and beyond
http://flav-io.github.io/
MIT License
71 stars 62 forks source link

Theory citation function #123

Closed MJKirk closed 3 years ago

MJKirk commented 4 years ago

A start on the theory citation function as discussed in #120 - just the basics so far, not to be merged yet.

coveralls commented 4 years ago

Coverage Status

Coverage increased (+0.02%) to 93.886% when pulling 13f3946d41d98e4d28be1c28b8f41337185899ea on MJKirk:theory-citation-function into 7e0502c46098cd675e935fc198d4a82af76bf8ee on flav-io:master.

MJKirk commented 4 years ago

One issue that has occurred to me: In some cases, a lot of the "citation dependence" for a theory prediction lives in the parameters. E.g for an SM prediction of meson mixing, I would normally cite the Inami-Lim loop function, the QCD RG factors eta, the CKM elements, and the decay constants and bag parameters - in flavio the loop function is done in code (and now cited in https://github.com/MJKirk/flavio/commit/8be3df2948370f30606d316d7b853de75296493d), but all the others are just parameters. I would imagine this is true for many other observables. I'm not sure if there's a simple way of dealing with this - I guess what's needed is to add citation info into the parameter data files, but only register it when they get used, not when the parameter files are read.

peterstangl commented 4 years ago

One issue that has occurred to me: In some cases, a lot of the "citation dependence" for a theory prediction lives in the parameters. E.g for an SM prediction of meson mixing, I would normally cite the Inami-Lim loop function, the QCD RG factors eta, the CKM elements, and the decay constants and bag parameters - in flavio the loop function is done in code (and now cited in MJKirk@8be3df2), but all the others are just parameters. I would imagine this is true for many other observables. I'm not sure if there's a simple way of dealing with this - I guess what's needed is to add citation info into the parameter data files, but only register it when they get used, not when the parameter files are read.

One idea would be to have all parameters stored not in a traditional dict, but in some new subclass of dict that registers a reference anytime a certain key is accessed. This could look similar to

class dict_cite(dict):

    def __getitem__(self, key):
        citations.register_key(key) 
        return dict.__getitem__(self, key)

    def get(self, key, *args, **kwargs):
        citations.register_key(key)
        return dict.get(self, key, *args, **kwargs)

But this might also lead to some problems. E.g. if you show the content of such a dict in a Jupyter notebook, it would access all keys and thus register the references for all its items. One work around would be to allow switching on and off the citations.register_key method in such a way that it ignores any calls when it is switched off (e.g. by setting some attribute of the Citations class to False). Then the citations.register_key method could e.g. be switched on and off by the Prediction class such that it would only register the references if a key of the parameter dictionary is accessed by the Prediction class.

DavidMStraub commented 4 years ago

What about flavio.functions.AwareDict?

MJKirk commented 4 years ago

Yes, something exactly like AwareDict would work - a ParameterCitationsAwareDict as it were (a snappier name might be in order).

Regarding the inspire info for the parameters, currently the parameter data is split over 3 files: parameters_uncorrelated, parameters_correlated, and parameter_metadata. We could just put the inspire key into the metadata file, but this seems bad since the references could easily not be updated when the numbers get changed. It seems like it would be better (if more complicated) to merge all these into a single parameter file, much like the single measurements file, and then it will be obvious to update the inspire at the same time as the numerics. (Actually, it looks like very early on it used to be this way and then the file was split in b30aa73b4f26df457d4ffe313c65df45efcddc46, do you remember why you did this @DavidMStraub?)

DavidMStraub commented 4 years ago

Wow, I don't remember that it has ever been a single file.

Splitting them I guess is motivated by metadata populating the attributes of Parameter instances - which are meant not to change - while values populate the ParameterConstraints instances, which can be swapped.

From this angle, it makes sense for citations not to be part of the metadata, but of the ParameterConstraints.

Unfortunately, it will not be entirely trivial to refactor ParameterConstraints to have citations associated with constraints. The implementation with the _constraints list and _parameters dict is suboptimal. If I would implement it today, I would probably add a Constraint class, that associates a single ProbabilityDistrubution with one or multiple parameters. This could then have a citation attribute. The Constraints class would then basically be a container class with a list of Constraints, and various helper methods, e.g. to make sure the collection of constraints is consistent.

peterstangl commented 4 years ago

Thanks, these are all good points! But since they are only indirectly related to the theory citation function, I opened the new issue #124 for further discussion.

MJKirk commented 4 years ago

Notes for myself on what I'm doing

MJKirk commented 3 years ago

So I've now gone through and added citations based arxiv references in the comments / docstrings.

The release notes for v2 mention a few specific papers - for the Higgs physics I've added citations for the paper by David and Adam Falkowski, while for the beta decay and the updated B->D(*) form factors the "new stuff" is basically just in new parameters, which goes back to our discussion earlier about citing parameters (and then became #124).

There are two things which seem like they should come from somewhere specific but have no comments: 1) The ee->WW scattering https://github.com/flav-io/flavio/blob/201a4aa1e1937e5e03d6ded29df76cfa9c14e7fd/flavio/physics/scattering/ee_ww.py#L1 All the specific numbers in the formulae suggest to me they were pulled from somewhere. 2) Neutrino trident https://github.com/flav-io/flavio/blob/201a4aa1e1937e5e03d6ded29df76cfa9c14e7fd/flavio/physics/neutrinos/trident.py#L33 This general expression isn't in the "original" nu trident paper 1406.2332, and it's not obvious to me that everything else drops out such that you only need the wilson coefficients. Was this expression taken from a paper or maybe it is sufficiently simple with a bit of thought.

Both observables were added by @DavidMStraub so he should know.

MJKirk commented 3 years ago

What are the next steps here Peter? Other than the two observables mentioned above, I think there's now a pretty complete coverage of theory papers used. I went through and looked at all the files that don't cite anything, and I'm reasonably happy none of them need to. However, I think it would be worth someone with a better knowledge of the B decay literature than me checking everything in the bdecay folder to see if I've missed things.

I guess at least one more thing to do is proper documentation for this new feature.

peterstangl commented 3 years ago

What are the next steps here Peter? Other than the two observables mentioned above, I think there's now a pretty complete coverage of theory papers used. I went through and looked at all the files that don't cite anything, and I'm reasonably happy none of them need to. However, I think it would be worth someone with a better knowledge of the B decay literature than me checking everything in the bdecay folder to see if I've missed things.

I guess at least one more thing to do is proper documentation for this new feature.

We can still add more references later and I think we don't have to wait with merging until we are sure that we did not miss any. Also, at some point, I would like to work on issue #124, which will allow to include references related to parameters and I think we also don't have to wait for that to merge this PR.

I thought, since one always has to call flavio.default_citations.register(), one could maybe simplify this a bit by defining a function in citations.py like e.g.

def register_citation(inspire_id)
    default_citations.register(inspire_id)

and add this function to flavio/__init__.py such that one can simply add references to the default_citations instance by calling

flavio.register_citation(inspire_id)
MJKirk commented 3 years ago

Okay, great, we can basically wrap this pull request up soon then. I'd agree with the simplification for the citation method, let me clean that up.

peterstangl commented 3 years ago

I was thinking about the following points:

@MJKirk @DavidMStraub what do you think?

DavidMStraub commented 3 years ago

Yes, good points. My idea of default_citations was a bad one. And I agree one should be able to reset them: let's say I want to collect all citations for a given function call in a Jupyter notebook, and then compare it to all citations from another function call.

The other question is whether it's actually a good idea to use a global variable for that. It's not thread safe. That might not be a problem in practice, but it somehow doesn't seem right.

One could also think about a context maneger pattern:

with flavio.citation_scope() as citations:
    # ... do something
    my_citations = list(citations)
peterstangl commented 3 years ago

That's also good points! Thread safety might actually be important since multiprocessing is even used within flavio, e.g. when computing covariance matrices. And it would be quite inconvenient if the citation feature would not work in such a case. Getting all the references for computing a covariance matrix seems to be a reasonable task that should not require computing the covariance matrix on a single thread.

But how can the citation feature be made thread safe? I have never really thought about such problems before but I think it requires storing the references in some object that can be shared between different processes.

One possibility that I found is to use multiprocessing.Manager() to create a shared dictionary. The dictionary's keys can be used like a set and all values can be set to None. A small example code is

import multiprocessing

manager = multiprocessing.Manager()
shared_dict = manager.dict()

def job(string):
    shared_dict[string] = None

pool = multiprocessing.Pool(2)
pool.map(job, ['a', 'b', 'c', 'a'])
pool.close()
pool.join()

Then shared_dict.keys() will return ['a','b','c']. Since we don't need different instances of the Citations class, we can just define a class attribute that is a shared dictionary and only use class methods. citations.py could then look like

import multiprocessing

class Citations:

    _papers_to_cite =  multiprocessing.Manager().dict()

    @classmethod
    def reset(cls):
        cls._papers_to_cite.clear()

    @classmethod
    def register(cls, inspire_key):
        cls._papers_to_cite[inspire_key] = None

    @classmethod
    def to_set(cls):
        return set(cls._papers_to_cite.keys())

    @classmethod
    def to_string(cls):
        return ",".join(cls._papers_to_cite.keys())

__init__.py would than just need the line

from .citations import Citations

and references could be added by simply calling

flavio.Citations.register('my_inspire_key')

I don't know if it's actually a good idea to implement it like this, but a first try works as expected:

import multiprocessing
import flavio

def job(string):
    flavio.Citations.register(string)

pool = multiprocessing.Pool(2)
pool.map(job, ['a', 'b', 'c', 'a'])
pool.close()
pool.join()

Then flavio.Citations.to_set() returns {'a', 'b', 'c'}.

DavidMStraub commented 3 years ago

Ok, I realize I opened a can of worms there, hadn't considered how tricky it actually is to make this thread safe. But this multiprocessing.Manager is a cool idea, I was not aware of it. Certainly more elegant than using a file (which is the other option I could think of). Do you know if it also works when using threading or multiprocess?

Concerning the API, what you are suggesting is a singleton pattern, but I think in this case there is absolutely no benefit in making this a class. It's even confusing as a user could just to citations = flavio.citations.Citations(), but they are not supposed to. Consider the following, essentially equivalent code:

citations.py

_PAPERS_TO_CITE =  multiprocessing.Manager().dict()

def reset(cls):
    _PAPERS_TO_CITE.clear()

def register(cls, inspire_key):
    _PAPERS_TO_CITE[inspire_key] = None

def to_set(cls):
    return set(_PAPERS_TO_CITE.keys())

def to_string(cls):
    return ",".join(_PAPERS_TO_CITE.keys())

It's simply a module, not a class, and we can do

flavio.citations.register('my_inspire_key')
peterstangl commented 3 years ago

Do you know if it also works when using threading or multiprocess?

threading is not a problem for both multiprocess.Manager() and multiprocessing.Manager(). However, unfortunately multiprocessing.Manager() doesn't work with multiprocess and multiprocess.Manager() doesn't work with multiprocessing. But it is possible to switch between different managers during runtime, which would still allow to use e.g. multiprocess instead of multiprocessing (see below).

Concerning the API, what you are suggesting is a singleton pattern, but I think in this case there is absolutely no benefit in making this a class. It's even confusing as a user could just to citations = flavio.citations.Citations(), but they are not supposed to. Consider the following, essentially equivalent code: [...] It's simply a module, not a class, and we can do

flavio.citations.register('my_inspire_key')

I agree that having a class with only class methods was maybe not the best idea. I think it's a very good idea to use a module. But this module can actually be a class instance that has nice features like e.g. properties. This is described e.g. at https://stackoverflow.com/questions/2447353/getattr-on-a-module/7668273#7668273 after "Update" .

I have a working implementation based on this that makes the following things possible:

All of this is achieved with the following citations.py:

from multiprocessing import Manager
import flavio
import sys

class CitationScope(object):

    def __enter__(self):
        self._citations_global = flavio.citations._papers_to_cite
        flavio.citations._papers_to_cite = flavio.citations._manager.dict()
        return flavio.citations

    def __exit__(self, type, value, traceback):
        flavio.citations._papers_to_cite = self._citations_global

class Citations:

    _manager = Manager()
    scope = CitationScope()

    def __init__(self):
        self._papers_to_cite = self._manager.dict()
        self.register("Straub:2018kue")

    def __iter__(self):
        for citation in self._papers_to_cite.keys():
            yield citation

    def __str__(self):
        return ",".join(self._papers_to_cite.keys())

    @property
    def string(self):
        return str(self)

    @property
    def set(self):
        return set(self)

    def register(self, inspire_key):
        self._papers_to_cite[inspire_key] = None

    def clear(self):
        self._papers_to_cite.clear()

    def reset(self):
        self.clear()
        self.register("Straub:2018kue")

    def switch_manager(self, manager):
        self._manager = manager
        self.__init__()

sys.modules[__name__] = Citations()

If we agree that this is actually a good solution for the points we were discussing, I can push my version to the MJKirk:theory-citation-function branch if @MJKirk agrees and doesn't have any uncommited changes there.

DavidMStraub commented 3 years ago

Wow, interesting hack :) Yes, I think it's almost perfect. Only the context manager doesn't convince me, because it modifies the singleton and then back again. I think in this case the proper thing to do would be to create a new instance and leave the global one intact. Then I also wouldn't use a property for the context manager, but a method. Perhaps

with flavio.citations.collect() as citations:

would be an appropriate name to indicate that this collects all citations in its scope.

peterstangl commented 3 years ago

I agree, the context manager was not very convincing. I have implemented your suggestions, which has actually the nice side effect that since the context manager is a method now, it can receive a Manager() instance as an argument. This makes using e.g. multiprocess more convenient:

with flavio.citations.collect(multiprocess.Manager()) as citations:
    pool = multiprocess.Pool(2)
    pool.map(job, ['a','b','c','a'])
    pool.close()
    pool.join()

this manager is also passed on to the context manager used inside the theory_citations method of Observable, which looks like

    def theory_citations(self, *args, **kwargs):
        with flavio.citations.collect() as citations:
            flavio.sm_prediction(self.name, *args, **kwargs)
        return citations.set

I have made also some other improvements, in particular one doesn't lose the references anymore when using the switch_manager method and the initial citations are now passed as an argument to the Citations class instead of hard coding them inside the class. The new version of citations.py looks as follows:

from multiprocessing import Manager
import flavio
import sys
from itertools import chain

class CitationScope:

    def __init__(self, manager=None):
        if manager is not None:
            self._manager = manager
        else:
            self._manager = flavio.citations._manager

    def __enter__(self):
        self._citations_global = flavio.citations
        flavio.citations = Citations(self._manager)
        return flavio.citations

    def __exit__(self, type, value, traceback):
        flavio.citations = self._citations_global

class Citations:

    collect = CitationScope

    def __init__(self, manager, initial_citations=[], copied_citations=[]):
        self._manager = manager
        self._initial_citations = initial_citations
        self._papers_to_cite = self._manager.dict()
        self._papers_to_cite.update(
            {k:None for k in chain(initial_citations, copied_citations)}
        )

    def __iter__(self):
        for citation in self._papers_to_cite.keys():
            yield citation

    def __str__(self):
        return ",".join(self._papers_to_cite.keys())

    @property
    def string(self):
        return str(self)

    @property
    def set(self):
        return set(self)

    def register(self, inspire_key):
        self._papers_to_cite[inspire_key] = None

    def clear(self):
        self._papers_to_cite.clear()

    def reset(self):
        self.clear()
        self._papers_to_cite.update({k:None for k in self._initial_citations})

    def switch_manager(self, manager):
        self.__init__(manager, self._initial_citations, self.set)

sys.modules[__name__] = Citations(Manager(), ["Straub:2018kue"])
MJKirk commented 3 years ago

Okay, having caught up with the discussion it all sounds good, and the new code looks nice to me. I don't have anything uncommitted so feel free to push your code @peterstangl.

peterstangl commented 3 years ago

I have thought about this again and also considered the computing speed of the approach. I noticed that these Manager().dict() objects are actually very slow. In particular, writing a single entry to such a shared dict took me around 20 μs, while the same write to a normal dict took me only 30 ns... Given how often the register() method might actually be called during the computation of theory predictions, I'm afraid the current approach is a bit impractical.

I played around a bit with other shared objects that might be faster. I tried out multiprocessing.Queue, which is considerably faster and takes only 1 μs for writing an entry. However, I ran into a problem described at https://docs.python.org/3/library/multiprocessing.html#multiprocessing-programming under 'Joining processes that use queues' that led to the execution being blocked if pool.join() is called before the queue is emptied. So I think queues cannot help us here.

Another shared object that is even faster is multiprocessing.Array. However, such an array has a fixed length and is not convenient for storing text strings. Nevertheless, I came up with a way of using it and in my implementation, a call to the register() method takes only 600 ns, which might be fast enough. Furthermore, this approach works with multiprocessing, multiprocess, and threading out of the box and there is no need to use Manager(). The drawback of my implementation is that I have to get a set of all possible citations, which I currently do by using a regexp on all .py files in the flavio source folder. This is admittedly a bit ugly. Having this set, I enumerate it and use a boolean array of the same length in which register() switches the entry corresponding to a given inspire id to True. This means that the possible arguments of register() are limited to those contained in the flavio source and register() cannot be called with other arguments by the user. But this I wouldn't consider to be a too big problem since the citation functionality is mostly about adding references to things in the flavio source code.

So in the end, I think we have the following possibilities:

@MJKirk @DavidMStraub what do you think?

DavidMStraub commented 3 years ago

Oh... good that you noticed this speed penalty, would have been a pity to slow things down. To be honest I think the solution with the regex is over-engineering. I suggest to postpone this problem of thread safety and keep it simple for the moment, i.e. with a dict. We can document that citations are not expected to work with sm_covariance.

MJKirk commented 3 years ago

It seems to me that it's not just sm_covariance but (probably all) things with the threads option set != 1 that don't currently work. Which then breaks what I had in mind of being the standard use of this, just calling print(flavio.default_citations) at the end of your code without any other modifications and getting the citations. Taking an example of some simple fit to the flavour anomalies:

import flavio
from flavio.statistics.likelihood import FastLikelihood
from wilson import Wilson
import flavio.plots as fpl

my_obs = (
    ("<Rmue>(B+->Kll)", 1.1, 6.0),
    ("<Rmue>(B0->K*ll)", 0.045, 1.1),
    ("<Rmue>(B0->K*ll)", 1.1, 6.0)
)

L = FastLikelihood(name = "likelihood test", observables = my_obs)
L.make_measurement(threads = 4)

def my_LL(wcs):
    ReC9mu, ImC9mu = wcs
    par = flavio.default_parameters.get_central_all()
    wc = Wilson({"C9_bsmumu" : ReC9mu + 1j* ImC9mu, "C10_bsmumu" : -(ReC9mu + 1j*ImC9mu)},
                scale = 4.8, eft = "WET", basis = "flavio")
    return L.log_likelihood(par, wc)

C9contour_data = fpl.likelihood_contour_data(my_LL, -2, 0, -3.5, 3.5, n_sigma = (1, 2), threads = 4)

print(flavio.default_citations)

The final line just prints Straub:2018kue whereas Straub:2018kue,Straub:2015ica,Gubernari:2018wyi,Seidel:2004jh,Beneke:2004dp,Asatrian:2003vq,Greub:2008cy,Bobeth:2013uxa is what you should get (and do if you change either of the threads=4 to threads=1). As I say, my idea is this easy one-liner would work, and therefore require minimum effort for people to use, so my preference is to come up with some solution such that this does work.

Have you done any "realistic" tests of how much the Manager().dict() approach changes the overall runtime @peterstangl? You say it takes about 20 microseconds per write, how does that compare to the time taken for everything else? I'm just thinking if the overall effect is actually only a 1% increase that isn't actually that bad.

peterstangl commented 3 years ago

Have you done any "realistic" tests of how much the Manager().dict() approach changes the overall runtime @peterstangl? You say it takes about 20 microseconds per write, how does that compare to the time taken for everything else? I'm just thinking if the overall effect is actually only a 1% increase that isn't actually that bad.

%%timeit
flavio.sm_prediction("BR(Bs->mumu)")

yields around 470 μs with the conventional dict() and 570 μs with the Manager().dict() version. So in this case it's an increase by 20%... For the Array() version, one gets around 470 μs like with the dict()

For theory predictions that have to call register() many times, its even worse:

%%timeit
flavio.sm_prediction("<Rmue>(B0->K*ll)", 0.045, 1.1)

yields around 170 ms for both dict() and Array() but 240 ms for Manager().dict(), so the increase is 40%...

Concerning the regexp stuff used by the Array() version, this could be easily avoided if in addition to the register() commands inside the code, one would also have a YAML file that contains all the inspire ids that are used. This YAML file could be automatically generated/updated by a standalone script that uses the regexp stuff, but it could also be manually edited when new citations are added to the code.

MJKirk commented 3 years ago

Okay, yeah, that's too much of a slowdown. Darn.

It seems to me that the Array() + regexp solution, with some preprocessing as you suggest, makes for the most user friendly version -- from their perspective everything works automatically. People adding / editing code (might) have to worry about adding inspire ids to the YAML, but as you say that could be done semi-automatically by some script.

peterstangl commented 3 years ago

OK, I've now pushed a new version using multiprocessing.Array. The list of all possible citations is taken from a YAML file in order to avoid the potentially error-prone extraction using regexp in flavio's main routines. A standalone helper script is included that generates this YAML file from the source code using regex. This script should only be run by people editing the citations in the source code.

@MJKirk @DavidMStraub what do you think about this solution?

MJKirk commented 3 years ago

Generally I'm happy. I don't think it's a problem that you can only cite papers within the main flavio source code, since the "point" of this feature is to cite the source of flavio;s internals. (Although actually if you did want to for some personal reason, you could just add to the citations yaml file manually.)

In terms of some small improvements, should there be a nicer error message if you try and register a paper that isn't in citations.yml? Just like a reminder that new citations need to be added or the script needs to be re-run, whereas right now you just get KeyError.

peterstangl commented 3 years ago

Generally I'm happy. I don't think it's a problem that you can only cite papers within the main flavio source code, since the "point" of this feature is to cite the source of flavio;s internals. (Although actually if you did want to for some personal reason, you could just add to the citations yaml file manually.)

Well in principle it would also be easy to add a class method to flavio.citations that would allow the user to add references. The user-defined references just have to be added to the set flavio.citations.__all_citations and then a new Citations instance has to be created. Something like this could also be done in the context manager (e.g. by passing the user-defined references to collect()). But I'm not sure if that's actually necessary or if the feature should only be meant for citations in the flavio source code.

In terms of some small improvements, should there be a nicer error message if you try and register a paper that isn't in citations.yml? Just like a reminder that new citations need to be added or the script needs to be re-run, whereas right now you just get KeyError.

Yes, good point. I've just added this to the register method.

Apart from this, I think we still need some docstrings and some tests.

MJKirk commented 3 years ago

But I'm not sure if that's actually necessary or if the feature should only be meant for citations in the flavio source code.

Yeah, I'm of the opinion that that is all we need to support, so it's fine as is.

Yes, good point. I've just added this to the register method.

:+1:

Docstrings and some more tests I can work on in the next day or two

peterstangl commented 3 years ago

The Travis CI checks currently fail due to a problem with the new PyPI version of wilson. I have reported this already in issue https://github.com/wilson-eft/wilson/issues/67.

MJKirk commented 3 years ago

Ah okay. Sadly Github doesn't seem to have an emoji for "phew, I'm glad I didn't manage to break everything with one simple commit" :joy:

DavidMStraub commented 3 years ago

when you type :phew it suggests :disappointed_relieved: :disappointed_relieved: but I think :relieved: :relieved: would suit better :wink:

peterstangl commented 3 years ago

@MJKirk I've added a small commit that I still had lying around and that slightly improves the regexp in the extract_citation function. And now since wilson has been fixed, also all test have been passed successfully :grinning:

Is there still something you want to add? I think in principle, this PR looks quite good. @DavidMStraub what do you think?

MJKirk commented 3 years ago

No, there's nothing else from me, I'm happy for you to merge it.

Maybe I can just open an issue to list the couple of observables where I wasn't sure, for future reference rather than them being buried in the middle of this long PR? (It was ee->WW, neutrino trident, and perhaps B decay generally).

peterstangl commented 3 years ago

No, there's nothing else from me, I'm happy for you to merge it.

OK, done!

Maybe I can just open an issue to list the couple of observables where I wasn't sure, for future reference rather than them being buried in the middle of this long PR? (It was ee->WW, neutrino trident, and perhaps B decay generally).

Yes, I think it's a good idea to open a new issue for this!