NNPDF / reportengine

A framework for declarative data analysis
https://data.nnpdf.science/validphys-docs/guide.html
GNU General Public License v2.0

Set CPU affinity #20

Closed scarlehoff closed 4 years ago

scarlehoff commented 5 years ago

Python libraries (like tensorflow or numpy) have a tendency to open threads and processes without restraint. As a result, running several tasks in parallel can be inefficient, since the tasks end up in unneeded competition for resources.

A solution to this is setting the CPU affinity of the resource-hungry program to ensure it doesn't spill over onto the rest of the computer. This can easily be done with a system call to taskset (triggered by something like a force_single_core: true flag in the runcard).

Pros:

Cons:

@Zaharid, if you are not against it I can have a look at implementing it at the level of reportengine in a more or less robust way. Right now, when I need it, I just hack in the two lines I need:

import os, subprocess

pid = os.getpid()
subprocess.run(['taskset', '-pc', '1', str(pid)])

let me know what you think (or if you know of a better way of doing this). I didn't want to do a PR directly because I am really not sure whether the pros are "pro enough" to compensate for the cons.
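(For reference, the same pinning can be sketched without shelling out to taskset at all, using the Linux-only os.sched_setaffinity; this is an illustrative alternative, not an existing reportengine feature:)

```python
import os

# Linux-only: pin the current process to CPU 0. Anything spawned
# afterwards (via fork or subprocess) inherits the restricted mask.
os.sched_setaffinity(0, {0})

# Read back the mask actually in effect for this process.
mask = sorted(os.sched_getaffinity(0))
print(mask)
```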

Zaharid commented 5 years ago

I am not sure I understand, so I'll write a collection of things that might be related and you tell me if I am making sense.

At the moment the execution of reportengine is horrible in that it never frees memory and runs in a single core by default. As it happens this has almost never been a serious issue in validphys so far, because the underlying cpp code tends to be so much less efficient for the typical workflow that the reportengine runtime could go and drink a few too many beers and complete the task the next day over a hangover. I don't see how the default single core mode has anything to do with cpu affinity.

Then there is the parallel mode, which comes with its own problems. The underlying algorithm is: construct a directed acyclic graph with the task dependencies, get the tasks that have no dependencies, and execute them in parallel; upon completion, add the tasks that become unlocked (if any) to a task queue, from which they are retrieved and executed (in undefined order) once a free cpu becomes available. The main problem with that for validphys is that the resources (~function arguments) are sent around by pickling them (i.e. with Python's native object serialization), which is typically not very efficient, and also not possible for cpp objects like fktables (this is one reason why the .load methods are cached instead of being managed as providers), so we have to load those in each cpu. This makes the parallel mode do a lot more work, so it is not always a win: at the moment you need an fktable-light and plot-heavy scenario.
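(A minimal sketch of that scheduling loop, using the stdlib graphlib.TopologicalSorter and a thread pool; reportengine's real executor works across processes with pickled resources, so this is illustrative only:)

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

def run_dag(tasks, deps, max_workers=4):
    """Run the callables in `tasks` respecting `deps` (node -> set of prerequisites)."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}  # Future -> node name
        while ts.is_active():
            # Submit every task whose dependencies are all satisfied.
            for node in ts.get_ready():
                running[pool.submit(tasks[node])] = node
            # Wait for at least one task to finish, then mark it done,
            # which may unlock further tasks on the next iteration.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                node = running.pop(fut)
                results[node] = fut.result()
                ts.done(node)
    return results
```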

I have been thinking of various ways to improve on this, but have not been very motivated because like I said, we end up being limited by cpp most of the time. One possibility is to introduce some way of simplifying the task graph by collecting nodes that touch the same resources together, so the intermediate heavy objects never have to be serialized, or computed multiple times. Another thing that might go in the direction of what you are saying is to add a way to mark a provider as requiring several (or all) of the cpus for itself (I think you may get nasty deadlocks if that is not done properly, so it may have to be thought a bit) .
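(The "mark a provider as needing all the cpus" idea could be sketched as follows; `exclusive` is a hypothetical per-task flag, not an existing reportengine option. The key point is that the draining happens in the scheduling loop, never inside a worker, which is what avoids the deadlock mentioned above:)

```python
from concurrent.futures import ThreadPoolExecutor, wait

def run_tasks(tasks, max_workers=4):
    """Run (callable, exclusive) pairs, already ordered by dependency.

    Tasks flagged exclusive wait for the pool to drain and then run
    alone, so they can use every cpu for themselves.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        in_flight = []
        for func, exclusive in tasks:
            if exclusive:
                # Drain from the scheduler thread (blocking inside a
                # worker instead would deadlock the pool).
                wait(in_flight)
                results.extend(f.result() for f in in_flight)
                in_flight = []
                results.append(func())
            else:
                in_flight.append(pool.submit(func))
        wait(in_flight)
        results.extend(f.result() for f in in_flight)
    return results
```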

Zaharid commented 5 years ago

Ok, I have read it again and I think I understand better now (I did when I was almost done writing the previous comment but I didn't really want to delete it). I think something like this has to be a command line option (it is not quite like that at the moment, but the idea is that things that affect the "physics" go in the runcard and things that affect how stuff is computed go in the command line). The easiest is to then export the option as a resource, which at the moment uses an incredibly stupid API

https://github.com/NNPDF/reportengine/blob/e3c619862226210c187739616ec73317fc87087c/src/reportengine/environment.py#L115

(you have to subclass Environment and override ns_dump_description; it seemed cute at the time) and then have your functions take that argument. We do this in e.g. vp-setupfit.

Unless I am missing something, I don't think you need to touch reportengine for this.

scarlehoff commented 5 years ago

> At the moment the execution of reportengine is horrible in that it never frees memory and runs in a single core by default. [...] I don't see how the default single core mode has anything to do with cpu affinity. [...] Then there is the parallel mode, which comes with its own problems.

Yes, the problem I am describing comes at a different layer (and has a far simpler solution). Sorry it wasn't clear.

For instance, if you run only one action but that action happens to use tensorflow, it doesn't matter that you are running reportengine on a single core: the moment tensorflow starts working, it will try to use as many cores as it wishes. Setting the CPU affinity at the level of reportengine helps here because all child processes will only know about the one CPU (or more) you let them know about.

> [...] Unless I am missing something, I don't think you need to touch reportengine for this.

Just saw your second comment. Yes, I can do it at the level of n3fit.py like that, but I don't know whether it makes more sense to have it integrated with reportengine so it can be used in future actions that might also use tensorflow-like libraries.

Zaharid commented 5 years ago

> > At the moment the execution of reportengine is horrible in that it never frees memory and runs in a single core by default. [...] I don't see how the default single core mode has anything to do with cpu affinity. [...] Then there is the parallel mode, which comes with its own problems.
>
> Yes, the problem I am describing comes at a different layer (and has a far simpler solution). Sorry it wasn't clear.
>
> For instance, if you run only one action but that action happens to use tensorflow, it doesn't matter that you are running reportengine on a single core: the moment tensorflow starts working, it will try to use as many cores as it wishes. Setting the CPU affinity at the level of reportengine helps here because all child processes will only know about the one CPU (or more) you let them know about.

Is the cpu affinity setting preserved when you run a subprocess? I'd imagine it is when you call fork, but not sure about subprocess.

> > [...] Unless I am missing something, I don't think you need to touch reportengine for this.
>
> Just saw your second comment. Yes, I can do it at the level of n3fit.py like that, but I don't know whether it makes more sense to have it integrated with reportengine so it can be used in future actions that might also use tensorflow-like libraries.

If this is a global environment setting (either something you initialize or some sort of global resource), then it has to go in some Environment class. I don't have a strong opinion on whether that is reportengine or validphys, but the default would be validphys, as reportengine should be as small as possible (ideally smaller than it is now...).

I should also say that I am slowly working on reportengine 1.0, which will break the world (hopefully not (m)any runcards, and only specific parts of validphys). Its main goal is better code and clearer semantics, so you can add more stuff to it more easily.

scarlehoff commented 5 years ago

> Is the cpu affinity setting preserved when you run a subprocess? I'd imagine it is when you call fork, but not sure about subprocess.

This is something I actually don't know, but you can always enforce it in the child processes.
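(For reference, on Linux the affinity mask is inherited across fork() and preserved across exec(), so subprocess children do see it; a quick Linux-only check:)

```python
import os
import subprocess
import sys

# Restrict this process to CPU 0; children inherit the mask.
os.sched_setaffinity(0, {0})

# Ask a child Python process which CPUs it is allowed to run on.
out = subprocess.run(
    [sys.executable, '-c', 'import os; print(sorted(os.sched_getaffinity(0)))'],
    capture_output=True, text=True, check=True,
).stdout.strip()

print(out)
```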

> If this is a global environment setting (either something you initialize or some sort of global resource), then it has to go in some Environment class. I don't have a strong opinion on whether that is reportengine or validphys, but the default would be validphys, as reportengine should be as small as possible (ideally smaller than it is now...).

Oh, I misunderstood. Yes, making it part of vp would also make sense.

Zaharid commented 5 years ago

Btw, another thing to watch out for is how the various libraries allocate concurrent tasks. You may end up in a situation where you get the worst of both worlds: many concurrent tasks and a single core. For example, reportengine uses multiprocessing.cpu_count() (with an override via an environment variable), which I don't believe cares about affinity and the like.

Zaharid commented 5 years ago

In fact I should change it. From the docs

https://docs.python.org/3.7/library/multiprocessing.html?highlight=process

> multiprocessing.cpu_count()
>
> Return the number of CPUs in the system.
>
> This number is not equivalent to the number of CPUs the current process can use. The number of usable CPUs can be obtained with len(os.sched_getaffinity(0)).
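(The difference in practice; in a process whose affinity has been restricted, the second number drops below the first:)

```python
import multiprocessing
import os

# Every CPU in the machine, regardless of what this process may use.
total = multiprocessing.cpu_count()

# Only the CPUs the current process is actually allowed to run on.
usable = len(os.sched_getaffinity(0))

print(total, usable)
```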

scarlehoff commented 5 years ago

> Btw, another thing to watch out for is how the various libraries allocate concurrent tasks. You may end up in a situation where you get the worst of both worlds: many concurrent tasks and a single core. For example, reportengine uses multiprocessing.cpu_count() (with an override via an environment variable), which I don't believe cares about affinity and the like.

Yes, this is why I say it is not always advisable to manually change affinity: you might end up in ugly situations. Tensorflow in particular (which is my interest and test case right now) does care about affinity when spawning new threads.

Zaharid commented 5 years ago

> Yes, this is why I say it is not always advisable to manually change affinity: you might end up in ugly situations. Tensorflow in particular (which is my interest and test case right now) does care about affinity when spawning new threads.

That supports the idea that this should start off as a setting specific to whatever fitting code cares about it, and then become more general once we need it and understand how to do it. Note that setting the total number of CPUs (which looks like an easier problem) is an open and active issue in numpy at the moment: https://github.com/numpy/numpy/issues/11826.

Btw, one thing that would be welcome is a better interface for Environment than the ns_dump_description subclassing nonsense. I am thinking that it could work as follows:

class Environment:
    def provide_XXX(self):
        """Documentation of XXX."""
        return self.XXX #or something

And then have ns_dump_description read all relevant provide_ methods (probably implemented with __init_subclass__ on a base class) and generate the information based on <method>.__doc__. Should be fun.

Zaharid commented 5 years ago

@scarlehoff I was watching something boring the other night so I came up with:

class XBase:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Start from the registry inherited from the parent class, then
        # register every callable provide_* attribute of this subclass
        # under its name with the prefix stripped.
        _nsfuncs = {**getattr(cls, '_nsfuncs', {})}
        for name in vars(cls):
            if name.startswith('provide_'):
                v = getattr(cls, name)
                if callable(v):
                    _nsfuncs[name[len('provide_'):]] = v
        cls._nsfuncs = _nsfuncs

    def ns_dump_description(self):
        # Map each resource name to the docstring of its provider.
        desc = {}
        for k, v in self._nsfuncs.items():
            if hasattr(v, '__doc__') and isinstance(v.__doc__, str):
                s = v.__doc__
            else:
                s = ""
            desc[k] = s
        return desc

    def ns_dump(self):
        # Evaluate every provider for this instance.
        return {k: v(self) for k, v in self._nsfuncs.items()}

class X(XBase):
    def provide_x(self):
        """The description of x"""
        return "x"

which should be close enough.

Zaharid commented 4 years ago

I'm closing this because I think the way forward is clear now.