aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

Limit number of jobs in TOSUBMIT state #88

Open aiida-bot opened 10 years ago

aiida-bot commented 10 years ago

Originally reported by: Nicolas Mounet (Bitbucket: mounet, GitHub: nmounet)


The execmanager should limit the number of jobs launched at the same time (to avoid explosion of the number of jobs in the queue). If too many jobs are already in the TOSUBMIT state, no more jobs should be submitted. The limit has to be defined in some property when one sets up the computer. Then "verdi calculation list" should clearly indicate the jobs not yet submitted, with something like "TOSUBMIT (Waiting)".


aiida-bot commented 10 years ago

Original comment by Giovanni Pizzi (Bitbucket: pizzi, GitHub: giovannipizzi):


Some ideas from #43:

aiida-bot commented 10 years ago

Original comment by Giovanni Pizzi (Bitbucket: pizzi, GitHub: giovannipizzi):


Issue #43 was marked as a duplicate of this issue.

broeder-j commented 6 years ago

Some ideas: by parsing an additional queue-info command of the scheduler, one could extract how many jobs are already in the queue on the computer and what the limit there is. If the queue is full, everything else for this computer should be put in "TOSUBMIT (Waiting)".
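
A minimal sketch of what such a check could look like, for illustration only: it assumes a SLURM scheduler (squeue), and the per-user limit is hard-coded here because the command to read it out differs between sites.

    import getpass
    import subprocess

    # Site-specific per-user limit; ideally this would be read from the scheduler
    # or stored when setting up the computer (an assumption, not an AiiDA feature).
    MAX_QUEUED_JOBS = 100

    def queued_job_count(username: str) -> int:
        """Count the jobs the user currently has in the SLURM queue."""
        # '-h' suppresses the header, so every output line corresponds to one job.
        result = subprocess.run(
            ['squeue', '-h', '-u', username, '-o', '%i'],
            capture_output=True, text=True, check=True,
        )
        return len(result.stdout.splitlines())

    def queue_has_space(username: str) -> bool:
        """Return True if more jobs can be submitted without hitting the limit."""
        return queued_job_count(username) < MAX_QUEUED_JOBS

    if __name__ == '__main__':
        user = getpass.getuser()
        print(f'{user} can submit more jobs: {queue_has_space(user)}')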

sphuber commented 6 years ago

This is part of a bigger issue #995. Discussion should take place in that issue.

szoupanos commented 6 years ago

This issue seems too generic. It is also not clear to me whether there is a problem if many jobs stay in the TOSUBMIT state.

ltalirz commented 6 years ago

This could be addressed in terms of the proposed SUBMISSIONONHOLD state, #270

broeder-j commented 6 years ago

I think issue #995 is different from this one.

This one is about making the daemon aware of queue limits and then somehow dealing with them. The daemon does/should know how many jobs are running on each machine. (In practice you want AiiDA to use your whole HPC resources 24/7, so it has to be aware of when things can be submitted and when not. This cannot be done on the workflow side. Currently this is done manually and with some tricks.)

  1. maybe allow users to configure some DbAuthInfo attributes specific to queues, which can limit the number of jobs for a certain machine/queue (there were discussions about this, and about whether to read out the actual number or to enforce a given number); see the sketch after this list.

  2. or parse it from another scheduler command that can be implemented.
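
A rough sketch of how such per-queue limits could be stored, assuming the aiida-core 2.x ORM: the AuthInfo metadata is used here only as a plausible place to put the value, the key max_jobs_per_queue and the computer label my_cluster are hypothetical, and nothing in the engine currently reads or enforces this.

    from aiida import load_profile, orm

    load_profile()

    LIMIT_KEY = 'max_jobs_per_queue'  # hypothetical key, not interpreted by anything yet

    computer = orm.load_computer('my_cluster')        # assumed computer label
    user = orm.User.collection.get_default()
    authinfo = computer.get_authinfo(user)

    # Store per-queue limits alongside the existing transport parameters.
    metadata = authinfo.get_metadata()
    metadata[LIMIT_KEY] = {'batch': 100, 'debug': 10}  # example queue names and limits
    authinfo.set_metadata(metadata)

    # Later, e.g. before submitting to the 'batch' queue:
    limit = authinfo.get_metadata().get(LIMIT_KEY, {}).get('batch')
    print(f'At most {limit} jobs should be queued in "batch" at any one time.')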

broeder-j commented 6 years ago

To elaborate:

Currently I hack it this way: I limit the jobs submitted to the scheduler per machine to a hardcoded value; the rest stays in TOSUBMIT (to enforce a limit on queued jobs). This value should come from somewhere, and should not be per computer, but per queue.

Code snippet from execmanager.submit_jobs_with_authinfo:

    ...
    qmanager = QueryFactory()()

    # Hardcoded hack (from Jens, for PGI): this value should come from somewhere.
    jobtotal_running = 100

    calcs_queued = qmanager.query_jobcalculations_by_computer_user_state(
        state=calc_states.WITHSCHEDULER,
        computer=authinfo.dbcomputer,
        user=authinfo.aiidauser)

    space_left = jobtotal_running - len(calcs_queued)
    if space_left <= 0:
        space_left = 0

    # Only fetch as many TOSUBMIT calculations as there is space left in the queue.
    calcs_to_inquire = qmanager.query_jobcalculations_by_computer_user_state(
        state=calc_states.TOSUBMIT,
        computer=authinfo.dbcomputer,
        user=authinfo.aiidauser, limit=space_left)

    # Could be done better: sort at least by creation time and submit FIFO.

    # Avoid opening an SSH connection if there are no calculations to submit.
    if len(calcs_to_inquire):
    ...
broeder-j commented 6 years ago

Also, this solution can only account for per-user queue limits... But it at least allows for 24/7, 100% usage. And if anything fails, you can do a resubmit at the workchain level.

sphuber commented 5 years ago

I was just rereading this and realize I do not understand why a scheduler has a job limit. What is the problem with submitting more jobs to the queue? At worst they will just end up in the queue and wait, right? That is the whole point of the scheduler queue. It would make sense if it were possible for the engine, upon learning that one scheduler's queue is full, to submit to another queue, or even another machine, but that would require changing the inputs of the calculation job, which is impossible: one needs to specify which machine and queue are to be used at calculation job launch time.

@broeder-j am I missing something?

unkcpz commented 4 years ago

I encountered a situation which may require limiting the number of jobs in the scheduler. In a scheduler where no priority strategies are set, a newly submitted job has to wait until the previously queued jobs are all done. When I submit jobs using AiiDA, hundreds of jobs wait in the queued state. This results in:

@sphuber Is there a good way to deal with the problem?

I notice that the number of submitted jobs increases with the number of daemon workers. In aiida_core, is there an upper limit on the number of process tasks that a worker can deal with?

sphuber commented 4 years ago

As of yet there is no built-in solution for this, but we are currently working on it. For now the best solution is to limit the number of processes you submit to the daemon yourself. In your submission script you can put the submission in a while loop that, before submitting, checks how many active calculations there currently are and only submits more when needed.
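
For illustration, a minimal sketch of such a submission script; all_my_inputs and MyCalculation are hypothetical placeholders for whatever you want to run, and the threshold and sleep interval are arbitrary.

    import time

    from aiida import load_profile
    from aiida.engine import submit
    from aiida.orm import CalcJobNode, QueryBuilder

    load_profile()

    MAX_ACTIVE = 50      # arbitrary cap on concurrently active calculation jobs
    SLEEP_SECONDS = 600  # how long to wait before re-checking

    def count_active_calcjobs() -> int:
        """Count calculation jobs that have not yet reached a terminal state."""
        qb = QueryBuilder()
        qb.append(
            CalcJobNode,
            filters={'attributes.process_state': {'in': ['created', 'waiting', 'running']}},
        )
        return qb.count()

    # `all_my_inputs` and `MyCalculation` are placeholders for your own process and inputs.
    for inputs in all_my_inputs:
        while count_active_calcjobs() >= MAX_ACTIVE:
            time.sleep(SLEEP_SECONDS)
        submit(MyCalculation, **inputs)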

broeder-j commented 4 years ago

Hi, my five cents: schedulers have job limits for users and projects (user groups) for performance and safety reasons. There are limits on running and on queued jobs (on the computing resources I use these are 100-1000). These can be read out with scheduler commands, but are set by the HPC/HTC resource administrators. User job limits on HPC resources are less problematic than project limits, for obvious reasons. AiiDA does not know about these. If you hit them, all further submitted jobs come back as failed. I don't know if AiiDA (the scheduler plugins) is now smart enough to mark them as submission-failed and try again, so as not to lose the job. I do not mind if thousands/millions of calculations are in my AiiDA queue, so I personally do not like to limit it there, also because I have control there. But for the throughput on the resources I did need to control how many get submitted (a year ago). Before AiiDA 1.0 this also sped up the daemon performance.

Controlling the limit through your calculation/workchain submissions to AiiDA might be hard, since you do not necessarily know how many calculations a workchain produces and when. Also, I am too lazy to wrap every project in 'flow control code'.

broeder-j commented 4 years ago

Also, these limits might be quota dependent, i.e. if you run out of quota the admins allow fewer jobs from you in the queue... running and queued.

giovannipizzi commented 3 years ago

A comment on this old issue.

I think we first need to reduce the memory footprint (see #4603, after re-doing the same tests in the new asyncio daemon), because at the moment, if you submit 1000+ calculations at the same time, they will all allocate a slot, and this will take quite some memory.

I think that the logical steps are:

  1. reducing the memory footprint of each process
  2. benchmarking that it becomes safe to have a much larger number of slots per worker, and increasing the default value
  3. then it becomes possible to submit many jobs at the same time to AiiDA, and we can think of a rate-limiting approach (this issue).

I think 1 and 2 need to be done anyway, to allow for a higher throughput of processes if the compute node can cope with it. For 3, we need to consider whether it is instead easier to provide some functionality "external" to AiiDA that helps keep a maximum rate of submitted workflows (what I do in practice) - because in this way we don't have to touch the complex logic of the engine.

I'm thinking of something that checks whether you can submit, and otherwise sleeps for some time before trying again. The check would be, e.g., the current number of running jobs in AiiDA, the current number of running jobs in a given group, or some other simple recipe. Essentially, you put your submission logic in a while True loop, you have a utility function to get how many more processes you can submit with simple rules (get_available_submission_slots(in_group=None, ...)), then you submit the calculations you want to submit and sleep (e.g. 10 mins). If we manage to combine this with a simple utility function that also tells you (in a general way) which calculations have not yet been submitted (e.g. creating some 'fingerprint' of each calculation in the extras, and checking whether a submitted calculation with that fingerprint already exists in a given group before submitting), this would at least give a practical way to run a lot of jobs with some rate limiting (not perfect, but what most of us do anyway, I think, if we have to run a lot of jobs).
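
For illustration, a rough sketch of the two helpers described above; get_available_submission_slots follows the name in the comment, while already_submitted, the extras key 'fingerprint', and the cap MAX_ACTIVE are arbitrary choices, and none of this is an existing AiiDA API.

    from aiida.orm import Group, ProcessNode, QueryBuilder

    MAX_ACTIVE = 200  # arbitrary global cap on non-terminal processes

    ACTIVE_FILTER = {'attributes.process_state': {'in': ['created', 'waiting', 'running']}}

    def get_available_submission_slots(in_group=None):
        """Return how many more processes may be submitted right now.

        If ``in_group`` is given (a group label), only processes in that group are counted.
        """
        qb = QueryBuilder()
        if in_group is not None:
            qb.append(Group, filters={'label': in_group}, tag='group')
            qb.append(ProcessNode, with_group='group', filters=ACTIVE_FILTER)
        else:
            qb.append(ProcessNode, filters=ACTIVE_FILTER)
        return max(MAX_ACTIVE - qb.count(), 0)

    def already_submitted(fingerprint, group_label):
        """Check whether a process with this 'fingerprint' extra exists in the group."""
        qb = QueryBuilder()
        qb.append(Group, filters={'label': group_label}, tag='group')
        qb.append(ProcessNode, with_group='group', filters={'extras.fingerprint': fingerprint})
        return qb.count() > 0

The submission script would then sit in a while True loop: ask for the available slots, submit at most that many calculations whose fingerprint is not yet present in the group, sleep for ten minutes or so, and repeat.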

kYangLi commented 1 year ago

Hi all,

I've been using AiiDA recently to develop batch computation programs, and I must say, AiiDA is indeed an incredibly useful framework. It leverages the features of the Python language effectively, making high-throughput calculations remarkably straightforward. I wanted to share my thoughts on this particular issue.

I was just rereading this and realize I do not understand why a scheduler has a job limit. What is the problem with submitting more jobs to the queue? At worst they will just end up in the queue and wait, right? That is the whole point of the scheduler queue. It would make sense if it were possible for the engine, upon learning that one scheduler's queue is full, to submit to another queue, or even another machine, but that would require changing the inputs of the calculation job, which is impossible: one needs to specify which machine and queue are to be used at calculation job launch time.

@broeder-j am I missing something?

Regarding @sphuber's concern, I'd like to offer my understanding. I believe limiting the number of job submissions to computational nodes is a necessary feature. Imagine needing to compute certain properties for tens of thousands of materials using software like VASP, abinit, aims, etc. I initiate a "massive" computation task using AiiDA services in a Docker container, employing verdi run. What happens next is noteworthy. Initially, AiiDA might throw a stack overflow error due to Python's maximum recursion/stack limit of 999. Submitting several tens of thousands/hundreds of thousands of workchains in a short span inevitably leads to Python stack overflow. After resolving this issue by adjusting the verdi daemon worker count or dynamically tuning the stack limit, we encounter the next challenge: the remote PBS system might get overwhelmed, because PBS (or any other scheduling system) cannot handle such an enormous number of job submissions. (It's akin to launching a DDOS attack on the remote PBS server. ;-)

To prevent the latter scenario, we need a mechanism to throttle job submissions, such as stopping AiiDA from submitting remote tasks to a machine once the number of submissions exceeds, say, twice the queue capacity.

I've noticed that in the recent 2.4.0/2.4.1 releases of aiida-core, the first "maximum recursion depth reached" issue (#4876) has been addressed by dynamically adjusting the recursion limit. However, I see this as a somewhat compromising solution. A more elegant approach to the stack overflow issue would be to limit the rate at which users can submit workchains, or to introduce internal blocking mechanisms in AiiDA, for instance pausing new process submissions once the number of executing workchains reaches a certain threshold. The latter method may introduce some unpredictability in the results, as we cannot foresee how users will design the call relationships between workchains/calcjobs. Hence, educating users on how to correctly use AiiDA to write high-throughput computational programs seems to be a more appropriate solution, perhaps by adding a section on "How to Properly Handle High Throughput Computing" in the How-To Guides.

In essence, users should be made aware that AiiDA is just a computational workflow framework, much like Django is for building a website. Many context-dependent features require users to implement them within this framework. In my use case, I successfully implemented job submission blocking with the following approach, effectively resolving the stack overflow and "PBS under DDOS attack" issues:

# import nest_asyncio  # Opting for this package presents an alternative solution.
import time

from aiida.engine import append_
from aiida.orm import CalcJobNode, QueryBuilder
from plumpy import ProcessState
...
    def calc_all_poscar(self):
        for _ in range(self.inputs.total_struc_num):
            # Barrier: wait until the number of busy calcjobs drops below the limit.
            while True:
                qb = QueryBuilder()
                qb.append(CalcJobNode, filters={'attributes.process_state': {'in': [
                    ProcessState.CREATED.value, ProcessState.RUNNING.value, ProcessState.WAITING.value]}})
                busy_calcjob_num = qb.count()
                if busy_calcjob_num < self.inputs.conf_max_num_nodes.value:
                    break
                time.sleep(60)
            # Fetch the next POSCAR and submit its calculation.
            poscar = self._get_next_poscar()
            inputs = self.exposed_inputs(VaspAllChain)
            outputs = self.submit(VaspAllChain, poscar=poscar, **inputs)
            self.to_context(respond=append_(outputs))
...

Where cls.calc_all_poscar is a step in the WorkChain.
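
For context, a guess at how this step could be wired into the surrounding WorkChain definition; HighThroughputWorkChain and the results step are hypothetical, VaspAllChain and calc_all_poscar come from the snippet above, and the port types are assumptions.

    from aiida import orm
    from aiida.engine import WorkChain

    class HighThroughputWorkChain(WorkChain):
        """Hypothetical parent WorkChain around the calc_all_poscar step shown above."""

        @classmethod
        def define(cls, spec):
            super().define(spec)
            spec.expose_inputs(VaspAllChain)  # VaspAllChain as in the snippet above
            spec.input('total_struc_num', valid_type=orm.Int)
            spec.input('conf_max_num_nodes', valid_type=orm.Int)
            spec.outline(
                cls.calc_all_poscar,  # the rate-limited submission step shown above
                cls.results,          # hypothetical final step gathering self.ctx.respond
            )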

sphuber commented 1 year ago

Thanks for your detailed comment @kYangLi , much appreciated.

Initially, AiiDA might throw a stack overflow error due to Python's maximum recursion/stack limit of 999. Submitting several tens of thousands/hundreds of thousands of workchains in a short span inevitably leads to Python stack overflow.

First, a quick correction: although you are right that until v2.4 there was a bug in AiiDA that could cause a RecursionError to be thrown, this was not due to the number of calculations. Rather it was due to the fact that we monkeypatch the asyncio system and make the event loop reentrant (using the nest_asyncio library). It is this hack that would cause the call stack to become too deep if the AiiDA processes had many subprocesses. You could hit this error with just a single workflow, as long as it had a deeply nested structure of subprocesses.

That being said, I do agree with your assessment that there is still another problem when launching many processes: both AiiDA and the job schedulers to which the jobs are sent may get overloaded. This is a problem that many users have faced before, myself included. A common workaround was to control submission with a wrapping script, which others have turned into a reusable package (see aiida-submission-controller). This is more or less a solution like the one you have suggested in your post above.

I think your solution is interesting and could be of use, but one should be careful: due to the while loop and the sleep, the daemon worker will be fully blocked during this step. The daemon worker that is running this workchain will be stuck in the calc_all_poscar step until all processes have been submitted. But for new VaspAllChain processes to be submitted, the others need to finish first, and this daemon worker cannot help with that, because it is blocked in calc_all_poscar. You would therefore need at least one more worker to take care of the others. Luckily this is easily done with verdi daemon incr, so as long as your machine can support multiple workers, this should still work.

Nevertheless, it is clear that this is a common problem and it would be great if AiiDA provided a built-in solution to limit the submission of processes. I have thought about this often, but unfortunately, due to the current architecture of AiiDA, implementing such a feature is not trivial. When we submit a process, we simply construct its corresponding node in the database and send a task to the process queue of RabbitMQ. RabbitMQ will send any new process task straight away to any daemon worker that is connected. There is no way to configure RabbitMQ to "throttle" how many tasks are sent out concurrently, so we cannot implement this throttling on the RabbitMQ level. The alternative would be to have the throttling on the daemon worker: as soon as it gets a task to continue a process, it could check the current number of active processes and, if that exceeds the number allowed, "refuse" the task and send it back. But RabbitMQ will immediately send it out again; you cannot tell RabbitMQ to "hold" the task for a while. So this approach would just lead to a ping-pong effect where RabbitMQ keeps sending out the task and the daemon worker keeps sending it back.

The only thing I have been able to think of so far is to have another process queue that acts as a "parking" queue. When a daemon worker gets a task from the normal queue but sees that the maximum number of processes is already active, it could send the task to this other queue to park it. But then there needs to be a mechanism that moves the task back to the normal queue once some other processes have finished and slots are available again. Suffice it to say that this won't be an easy implementation, and there will be plenty of room for bugs that cause processes to get stuck in the parking queue.

Anyway, maybe there are other ways to integrate a solution in AiiDA. I would be very keen to hear your ideas, because it would be a very valuable feature to have.

kYangLi commented 1 year ago

Hi @sphuber,

Thank you for your prompt response. First of all, I appreciate you pointing out the misconception I had earlier. In my previous experiments, I observed that submitting a large number of jobs inevitably led to the RecursionError. I mistakenly assumed it was related to the number of jobs, but as you rightly mentioned, the fundamental cause is the deep nesting of subprocess structures.

I understand the systemic difficulties of job control on the RabbitMQ side. Given that this issue has been open since 2014, I can imagine that you have gone through numerous attempts to address it. Unfortunately, I haven't come up with a straightforward solution to this problem either.

Nevertheless, AiiDA's mature database management, convenient command-line interaction, and robust management system still make it an excellent choice in the high-throughput materials computation field. I once wrote a simple high-throughput computing program in Ruby, but I never thought of utilizing Python features as comprehensively as AiiDA does. The approach of constructing a workflow tree is akin to designing logic for classes, making AiiDA very developer-friendly and the entire high-throughput workflow modular and clear.

Previously, following the mindset of "AiiDA is all you need," I manually introduced a blocking mechanism within the WorkChain. Honestly, this approach is only barely functional. Through extensive testing, I learned that turning a step of the WorkChain outline into a long-running background process may contradict the original design intent of WorkChain outlines. For instance, restarting the verdi daemon re-initializes all updates to self.ctx made in an unfinished outline step, causing unexpected errors, and operations like sleep can delay verdi process kill, etc. What I want to say is, my approach makes the AiiDA service unstable and introduces additional constraints, such as the requirement for "worker quantity greater than 1" that you mentioned. While it might be suitable for personal use, I strongly discourage its use in a production environment.

[Diagram: AiiDA-DS.drawio]

Coincidentally, some of my recent ideas align with the implementation approach you provided in aiida-submission-controller. My new approach is illustrated in the diagram above. Based on the existing features of aiida-core, I find the following high-throughput computing method inappropriate: constructing a large high-throughput task as one massive WorkTree and then controlling and scheduling jobs internally in the workchain. The tight integration of RabbitMQ with the workers and the complexity limitations of the process tree make aiida-core more suitable for submitting an individual job for each material/computation task.

In other words, the current features of aiida-core make it more suitable as a material calculator (small-scale scheduling) than as a large-scale job scheduler. We feed it one material at a time, and it performs the corresponding calculation based on its internal task tree and machine node configuration, returning the desired results. Large-scale job scheduling can be handled in a module I call aiida-navigator. This module reads information from the AiiDA computer nodes and the material pool, decides whether to conditionally submit new aiida-core computing tasks, and aiida-core automatically fetches a structure from the material pool for computation. By executing the aiida-navigator periodically (scheduled in crontab or via a bash/python script), we can achieve real-time updates of new materials from the material pool into AiiDA's computational chain. A minimal sketch of one such navigator pass is given after the list of advantages below.

Apart from some compromises, this approach has several advantages:

1. Each total workchain tree in AiiDA is responsible for calculating a single material, preventing the computation tree from becoming too large and deep.
2. The outer navigator can handle how materials are fetched from the material pool, the form of the material pool (database, web API, local folder), or even how to balance HPC loads. The inner workchain only needs to focus on its own calculations.
3. Since each task is independently submitted, even if polling is suddenly interrupted (due to power outage or other unstable factors), the loss incurred is minimal.
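
A minimal sketch of what one pass of such a navigator could look like; get_next_structures() and MaterialWorkChain are hypothetical placeholders for the material pool and the per-material workchain, and the cap on active top-level workchains is arbitrary.

    """One pass of a hypothetical aiida-navigator, run periodically (e.g. from crontab)."""
    from aiida import load_profile
    from aiida.engine import submit
    from aiida.orm import QueryBuilder, WorkChainNode

    load_profile()

    MAX_ACTIVE_TREES = 20  # arbitrary cap on concurrently running top-level workchains

    def count_active_top_level_workchains() -> int:
        """Count non-terminal workchains that have no calling parent."""
        qb = QueryBuilder()
        qb.append(
            WorkChainNode,
            filters={'attributes.process_state': {'in': ['created', 'waiting', 'running']}},
        )
        return sum(1 for (node,) in qb.iterall() if node.caller is None)

    def main():
        free_slots = MAX_ACTIVE_TREES - count_active_top_level_workchains()
        if free_slots <= 0:
            return
        # get_next_structures() and MaterialWorkChain stand in for the material pool
        # and the per-material workchain described above.
        for structure in get_next_structures(limit=free_slots):
            submit(MaterialWorkChain, structure=structure)

    if __name__ == '__main__':
        main()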

Having said that, I revisited the implementation principles of aiida-submission-controller and found that it aligns surprisingly well with the description above. Thanks to the authors, and bravo ;-). (Perhaps next time, before putting hands on keyboard, I should conduct a more detailed review.)