RFE: Add -wait option to pqact PIPE actions

akrherz commented 9 years ago

Hi Unidata/Steve,

I would really like to see a "-wait" or equivalent option added to PIPE/EXEC actions to effectively limit the number of processes one pqact could have active at one time. This flag would cause pqact to not recycle the slot until that process has exited. I think I discussed this with you many years ago and you were not enthusiastic about it as misbehaving/naughty processes could wedge up and effectively jam up pqact as well as pqact waits for these processes to exit...

The issue is that any process using this one-product-per-execution model could effectively DoS a system, as pqact execs one process per product received. Starting up LDM after a considerable downtime is one example. Another is products that arrive in rapid succession...

Currently, users have two options:

1. Allow their process to handle more than one product on stdin, effectively making it long running.
2. Add some locking mechanism that checks to see if others like it are currently running and then sleeps for a bit waiting for those to exit.
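
As a rough illustration of the first option, here is a minimal sketch of a long-running decoder loop. It assumes each product arrives framed by SOH (`\x01`) and ETX (`\x03`) control characters, which is typical for WMO text bulletins but should be verified for your feed, and which would not hold if the pqact entry strips control characters before piping. The function name `iter_products` is hypothetical.

```python
SOH = b"\x01"  # assumed start-of-product marker
ETX = b"\x03"  # assumed end-of-product marker


def iter_products(stream, chunk_size=4096):
    """Yield each complete product found in a binary stream.

    Bytes after the last ETX are held back until more data arrives,
    so a partially received product is never yielded.
    """
    buf = b""
    while True:
        # read1 returns as soon as some data is available instead of
        # waiting for a full chunk, which matters on a pipe.
        chunk = stream.read1(chunk_size)
        if not chunk:
            break  # writer closed its end of the pipe
        buf += chunk
        while True:
            end = buf.find(ETX)
            if end == -1:
                break
            product, buf = buf[:end], buf[end + 1:]
            start = product.find(SOH)
            if start != -1:
                product = product[start + 1:]
            product = product.strip()
            if product:
                yield product
```

A decoder would then loop with something like `for product in iter_products(sys.stdin.buffer): handle(product)`, where `handle` stands in for whatever per-product work the script already does.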

I personally loathe option 2 as having potentially hundreds of scripts writing lock files and sleeping is a race condition waiting to happen. I have written lots of processes that do option 1, but not all are well suited for it. For example, satellite data processors.

A nice aspect of this is that pqact could then log non-zero exit statuses from these '-wait' processes, which would help users debugging this. Perhaps some other logging would already kick in if pqact had no available slots over some given amount of time; I am unsure of that one.

I think a reasonable expectation is for '-wait' to imply a '-close' as well. I'd be happy to provide feedback if there are other edge cases you anticipate. Thanks for your consideration :)

semmerson commented 9 years ago

@akrherz I'm not sure I understand what you're asking, exactly. So I'll explain a few things and then ask a question.

The number of active decoder processes that a pqact(1) process can have is limited by the maximum number of open file-descriptors a process may have.
The ulimit(1) utility can be used to lower the maximum number of open file-descriptors.
pqact(1) doesn't recycle a file-descriptor
- Unless the -close option is specified; or
- The associated decoder process exits; or
- The maximum number of file-descriptors is open and a new file-descriptor must be opened, in which case the least-recently-used one is first closed.
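
The recycling rules above can be sketched as a toy model. This is not pqact's actual code; the class and method names are hypothetical, and "closing" a slot here only means that the parent closes its end of the pipe. Nothing in the model stops the child process from continuing to run after its slot is recycled.

```python
from collections import OrderedDict


class SlotTable:
    """Toy model of a fixed pool of open pipes, least recently used first."""

    def __init__(self, max_open):
        self.max_open = max_open
        self.open = OrderedDict()  # command -> True, oldest use first

    def write(self, command, close=False):
        """Pipe a product to command; return the command evicted, if any."""
        evicted = None
        if command in self.open:
            # Reuse the existing pipe and mark it most recently used.
            self.open.move_to_end(command)
        else:
            if len(self.open) >= self.max_open:
                # Pool is full: close the least recently used pipe first.
                evicted, _ = self.open.popitem(last=False)
            self.open[command] = True
        if close:
            # Models -close: the pipe is closed right after the write.
            del self.open[command]
        return evicted

    def child_exited(self, command):
        """Free the slot of a decoder that has exited."""
        self.open.pop(command, None)
```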

What, exactly, is the problem you're encountering? What are the symptoms?

akrherz commented 9 years ago

@semmerson Thanks for your response. I thought pqact had a 32 pipe/exec/file limit for number of 'child' (bad word, but I can't think of a more proper one) processes. So if pqact has 32 active things going and LDM receives an additional product that requires a 'new' process, it will look to recycle one of those 32 active things and close its PIPE/IO to that process. If that process is still doing computation, then pqact could effectively launch many such processes. This was sort of based on our discussion we had about the need to split NEXRAD2 into multiple pqacts to get <32 radars per pqact.

Let's try an example pqact entry:

IDS|DDPLUS      /p((MAV|MEX|MET)...)
        PIPE    -close  -strip  python pyWWA/parsers/split_mav.py

When a rapid succession of MOS products arrives on LDM, I have seen pqact have tens to perhaps 100s of these split_mav.py processes going at once. I would like to have the option of pqact -waiting for this split_mav.py to complete before attempting to fire up another. Yes, I could modify this code to make it accept multiple products in one iteration, but this is a modest example. The satellite data processor I have is a more compute-intensive case.

The ulimit option probably won't work as I don't want to choke down on ldmd as well, but perhaps I could make that option work. I also have other LDM processes running that need many file descriptors as well, which would be tricky to account for.

semmerson commented 9 years ago

@akrherz 'Child' process is exactly the right word.

The maximum number of open file-descriptors is revealed by the command ulimit -n. The minimum value is 32 according to the Unix standard.
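
The same per-process limit can also be read from inside a decoder. A small sketch using Python's standard `resource` module (Unix only); the helper name `open_file_limits` is hypothetical:

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

The soft value is the one actually enforced for the current process; either value may be `resource.RLIM_INFINITY` on systems with no cap.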

Python can be orders of magnitude slower than C. See this revealing graphic.

Why do you want to limit the number of active split_mav.py processes?

akrherz commented 9 years ago

@semmerson Seriously? You deem it necessary to point out that python is slower than C? I am some sort of idiot here? Just close the ticket, enough.

semmerson commented 9 years ago

@akrherz I'm sorry I upset you. I'm just trying to understand your problem and consider all options. I only pointed out that Python is relatively slow because the number of LDM decoders in your scenario will depend on their speed; consequently, faster decoders are an alternative solution to your problem. It might not be the best solution, however. I'm certainly open to a -wait option for the PIPE action; I just don't know if it's the best solution due to its side-effects -- which is why I'm writing out loud (so to speak).

I assume the number of split_mav.py processes on your system can be a problem. Is this because it reduces interactivity to an unacceptable level?

akrherz commented 9 years ago

@semmerson The temp file locking/sleep code that chiz wrote because of this issue was against GEMPAK processes, so this issue is not limited to Python or other 'slow' processes.

I did some more investigation of this and see there is no 32 FILE/PIPE/EXEC limit as I thought. Okay, I am educated on that point!

I found our previous email discussion on this from Nov 23, 2011. The request was to add -wait to PIPE so that exit statuses would be properly reported via LDM. Your last email on it was:

It wouldn't be different -- and that "-wait" option for EXEC is very
dangerous. Its only redeeming quality is that EXEC actions tend to be
locally contrived ones -- so the user has explicit knowledge of how long
they'll take -- whereas the PIPE actions are often (if not usually) to
third-party decoders that the user didn't write and doesn't know how
long they'll take.

akrherz commented 9 years ago

@semmerson I had a bad thought about this. Would adding such an option cause pqact to 'block' as it waited for that -wait process to finish, so that it can PIPE a new product to it? It would not be able to do any other actions until this -wait process finished? If so, this would not be a good idea!

semmerson commented 9 years ago

My comment on the difference between a -wait option for EXEC and PIPE actions is correct.

Yes, a -wait option for the PIPE action would cause pqact(1) to block and not process any more data-products in the interim. This is the unwanted side-effect I mentioned previously. It might be possible to mitigate this effect by restricting the pqact(1) process to only those data-products for which the -wait option would be appropriate. (Yet another thing to consider.)
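
To make the cost of blocking concrete, here is a toy calculation. It does not model pqact's internals, and the arrival times and run time are made-up numbers. It compares when each decoder finishes if the dispatcher waits for the previous decoder to exit against starting each decoder on arrival.

```python
def finish_times(arrivals, run_seconds, wait):
    """Return the time at which each product's decoder finishes.

    wait=True:  the dispatcher blocks until the current decoder exits
                before it handles the next product.
    wait=False: every decoder starts as soon as its product arrives.
    """
    finished = []
    free_at = 0.0  # time at which a waiting dispatcher is next available
    for arrival in arrivals:
        start = max(arrival, free_at) if wait else arrival
        end = start + run_seconds
        if wait:
            free_at = end
        finished.append(end)
    return finished
```

With products arriving at 0, 1 and 2 seconds and a 5 second decoder, the non-waiting case finishes at 5, 6 and 7 seconds, while the waiting case finishes at 5, 10 and 15. In the waiting case, any unrelated product that arrives in the meantime is held up behind that backlog as well.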

What, however, is the problem caused by a relatively large number of split_mav.py processes on the system in question?

akrherz commented 9 years ago

@semmerson Ah, yeah, I am not getting warm fuzzies about this request anymore.

When each split_mav.py process makes web service API requests to a service provider that only allows a certain number of simultaneous requests, I need to be able to limit the number of simultaneous split_mav.py processes that can be active at any time. In the case of GEMPAK, there are shared memory issues / bugs with number of simultaneous GEMPAK processes that can run at one time under one user.

I am fine with rejecting this RFE given the blocking issue you just commented on, that would be a problem.

semmerson commented 9 years ago

If the problem is that a service external to the LDM can only handle a limited number of requests, then a possible solution would be to have the relevant decoder use a semaphore that was initialized to this number. Can this be done in Python?

akrherz commented 9 years ago

@semmerson Sure.

semmerson commented 9 years ago

That might be the best solution, then, because it wouldn't cause pqact(1) to block yet access by the decoders to the limited resource would be controlled.
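
A minimal sketch of that idea, with one substitution stated plainly: rather than a true POSIX semaphore (which the Python standard library does not expose by name across unrelated processes), this emulates a counting semaphore with N lock files and `fcntl.flock`. The kernel drops each lock when its holder exits, so a crashed decoder cannot leave a stale lock behind. The function names, the lock directory, and the polling interval are all hypothetical.

```python
import contextlib
import fcntl
import os
import time


def acquire_slot(lock_dir, max_slots):
    """Try to lock one of max_slots files; return its fd, or None if all are busy."""
    os.makedirs(lock_dir, exist_ok=True)
    for slot in range(max_slots):
        path = os.path.join(lock_dir, f"slot{slot}.lock")
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
        try:
            # Non-blocking exclusive lock: fails at once if another holder exists.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            continue
        return fd
    return None


@contextlib.contextmanager
def limited_slot(lock_dir, max_slots, poll_seconds=0.5):
    """Block until one of max_slots slots is free, and hold it for the with-block."""
    fd = acquire_slot(lock_dir, max_slots)
    while fd is None:
        time.sleep(poll_seconds)
        fd = acquire_slot(lock_dir, max_slots)
    try:
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

A decoder would wrap only the rate-limited call, for example `with limited_slot("/tmp/mav_slots", 4): call_service()`, where the path, the count of 4, and `call_service` are placeholders. pqact itself never blocks; only the decoders queue for the limited resource.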

Unidata / LDM

RFE: Add -wait option to pqact PIPE actions #34