Closed rmandelb closed 6 years ago
Searching for that exception, I get comments about the main job ending before all the workers are finished, which doesn't make much sense to me, but maybe there is a bug in how the multiprocessing is working that just hasn't materialized before. Not sure.
Could you post (or send me) the config file you're using? That will help give me the context to understand where this is coming from. (I assume you are using the multiple real_galaxy inputs that we discussed yesterday.)
Also, did you try running it with nproc=1 to make sure that there's not some other bug that is just being hidden behind this cryptic message when it happens in a multiprocessing context?
It might also help to look through the prior log for error reports that could be relevant (maybe running with higher verbosity). Possibly something else is happening earlier that would explain why the pipes are being shut down before the workers are done.
@rmjarvis - Thanks. I just sent you the config file. Important: use the one from the second e-mail. (The first one was from before I debugged; I hadn't pulled the fixed version to my laptop before sending.)
I did not try running with nproc=1 until now, but I just set up one of the jobs that failed to run with output.nproc=1 and image.nproc=1, and higher verbosity.
Oh crud, I think I might have realized part of the problem.
In the past, I used to run on nodes that had 16 cores. When submitting each job using PBS, I would tell it to give the job 16 cores. In practice, that meant PBS would assign each job to a different node, and each job would therefore have exclusive use of all 16 cores. GalSim would recognize there were 16 accessible cores on the node that the job was assigned to, and automatically use nproc=16 (I did not set nproc explicitly).
This time, I tried using some higher-memory nodes, but those have 64 cores. I was still submitting in a way that told PBS that I only need 16 cores per node, so it would assign 4 of my jobs per node. However, GalSim would recognize 64 cores on the node, and each of those jobs was told to use all 64 cores. Could that be part of the problem? Or should it be able to handle that situation gracefully?
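For reference, the nproc=-1 auto-detection presumably resolves to something like Python's multiprocessing.cpu_count(), which reports every core physically on the node regardless of what PBS actually granted the job. A minimal sketch of that behavior (detect_nproc is a hypothetical name for illustration, not GalSim's actual code):

```python
import multiprocessing
import os

def detect_nproc(requested=-1):
    """Hypothetical sketch of nproc auto-detection.

    multiprocessing.cpu_count() reports every core on the node (64 on
    the high-memory nodes), even when PBS only granted 16 of them.  On
    Linux, len(os.sched_getaffinity(0)) would respect a CPU binding,
    but only if the scheduler actually pins the job to its cores.
    """
    if requested > 0:
        return requested
    return multiprocessing.cpu_count()
```

So with 4 jobs per 64-core node, each job would auto-detect 64 and the node ends up oversubscribed by 4x.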
(comment above edited for clarity)
Using nproc = -1 is supposed to be for when you have the whole node. (On the bnl system, I usually use "bynode" to make sure I get the whole node, whether it has 8 or 12 cores/node.) So it sounds to me like it's probably getting overloaded running 64 x 4 processes and something terrible is happening.
Probably there should be a more graceful way to handle the crash (maybe wrapping the main part of the MultiProcess function in process.py in a try/except/finally block and making sure to join all the workers before reraising the exception).
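A minimal sketch of that idea, with toy worker/task code standing in for the real MultiProcess function (none of these names are GalSim's actual internals):

```python
import multiprocessing

def worker(task_queue, results):
    # Pull tasks until the None sentinel arrives; a toy stand-in for
    # GalSim's worker loop.
    while True:
        task = task_queue.get()
        if task is None:
            break
        results.put(task * 2)

def multi_process(njobs, nproc, fail=False):
    """Sketch of the suggested structure: whatever happens in the main
    loop, the finally block sends every worker its sentinel and joins
    it before the exception is re-raised, so no orphaned processes are
    left holding half-open pipes."""
    task_queue = multiprocessing.Queue()
    results = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(task_queue, results))
             for _ in range(nproc)]
    for p in procs:
        p.start()
    raise_error = None
    try:
        for k in range(njobs):
            task_queue.put(k)
        out = [results.get() for _ in range(njobs)]
        if fail:
            raise OSError("Failed somehow.")  # stand-in for a real failure
    except Exception as e:
        raise_error = e
    finally:
        # Always shut the workers down cleanly, even on error.
        for _ in procs:
            task_queue.put(None)
        for p in procs:
            p.join()
    if raise_error is not None:
        raise raise_error
    return out
```

With this shape, a failure in the main loop still propagates, but only after every worker has been told to stop and has been joined.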
I'll think about how I can mock up a processing failure and make sure that works. Do you see in the output what the original error was? It probably got caught and reported, but then the code kept going, so if you look through the output, there might be something about an out-of-memory error or similar that killed a process?
But I don't think I can have it automatically detect that there are 3 other GalSim jobs running and only use ncpu/4 on each. You'll need to either set the number of processes to 16 explicitly for each job, or make sure you get the whole node.
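For the explicit route, the override is just the two fields mentioned earlier in the thread (a config fragment, shown here for illustration and not tested against a full sim config):

```yaml
# Match the PBS request (16 cores) instead of auto-detecting with -1:
output:
    nproc: 16
image:
    nproc: 16
```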
Yep, nproc=-1 was just fine when I was running on our 16-core intel nodes. I just forgot that I had used nproc=-1 when I started submitting to these other nodes. User error. :) It would be nice if this could be handled more gracefully/obviously, but this was definitely my fault.
In the meantime a bunch of nodes have opened up, and I've started rerunning in a way that should be safe with respect to not over-using CPUs. Will have to monitor the situation.
Do you see in the output what the original error was? It probably got caught and reported, but then the code kept going,
From what I can tell, this is not happening. In general there are statements like "Object 1141194: Too many exceptions/rejections for this object. Aborting."
but in the output, there is no prior mention of the object in question. Maybe I will learn more in the new job I launched with nproc=1 and a higher level of verbosity, but maybe not, since the error probably won't occur in that case.
I think the "Too many exceptions" was when it started hitting that IOError and retrying, but it never fixed itself, so it hit the max_retry limit and just aborted. I was hoping there was some error before those that got reported, but maybe things just died silently (at least as far as Python could tell).
I think I fixed it to at least report the error more sensibly. I added the try/except/finally block I suggested and ran demo9 with a hand-coded error after it received the result for file==1. Here is what it looks like now:
$ galsim demo9.yaml
Using config file demo9.yaml
Using 8 processes for file processing
Start file 0 = output_yaml/nfw1/cluster0000.fits
Start file 1 = output_yaml/nfw1/cluster0001.fits
Start file 2 = output_yaml/nfw1/cluster0002.fits
Start file 3 = output_yaml/nfw1/cluster0003.fits
Start file 4 = output_yaml/nfw1/cluster0004.fits
Skipping file 5 = output_yaml/nfw2/cluster0000.fits because output.skip = True
Start file 7 = output_yaml/nfw2/cluster0002.fits
Start file 8 = output_yaml/nfw2/cluster0003.fits
Start file 6 = output_yaml/nfw2/cluster0001.fits
Start file 9 = output_yaml/nfw2/cluster0004.fits
Process-8: File 7 = output_yaml/nfw2/cluster0002.fits: time = 4.313154 sec
Caught a fatal exception during multiprocessing:
OSError('Failed somehow.',)
Traceback (most recent call last):
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/process.py", line 884, in MultiProcess
raise OSError("Failed somehow.")
OSError: Failed somehow.
Traceback (most recent call last):
File "/home/mjarvis/bin/galsim", line 275, in <module>
main()
File "/home/mjarvis/bin/galsim", line 256, in main
except_abort=args.except_abort)
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/process.py", line 717, in Process
except_abort=except_abort)
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/output.py", line 144, in BuildFiles
except_abort = except_abort)
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/process.py", line 914, in MultiProcess
raise raise_error
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/process.py", line 884, in MultiProcess
raise OSError("Failed somehow.")
OSError: Failed somehow.
This is Python 3, so the previous behavior was slightly different from what you saw, but I think equivalent. Lots of crap along the lines of:
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/output.py", line 454, in buildImages
image = galsim.config.BuildImage(base, image_num, obj_num, logger=logger)
Traceback (most recent call last):
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/image.py", line 230, in BuildImage
image, current_var = builder.buildImage(cfg_image, config, image_num, obj_num, logger)
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/image_scattered.py", line 117, in buildImage
self.nobjects, base, logger=logger, obj_num=obj_num, do_noise=False)
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/stamp.py", line 106, in BuildStamps
except_func = except_func)
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/process.py", line 921, in MultiProcess
except_func(logger, None, k, e, tr)
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/stamp.py", line 96, in except_func
logger.error(s0 + 'Exception caught when building stamp %d', obj_num)
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/process.py", line 350, in error
if self.logger and self.logger.isEnabledFor(logging.ERROR):
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/process.py", line 801, in worker
logger.debug('%s: Caught exception: %s\n%s',proc,str(e),tr)
File "<string>", line 2, in isEnabledFor
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/process.py", line 338, in debug
if self.logger and self.logger.isEnabledFor(logging.DEBUG):
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/managers.py", line 716, in _callmethod
conn.send((self._id, methodname, args, kwds))
File "<string>", line 2, in isEnabledFor
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/connection.py", line 206, in send
self._send_bytes(ForkingPickler.dumps(obj))
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/managers.py", line 716, in _callmethod
conn.send((self._id, methodname, args, kwds))
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/connection.py", line 206, in send
self._send_bytes(ForkingPickler.dumps(obj))
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/process.py", line 801, in worker
logger.debug('%s: Caught exception: %s\n%s',proc,str(e),tr)
File "/home/mjarvis/lib/python3.5/site-packages/galsim/config/process.py", line 338, in debug
if self.logger and self.logger.isEnabledFor(logging.DEBUG):
File "<string>", line 2, in isEnabledFor
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/managers.py", line 716, in _callmethod
conn.send((self._id, methodname, args, kwds))
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/connection.py", line 206, in send
self._send_bytes(ForkingPickler.dumps(obj))
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/home/mjarvis/anaconda3/lib/python3.5/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
So if you're willing to trigger the failing behavior again, maybe you could confirm that it at least reports the underlying problem more appropriately now.
On branch #876, btw. cf. 08a8349
Thanks. I actually think there is a bit more going on here, because I noticed that one of the nodes that has 20 processors (and just one of my jobs taking all 20 CPUs, nothing else on there) was stalled. It was supposed to produce 80 subfields, so it farmed the first 20 images out to separate CPUs. 18 of the 20 were produced in about 2 hours, and there was no output in the subsequent 16 hours. I logged directly into the compute node where this job was running, and found ~35 galsim processes, but all but two of them were stalled. So it seems like it was just sitting around waiting on the last 2 rather than starting more images. Is that supposed to happen?
I did restart several of the processes with the version of GalSim on branch #876, in hopes that either it'll start working, or we'll get better diagnostic outputs in case it's still having whatever exception thrown that caused the trouble.
Even worse: one of the jobs on a node with 16 CPUs was producing no output at all after 6 hours. I logged directly into the compute node and found 16 galsim processes that had all been frozen for several hours; when I ran "top" there was nothing active at all. So I don't think this could have been a matter of it waiting for just one process to finish. Here nothing was active, yet it was just sitting around.
@rmjarvis - One of the jobs in which I used the new branch and had a higher verbosity just died. The end looks like this:
Object 790000: Caught exception [Errno 32] Broken pipe
This is try 1/20, so trying again.
[... the same "Caught exception [Errno 32] Broken pipe" / "This is try N/20, so trying again." pair repeats for every try through 19/20 ...]
Object 790000: Caught exception [Errno 32] Broken pipe
Object 790000: Too many exceptions/rejections for this object. Aborting.
Exception caught when building stamp 790000
Traceback (most recent call last):
File "/home/rmandelb/software/lib/python2.7/site-packages/galsim/config/process.py", line 922, in MultiProcess
result = job_func(**kwargs)
File "/home/rmandelb/software/lib/python2.7/site-packages/galsim/config/stamp.py", line 324, in BuildStamp
prof = builder.buildProfile(stamp, config, psf, gsparams, logger)
File "/home/rmandelb/software/lib/python2.7/site-packages/galsim/config/stamp_ring.py", line 97, in buildProfile
gal = galsim.config.BuildGSObject(base, 'gal', gsparams=gsparams, logger=logger)[0]
File "/home/rmandelb/software/lib/python2.7/site-packages/galsim/config/gsobject.py", line 187, in BuildGSObject
gsobject, safe = build_func(param, base, ignore, gsparams, logger)
File "/home/rmandelb/software/lib/python2.7/site-packages/galsim/config/input_real.py", line 38, in _BuildRealGalaxy
galsim.config.SetDefaultIndex(config, real_cat.getNObjects())
File "<string>", line 2, in getNObjects
File "/opt/python27/lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe
So... just a broken pipe over and over again for the very first object in the file?
... but when I searched for exceptions earlier, there are sections like this:
Object 390104: Caught exception [Errno 104] Connection reset by peer
Object 450124: Caught exception
This is try 2/20, so trying again.
Object 160194: Caught exception [Errno 104] Connection reset by peer
Object 360102: Caught exception
Object 300266: Caught exception got end of file during message
Object 200230: Caught exception
Object 380072: Caught exception
This is try 1/20, so trying again.
Object 290210: Caught exception got end of file during message
Object 410206: Caught exception got end of file during message
Object 430096: Caught exception
Object 250236: Caught exception [Errno 104] Connection reset by peer
Object 420082: Caught exception got end of file during message
Object 220160: Caught exception [Errno 104] Connection reset by peer
And after that it's just a storm of errors of that sort along with the broken pipe one.
I've been googling, and I don't really understand any of the related cases I found. It seems like these errors can happen when doing some kind of SSL connection. So maybe your cluster is doing something like that to access the real_galaxy_catalog? And for some reason the connection can go down and make everything start crashing? Not really sure.
You could try running with verbose=3. And maybe first add some logger.debug statements in the _BuildRealGalaxy function in input_real.py, since that's my best guess for where this might be crashing.
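A toy version of that instrumentation; FakeCatalog and build_real_galaxy are stand-ins rather than GalSim code, and the point is just to bracket the proxy call that appears in the traceback with debug output so the log shows exactly how far things got:

```python
import io
import logging

# Capture debug output in a string so this example is self-contained;
# in a real run this would be the job's log file.
log_stream = io.StringIO()
logger = logging.getLogger("rgc_debug_demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(log_stream))

class FakeCatalog:
    """Stand-in for the RealGalaxyCatalog manager proxy."""
    def getNObjects(self):
        return 5

def build_real_galaxy(real_cat, index):
    # Bracket the suspect proxy call with debug output: if the pipe to
    # the manager breaks, the last log line shows which call died.
    logger.debug('obj %d: about to call getNObjects()', index)
    n = real_cat.getNObjects()   # the call in the traceback above
    logger.debug('obj %d: getNObjects() returned %d', index, n)
    return index % n
```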
BTW, once it gets a broken pipe, the behavior you're seeing where it retries 20 times and then fails makes sense. I just don't know why the pipe is breaking.
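The retry behavior in the log can be sketched like this (build_with_retries is illustrative, not GalSim's actual code; the real limit comes from the config's retry settings):

```python
def build_with_retries(build, max_retries=20):
    """Illustrative retry loop: a transient error can heal on a later
    try, but a permanently broken pipe fails all max_retries attempts
    and the final exception propagates ("Aborting")."""
    for attempt in range(1, max_retries + 1):
        try:
            return build()
        except IOError:
            if attempt == max_retries:
                raise  # too many exceptions/rejections -> abort
            # otherwise: "This is try %d/%d, so trying again."
```

Which is why a broken pipe produces exactly 20 identical failures and then the abort message.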
Hmmm. Well, two of the catalogs live on a part of the file system I don't normally use for storing data. That disk is connected to the compute nodes through a fast infiniband connection, and I wouldn't think that involves SSL, but maybe I am wrong. I will ask the helpdesk. In the meantime, I am moving the catalogs to the disks where I normally keep my data, and will try restarting the jobs. I'll do at least one of them with verbose=3. So: if it all works after this move, fine. If it still doesn't work after this move, at least I will have a job with verbose=3.
Update: I launched all the jobs last night, and now I have a new problem: basically all the jobs have gone into zombie mode (they are still nominally running according to PBS, but when I log into the node, none of the processes is actually running). As an experiment, I killed one of the jobs through PBS. The output file then indicated that nothing had happened for the previous 7 hours (i.e., I killed it at 8:45am and the output file had a time stamp of 1:44am). There are no exceptions whatsoever in the output file. So basically all the worker processes have stopped running, but the parent process that launched them doesn't seem to recognize it and just sits there forever. This seems like a multiprocessing error. There are also a surprising number of rejections/rebuilds considering the very lax rejection criterion I've been using, so I need to dig into that. For completeness, here were the obvious differences between the old jobs I ran using this framework and the new ones:
Could multiprocessing issues have crept into GalSim since v1.4? Or could it be some interaction between multiprocessing and RealGalaxyCatalog? (have you ever used that for large-scale sims before? I have not)
In the meantime, here's what I'm trying: I'm running a version that doesn't run the rejection module at all. And I'm considering switching the sims entirely from multiprocessing the images to multiprocessing the postage stamps within the images, in hopes that this avoids whatever is causing the issue with multiprocessing the images.
@rmjarvis - follow-up question here. From what I can tell, part of the issue is that some images finish in a normal amount of time while others take much longer for reasons I can't determine. If I have 16 processes, each doing one image, and 15 images are done in 2 hours but another is still going 8 hours later, the 15 cores that did the first 15 images just sit idle rather than being assigned another set of images. Is there a better way to do things in this context?
It's hard to say. If you know that one file will take much longer than the others, then maybe running in image.nproc mode would be faster, doing one file at a time, but multiprocessing across stamps. You'll take a hit in interprocess communication, but maybe that's ok to make sure you don't leave 15 cpus idle at the end.
On the other hand, I usually run many more than just 16 files. Usually I'll set up the jobs (i.e. using -n njobs -j jobnum) so that each job has enough files for each core to make a few of them. Then if one file (especially one of the earlier ones) takes a long time, the other cores go on to other files and make them while that one is still working on its first file. That stays efficient until every file has been started by some core, after which you'd just have to wait until they all finish.
But I think the fundamental problem here is understanding why some of the files seem to hang and not make further progress. I don't have a good guess about that. Although the short answer to your question about whether a bug could have crept in since 1.4 is certainly yes. I just don't have any idea what it might be.
Maybe try switching to a cgc run to see if it is a problem with the real galaxies somehow. Or maybe turn off the switching among rgc inputs for the different subfields. That feature isn't much used so maybe there is a bug in it that hasn't cropped up previously.
Another idea: you could switch off multiprocessing entirely and just let the job queue do that for you, by making lots of jobs with different job numbers in the -n njobs -j jobnum usage and asking for 1 core per job.
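The -n/-j split might look roughly like this round-robin (files_for_job is purely illustrative; GalSim's actual partitioning of files across jobs may differ):

```python
def files_for_job(nfiles, njobs, jobnum):
    """Which file indices a given job handles under a round-robin
    split, mimicking the -n njobs -j jobnum usage (jobnum is 1-based).
    Each PBS job then asks for a single core and runs serially."""
    return [k for k in range(nfiles) if k % njobs == jobnum - 1]
```

E.g. 80 files across 16 single-core jobs gives each job 5 files, and the queue scheduler handles the load balancing instead of Python's multiprocessing.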
I thought about your last idea, but here's my question: in the current mode of working (16-core multiprocessing using output.nproc=-1), how many copies of the catalogs etc. are kept in memory? If there's just one copy that all the cores can access somehow, then I can't do it the way you suggested: I'm already using half the RAM on the node, and 16 separate processes each holding the catalogs in memory would be way too much RAM. But if every process already has its own copy of everything the way I'm doing it now, then it's fine to try it the way you suggested.
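For what it's worth, the tracebacks above go through multiprocessing's manager machinery (the _callmethod and conn.send frames), which is the pattern where a single object lives in one manager process and every worker reaches it through a pipe. A toy sketch of that pattern, with made-up class names (not GalSim's actual design, which the tracebacks only hint at):

```python
import multiprocessing.managers

class Catalog:
    """Toy stand-in for a shared catalog object."""
    def __init__(self, nobjects):
        self._n = nobjects

    def getNObjects(self):
        return self._n

class CatalogManager(multiprocessing.managers.BaseManager):
    pass

CatalogManager.register('Catalog', Catalog)

def nobjects_via_proxy():
    # The one real Catalog lives inside the manager process; cat here
    # is only a proxy, and every method call travels over a pipe.  If
    # the manager goes away while workers still hold proxies, those
    # calls fail with exactly the "Broken pipe" errors seen above.
    with CatalogManager() as manager:
        cat = manager.Catalog(100)
        return cat.getNObjects()
```

If the catalogs are shared this way, there is one copy in the manager process rather than one per worker, which would bear on the RAM question.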
On the other hand, I usually run many more than just 16 files.
I haven't been explaining well, clearly! My jobs are for 80 files each, and the nodes have 16 cores. So it farms out the first 16 images to 16 cores, but then some of those cores are taking a long time, and it's seemingly not re-assigning the other cores for which the image processing finished sooner - unless I'm misunderstanding something.
Do you have suggestions for a way to set up a quick debug cycle? I can try other things like not having lists of catalogs, reverting to an older version of GalSim, and so on, but the problem I'm having is that if I make a change like reducing the # of galaxies per image by a factor of 100, I don't hit the problem at all. So it's taking many, many hours just to ascertain whether things are running or what is going on.
I'm slightly confused between the terms file and image. Are your output files 1 image each? If so, then with 80 files, once one is done, it should write it to disk and then move on to the next one. There is no reason it should wait for the other 15 processes that started files at the same time.
Are you saying that 15 files get written to disk, but those 15 processes don't start the next file? That seems very strange. (As are the other symptoms you're describing, so maybe I shouldn't be so put off by it.)
As for the quicker debug cycle, the problem is probably not with the PSF, so switching that to a Gaussian would speed things up. (InterpolatedImage is rather slow.) Also turn off whitening, since that's probably not relevant, so it's just taking time. Could also use larger pixels to make the FFTs faster. (stamp_size = 8, pixel_scale = 1.2 for instance)
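Those suggestions as a config fragment (key names follow the usual GalSim config conventions but haven't been tested against this particular sim, and the sigma value is an arbitrary choice for debugging):

```yaml
psf:
    type: Gaussian      # swap out the slow InterpolatedImage PSF
    sigma: 1.5          # arbitrary, just for debugging speed
image:
    pixel_scale: 1.2    # larger pixels -> smaller, faster FFTs
    stamp_size: 8
    # also drop any noise whitening while debugging, since it only
    # costs time and probably isn't relevant to the hangs
```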
Are you saying that 15 files get written to disk, but those 15 processes don't start the next file?
Yes.
Ironic update: I decided that in the interest of using a better-vetted version of GalSim, I would go to releases/1.4. So I did that and launched some jobs last night. They died with the first type of error I was describing in this issue (broken pipes, IOErrors), which I thought were gone for good now that I moved the data to my usual place. When I saw that, I realized that the jobs with those problems last night and before that had one thing in common: I was using a different set of compute nodes than usual by submitting to a different queue. I have never used that queue before. I suspect the nodes have a wonky connection to the disk (any disk). So anyway, I am trying again on the usual nodes, the ones where I've never seen that error, and we'll see if the second type of error (the "continuing indefinitely with almost no output" problem) happens. If they do, then I will do more drastic restructuring of my script (no list of catalogs, no multiprocessing, etc.).
More information:
With releases/1.4 (so I guess corresponding to v1.4.2), and without using the problematic nodes, things have been chugging along... mostly.
I noticed that a few jobs had not produced output in about 12 hours, while other jobs were slowly but steadily producing output. When I killed the jobs that hadn't produced output in about 12 hours, the PBS output files - the ones that catch all job output and error messages - (a) did not have any exceptions in them, it was just the usual-looking output from a big config run with verbosity 1, and (b) had time stamps from about 12 hours ago. It's as if the job really stopped 12 hours ago for unknown reasons, but didn't exit. Very very strange.
So this is there in v1.4, and doesn't have to be associated with an error. It seems like it's some strange multiprocessing problem.
Since some jobs are chugging along and there are no obvious errors in the output, I'm going to just keep going and resubmitting jobs, I think.
@rmjarvis - the last comment on this issue has the latest status (the problem is clearly there in v1.4). I basically brute-forced my way through, killing jobs that seemed to be lingering for no reason with no output. I don't have time at the moment to seriously investigate the possible causes (is it a bug associated with multiprocessing RealGalaxy's, or something to do with multiple catalogs?). My suggestion is as follows:
Sound OK?
OK by me. It's a minor enough change, it can probably be merged in directly. But if you want a PR for it, I can do that too. Before we give up just yet though, let me ask a couple more questions.
Do you know what the last debug logging that gets printed is for the ones that stall out? I think it was right around when they are writing the output file. And I think you said the output files did get written. So does it get to this line?
logger.debug('file %d: Wrote %s to file %r',file_num,output_type,file_name)
After that, it should write the truth file, so did those get written properly?
Then it should output a logger statement saying it finished that file:
logger.warning(s0 + 'File %d = %s: time = %f sec', file_num, file_name, t)
(s0 should be the process number prefix, so something like "Process-3: ".)
The next thing normally would be for the process to report that it grabbed the next job:
logger.debug('%s: Received job to do %d %ss, starting with %s', proc,len(task),item,task[0][1])
But I think that's probably not happening based on what you've said so far.
I'm closing this in the (probably unreasonable) hope that the various changes to the config layer over the past year have magically fixed whatever was wrong here. But if you (or anyone else) sees this recur and has additional clues to what might be going wrong, please do reopen.
Hi @rmjarvis - I launched a job to generate a bunch of sims last night, and a number of the jobs died with rather cryptic errors like this:
Have you ever seen this and/or do you have any idea what this is?