I think I may have found the answer. I should modify SIRISAACDIR!
Yes, apologies that the current parallelization setup is a bit hacky. Changing SIRISAACDIR should work for now—this will be done automatically during installation in a future release.
Changing SIRISAACDIR did not fix this for me. I'm getting the following error:
Failed to import libsbml.
SBML import and export not available.
Failed to import pygraphviz. Network figures unavailable.
SloppyCellFittingModel.fitToData: generating ensemble for these parameters: ['log_beta_0', 'g_0_0']
generateEnsemble_pypar: Generating parameter ensemble with 1000.0 total members, using 2 processors.
Traceback (most recent call last):
File "/home/sid/pyProjects/SirIsaac-develop/simpleExample.py", line 150, in <module>
p.fitAll()
File "/home/sid/pyProjects/SirIsaac-develop/SirIsaac/fittingProblem.py", line 845, in fitAll
FittingProblem.fitAll(self,**kwargs)
File "/home/sid/pyProjects/SirIsaac-develop/SirIsaac/fittingProblem.py", line 217, in fitAll
fittingDataDerivs=fittingDataDerivs,**kwargs)
File "/home/sid/pyProjects/SirIsaac-develop/SirIsaac/fittingProblem.py", line 1619, in fitToData
dataModel,initialParameters,verbose=self.verbose)
File "/home/sid/pyProjects/SirIsaac-develop/SirIsaac/fittingProblem.py", line 2643, in generateEnsemble_pypar
+ " error in generateEnsembleParallel.py"
Exception: generateEnsemble_pypar: error in generateEnsembleParallel.py
generateEnsemble_pypar error:
Process finished with exit code 1
Please help.
Hello @sidambhire — thanks for your question.
Did you try running with numprocs = 1? Does that produce any errors?
Hello @bcdaniels I tried running it with numprocs = 1. There were no errors, at least in the beginning, but it was running too slowly, so I did not let it run all the way to the end.
I also tried digging into the code. The error is most likely in generateEnsembleParallel.py, probably in the way the data is passed to the MPI processes. How do I get more information about the error from the MPI processes?
Hmm, yes, it seems that there is an error in one of the parallel processes, but only when you run in parallel.
If you feel comfortable with debugging a bit, you might try this:
1) Interrupt the code in fittingProblem.generateEnsemble_pypar by adding, say, a call to exit() just before the line that says # call mpi (around line 2623); see the sketch below.
2) Running your simpleExample.py code should now produce the temporary file that gets passed to the MPI processes, which will be called something like temporary_XXX_generateEnsemble_pypar_inputDict.data.
3) Then you can try to run mpi from the command line yourself by typing the following (replace the last part with the actual .data filename for your system):
mpirun -np 10 python SirIsaac/generateEnsembleParallel.py temporary_XXX_generateEnsemble_pypar_inputDict.data
With any luck, this will give you more information about the actual error that's happening in the parallel code. Let me know if you encounter any other trouble here.
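To be concrete about step 1, the change would look something like the following sketch; the surrounding lines are placeholders rather than the real source, and only the exit() call is the suggested temporary addition:

# In SirIsaac/fittingProblem.py, inside generateEnsemble_pypar, near line 2623.
# (Placeholder context; only the exit() line below is the temporary addition.)
# ... existing code that writes temporary_XXX_generateEnsemble_pypar_inputDict.data ...
exit()  # temporary: stop before mpirun is launched so the input file is left on disk
# call mpi
# ... existing code that builds and launches the mpirun command ...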
Hi @bcdaniels, I did the debugging steps you asked for. Here's the relevant part of my output:
Traceback (most recent call last):
File "SirIsaac/generateEnsembleParallel.py", line 38, in <module>
inputDict = load(inputDictFile)
File "/home/sid/pyProjects/SirIsaac/SirIsaac-develop/SirIsaac/simplePickle.py", line 21, in load
obj = cPickle.load(fin)
ImportError: No module named SirIsaac.fittingProblem
I added the following lines at the beginning of simplePickle.py:
sys.modules['SirIsaac.FittingProblem'] = fittingProblem
sys.modules['SirIsaac.fittingProblem'] = fittingProblem
It gave me the following error when I ran the mpirun command:
Traceback (most recent call last):
File "SirIsaac/generateEnsembleParallel.py", line 38, in <module>
inputDict = load(inputDictFile)
File "/home/sid/pyProjects/SirIsaac/SirIsaac-develop/SirIsaac/simplePickle.py", line 23, in load
obj = cPickle.load(fin)
ImportError: No module named SirIsaac.gaussianPrior
I think the pickle file is saving the objects with the full module path, SirIsaac.fittingProblem, rather than just fittingProblem. I'm wondering what the best way to fix this would be. I tried several things, but none of them worked; there should be an easier way than adding those modules to sys.modules one by one.
I also tried with only the line sys.modules['SirIsaac.FittingProblem'] = fittingProblem, but that didn't work either. I also encountered this problem while running simpleExample.ipynb, where I had to change the capitalization of the module name in the import to get the code to run. (That may be an unrelated issue.)
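For reference, the mapping I was adding looks roughly like the following standalone sketch (in practice I edited simplePickle.py itself; the bare-module imports and the filename are placeholders):

# Sketch of the workaround (Python 2): register the bare modules under the
# fully qualified names the pickle file expects before unpickling.
import sys
import cPickle

import fittingProblem   # the modules as importable from inside the SirIsaac folder
import gaussianPrior

sys.modules['SirIsaac.fittingProblem'] = fittingProblem
sys.modules['SirIsaac.FittingProblem'] = fittingProblem
sys.modules['SirIsaac.gaussianPrior'] = gaussianPrior

with open('temporary_XXX_generateEnsemble_pypar_inputDict.data', 'rb') as fin:
    inputDict = cPickle.load(fin)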
It looks like there are maybe two things going on here:
1) The capitalization issue (FittingProblem vs. fittingProblem, etc.) should be fixed if you use just the develop branch. The master branch still uses the old capitalized versions (FittingProblem), but everything should be consistent on the develop branch (fittingProblem). If not, this is a bug that should be fixed. It's possible that if you ran some code using the master branch and then switched to the develop branch, you could still have some of the old files hanging around.
2) There's some issue with the code not being able to find the SirIsaac package when it's loading the .data file. My guess is that you have not installed SirIsaac in a place where it can be found on the Python path. You might try installing SirIsaac using pip (using the develop branch): pip install -e [path to SirIsaac folder]. I didn't realize this could be a problem; if this ends up working for you, let me know, and I'll mention it in the installation instructions.
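If it helps, a quick way to check the path issue (just a sketch, not part of SirIsaac) is to see where the package resolves from, using the same interpreter that mpirun launches:

# Quick check: confirm SirIsaac is importable and points at the expected location
# (e.g. the editable install), not a stale copy somewhere else on the path.
import SirIsaac.fittingProblem as fittingProblem
print(fittingProblem.__file__)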
OK, I got the error in simplePickle.py sorted. I switched to the develop branch and did a clean build by removing SirIsaac and SloppyCell. I also reinstalled the virtualenv.
I have also made the change to SIRISAACDIR. The error remains the same. I also noticed that the mpirun command runs for a while (more than 15 minutes) without any errors (I stopped it after that), but when I try to run simpleExample.py with numprocs=2 it gives me the same error I got in the beginning, and the error appears very quickly, so I think it fails to properly run the mpirun command that launches generateEnsembleParallel.py. Do you have any suggestions regarding this?
Also, although the mpirun -np 2 python SirIsaac/generateEnsembleParallel.py temporary_XXX_generateEnsemble_pypar_inputDict.data command spawns 2 processes (note -np 2), there is only one line saying that it is generating ensembles. Is that the correct behavior?
Thanks for the info. Just to confirm: you now have SirIsaac in the Python path, so that running mpirun directly does not immediately give an error, but you still get a quick error from generateEnsembleParallel.py when running p.fitAll() in simpleExample.py? Strange...
It will take me some more time to debug this fully. (I think I will need to figure out how to correctly pass errors from the MPI processes back to the parent Python process.)
In the meantime, could you tell me what system you're running on and what version of MPI you're using?
I checked whether the path variable was set; it was blank. I was able to import SirIsaac under mpirun -np 2 python test.py, though.
My system details are as follows:
$ cat /proc/version
Linux version 5.4.24-1-MANJARO (builder@ba48f5931f62) (gcc version 9.2.1 20200130 (Arch Linux 9.2.1+20200130-2)) #1 SMP PREEMPT Thu Mar 5 20:29:25 UTC 2020
$ python --version
Python 2.7.17
$ mpirun --version
mpirun (Open MPI) 4.0.2
@bcdaniels Sorry for commenting again, but I think this may be helpful. I tried running the code in the fix-parallel-error-msgs branch. It seems to have fixed the issue of mpirun errors not being displayed. I think pip install -e [path to SirIsaac folder] should be added to the installation instructions. Usually setup.py should take care of this, shouldn't it?
But now the code has been running for more than half an hour and is still generating the ensemble:
SloppyCellFittingModel.fitToData: generating ensemble for these parameters: ['log_beta_0', 'g_0_0']
generateEnsemble_pypar: Generating parameter ensemble with 1000.0 total members, using 2 processors.
It is not showing information about what it is doing (warning messages, etc.) the way it does with numprocs=1. I think that is to be expected, because in your implementation the stdout of mpirun is only displayed after the mpirun process completes. I think the information on this website may help. Also, it should have generated the ensemble within half an hour of running, shouldn't it? With one processor it finishes in much less time.
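I think the fix would be something along these lines (just a rough sketch using the placeholder command from above, not SirIsaac's current implementation):

# Sketch (Python 2.7): stream mpirun's stdout line by line instead of waiting
# for the whole process to finish before printing anything.
import subprocess

cmd = ['mpirun', '-np', '2', 'python',
       'SirIsaac/generateEnsembleParallel.py',
       'temporary_XXX_generateEnsemble_pypar_inputDict.data']

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT, bufsize=1)
for line in iter(proc.stdout.readline, ''):
    print(line.rstrip())   # echo each line as soon as the child produces it
proc.stdout.close()
returncode = proc.wait()
print('mpirun exited with code %d' % returncode)
# Note: the child can still buffer its own output; launching it with "python -u"
# forces unbuffered output so lines really do appear in real time.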
Thanks for the message! This is very useful. I'll go ahead and merge in the fix-parallel-error-msgs branch and update the installation instructions.
As for why the parallel case seems to be hanging, I'm not sure. I'll run some tests on my computer to make sure it's working for me. If you know of a way to get messages to be displayed in real time from the parallel processes, that would be great. I'm no expert on Unix pipes, etc.
Testing this on my own machine (Mac OS 10.14.6, Python 2.7.17, mpirun (Open MPI) 4.0.1), I seem to be having the same issue. Running the example in simpleExample.py hangs when trying to run in parallel, but not in serial. My only guess for the moment is that newer versions of MPI may be causing an issue. I did have problems a while back using Open MPI 3, so I reverted back to 2.1.1, which seemed to work. Perhaps we are seeing the same issue with Open MPI 4.
I will look more into this in the coming days. Note that this may be an issue with pypar, which we eventually want to move away from anyway (see issue #12).
I made a small piece of code to get realtime output from the MPI process. See the attached files:
MPI realtime output example.zip
Run the test.py file. Note: I've imported SirIsaac in the MPI process just to see if it works.
Unfortunately I cannot downgrade Open MPI on my computer. I'll try other MPI implementations in the meantime.
Hi @bcdaniels, I know the code is not complete yet, but I modified the code in the pypar-to-mpi4py branch to output the stdout from the mpirun process. It worked: I could get output from the mpirun process, but there seems to be no output coming from the MPI process after it starts generating the ensemble. It just hangs after printing:
MPI : generateEnsemble: Generating parameter ensemble with 1000.0 total members, using 1 processor.
Any idea what's happening here?
Also, I can share the code for showing the mpirun process output in real time. Tell me if you want it; I'm not sure whether it will be useful after you switch to mpi4py, though I think it will be if you still run the mpirun command inside your Python script with mpi4py.
Thanks! Yes, I'm still debugging to figure out the issue with hanging while generating the ensemble in parallel. I'll post any progress here soon.
Sure, your code might come in handy. Perhaps the best thing would be for you to fork the repository, include your code in a new branch, and then open a pull request here; we can merge it in once everything is debugged. (Then you'd get proper credit for your contribution.)
Just a note to say that I'm still working on this. My current hypothesis is that this is an issue with SloppyCell's parallel code (see https://github.com/GutenkunstLab/SloppyCell/issues/2).
This has been fixed with pull #24. Please let me know if you still have issues with running the example code.
Thanks @bcdaniels,
I cloned the development branch and installed using setup.py. I'm getting the following error when I run simpleExample.py with numprocs=2:
Error log.
The error has changed, so it is not hanging anymore, but there still seems to be some bug.
Sorry this took so long, but we now have a major update in that SirIsaac (and SloppyCell) are running completely in python 3 (#29). My hunch is that this will fix these remaining issues with running in parallel. If you're still interested, let me know if you have issues with running the example code using the new version under python 3. Otherwise I'll close this with the assumption that it is working. Thanks!
I'm trying to run the simpleExample. When I set numprocs equal to 1, everything works fine. However, when I set numprocs greater than 1, an exception occurs.
I run it in a Docker container. The system is Ubuntu 16.04. I use Anaconda to manage Python and created a Python 2.7 environment called SirIssac.
I don't know how to fix it.
Here is the message:
Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[57214,1],0]
Exit code: 2
Exception Traceback (most recent call last)