EmoryUniversityTheoreticalBiophysics / SirIsaac

Automated dynamical systems inference
MIT License

Parallel run error #17

Closed IntegerLuoHua closed 2 years ago

IntegerLuoHua commented 5 years ago

I'm trying to run the simpleExample. When I set numprocs equal to 1, everything works fine. However, when I set numprocs greater than 1, an exception occurs.

I'm running it in a Docker container. The system is Ubuntu 16.04. I use Anaconda to manage Python, and I created a Python 2.7 environment called SirIssac.

I don't know how to fix it.

Here is the message:

Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[57214,1],0]
Exit code: 2

Exception Traceback (most recent call last)

in ()
      2
      3 time_start=time.time()
----> 4 p.fitAll()
      5 time_end=time.time()
      6

/data/SirIsaac-master/FittingProblem.pyc in fitAll(self, **kwargs)
    852
    853 def fitAll(self,**kwargs):
--> 854 FittingProblem.fitAll(self,**kwargs)
    855 # we also want to save the convergence information in a
    856 # convenient location:

/data/SirIsaac-master/FittingProblem.pyc in fitAll(self, usePreviousParams, fitPerfectModel, resume, maxNumFit, **kwargs)
    224 fittingModel.fitToData(self.fittingData,self.indepParamsList, \
    225 otherStartingPoint=smallerBestParams, \
--> 226 fittingDataDerivs=fittingDataDerivs,**kwargs)
    227
    228 if not hasattr(self,'fittingDataDerivs'):

/data/SirIsaac-master/FittingProblem.pyc in fitToData(self, fittingData, indepParamsList, _unclampedSpeciesID, otherStartingPoint, fittingDataDerivs, createEnsemble)
   1623 if self.numprocs > 1:
   1624 ens,ratio = self.ensGen.generateEnsemble_pypar(self.numprocs,
-> 1625 dataModel,initialParameters,verbose=self.verbose)
   1626 else:
   1627 ens,ratio = self.ensGen.generateEnsemble(dataModel,

/data/SirIsaac-master/FittingProblem.pyc in generateEnsemble_pypar(self, numprocs, dataModel, initialParameters, returnCosts, scaleByDOF, verbose)
   2629 os.remove(prefix+"stdout.txt")
   2630 raise Exception, "generateEnsemble_pypar:" \
-> 2631 + " error in generateEnsembleParallel.py"
   2632
   2633 return output

Exception: generateEnsemble_pypar: error in generateEnsembleParallel.py

IntegerLuoHua commented 5 years ago

I think I may have found the answer.

I should modify SIRISAACDIR!!!

bcdaniels commented 5 years ago

Yes, apologies that the current parallelization setup is a bit hacky. Changing SIRISAACDIR should work for now—this will be done automatically during installation in a future release.

sidambhire commented 4 years ago

Changing SIRISAACDIR does not fix this for me. I'm getting the following error:

Failed to import libsbml.
SBML import and export not available.
Failed to import pygraphviz.  Network figures unavailable.
SloppyCellFittingModel.fitToData: generating ensemble for these parameters: ['log_beta_0', 'g_0_0']
generateEnsemble_pypar: Generating parameter ensemble with 1000.0 total members, using 2 processors.
Traceback (most recent call last):
  File "/home/sid/pyProjects/SirIsaac-develop/simpleExample.py", line 150, in <module>
    p.fitAll()
  File "/home/sid/pyProjects/SirIsaac-develop/SirIsaac/fittingProblem.py", line 845, in fitAll
    FittingProblem.fitAll(self,**kwargs)
  File "/home/sid/pyProjects/SirIsaac-develop/SirIsaac/fittingProblem.py", line 217, in fitAll
    fittingDataDerivs=fittingDataDerivs,**kwargs)
  File "/home/sid/pyProjects/SirIsaac-develop/SirIsaac/fittingProblem.py", line 1619, in fitToData
    dataModel,initialParameters,verbose=self.verbose)
  File "/home/sid/pyProjects/SirIsaac-develop/SirIsaac/fittingProblem.py", line 2643, in generateEnsemble_pypar
    + " error in generateEnsembleParallel.py"
Exception: generateEnsemble_pypar: error in generateEnsembleParallel.py
generateEnsemble_pypar error:

Process finished with exit code 1

Please help.

bcdaniels commented 4 years ago

Hello @sidambhire — thanks for your question.

Did you try running with numprocs = 1? Does that produce any errors?

sidambhire commented 4 years ago

Hello @bcdaniels, I tried running it with numprocs = 1. There were no errors, at least in the beginning. It was running too slowly, so I did not let it run to the end.

I also tried to dig into the code. I think the error is most probably in generateEnsembleParallel.py, likely in the way the data is passed to the MPI processes. How do I get more information about the error from the MPI processes?

bcdaniels commented 4 years ago

Hmm, yes, it seems that there is an error in one of the parallel processes, but only when you run in parallel.

If you feel comfortable with debugging a bit, you might try this:

1) Interrupt the code in fittingProblem.generateEnsemble_pypar by adding, say, a call to exit() just before the line that says # call mpi (around line 2623).

2) Run your simpleExample.py code, which should now produce the temporary file that gets passed to the MPI processes. It will be called something like temporary_XXX_generateEnsemble_pypar_inputDict.data.

3) Then try to run mpi from the command line yourself by typing the following (replace the last part with the actual .data filename for your system):

mpirun -np 10 python SirIsaac/generateEnsembleParallel.py temporary_XXX_generateEnsemble_pypar_inputDict.data

With any luck, this will give you more information about the actual error that's happening in the parallel code. Let me know if you encounter any other trouble here.

sidambhire commented 4 years ago

Hi @bcdaniels, I followed the debugging steps you suggested. Here's the relevant part of my output:

Traceback (most recent call last):
  File "SirIsaac/generateEnsembleParallel.py", line 38, in <module>
    inputDict = load(inputDictFile)
  File "/home/sid/pyProjects/SirIsaac/SirIsaac-develop/SirIsaac/simplePickle.py", line 21, in load
    obj = cPickle.load(fin)
ImportError: No module named SirIsaac.fittingProblem

I added the following lines at the beginning of simplePickle.py:

sys.modules['SirIsaac.FittingProblem'] = fittingProblem
sys.modules['SirIsaac.fittingProblem'] = fittingProblem

It gave me the following error when I ran the mpirun command:

Traceback (most recent call last):
  File "SirIsaac/generateEnsembleParallel.py", line 38, in <module>
    inputDict = load(inputDictFile)
  File "/home/sid/pyProjects/SirIsaac/SirIsaac-develop/SirIsaac/simplePickle.py", line 23, in load
    obj = cPickle.load(fin)
ImportError: No module named SirIsaac.gaussianPrior

I think the pickle file is saving the objects under their full module name, SirIsaac.fittingProblem, and not as fittingProblem. I'm wondering what the best way to fix this would be. I tried several things, but none of them worked. There should be an easier way to fix this than adding those modules one by one.

I also tried with only the line sys.modules['SirIsaac.FittingProblem'] = fittingProblem, but it didn't work. I also ran into this problem while running simpleExample.ipynb, where I had to change the capitalization of the module name in the import to get the code to run. (That may be a separate, unrelated issue.)
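A minimal sketch of why the unpickling fails, assuming only standard pickle behavior: pickle records the fully qualified module name of each class and re-imports that module on load. The names fakepkg, fakepkg.models, and Prior are hypothetical stand-ins and are not part of SirIsaac.

```python
import pickle
import sys
import types

# Build a stand-in package "fakepkg" with a submodule "fakepkg.models".
# These names are hypothetical and exist only for this illustration.
pkg = types.ModuleType("fakepkg")
mod = types.ModuleType("fakepkg.models")


class Prior(object):
    def __init__(self, width):
        self.width = width


# Make the class look as if it had been defined in fakepkg.models.
Prior.__module__ = "fakepkg.models"
mod.Prior = Prior
pkg.models = mod
sys.modules["fakepkg"] = pkg
sys.modules["fakepkg.models"] = mod

data = pickle.dumps(Prior(2.0))

# The fully qualified module name is stored in the pickled bytes.
stored_name = b"fakepkg.models" in data

# Loading works while the package can be imported...
restored = pickle.loads(data)

# ...and fails with an ImportError once it cannot be found.
del sys.modules["fakepkg"], sys.modules["fakepkg.models"]
try:
    pickle.loads(data)
    load_failed = False
except ImportError as err:
    load_failed = True
    print("unpickling failed:", err)
```

Under this reading, the process that loads the .data file needs the whole package to be importable under the same name that was used when the file was written, which is why aliasing modules one at a time in sys.modules only moves the error to the next module.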

bcdaniels commented 4 years ago

It looks like there are maybe two things going on here:

1) The capitalization issue (FittingProblem vs fittingProblem, etc.) should be fixed if you use just the develop branch. The master branch still uses the old capitalized versions (FittingProblem), but everything should be consistent on the develop branch (fittingProblem). If not, this is a bug that should be fixed. It's possible that if you ran some code using the master branch and then switched to the develop branch you could still have some of the old files hanging around.

2) There's some issue with the code not being able to find the SirIsaac package when it's loading the .data file. My guess is that you have not installed SirIsaac in a place where it can be found on the Python path. You might try installing SirIsaac using pip (using the develop branch): pip install -e [path to SirIsaac folder]. I didn't realize this could be a problem. If this ends up working for you, let me know, and I'll mention this in the installation instructions.
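One way to check point 2 is to ask whether a package can be found both by the current interpreter and by a freshly started one, since the MPI workers are separate processes. This is a rough sketch using only the standard library; the helper names importable_here and importable_in_child are made up for this example, and it only handles top-level package names.

```python
import importlib.util
import subprocess
import sys


def importable_here(name):
    """Return True if the top-level package `name` is on this interpreter's path."""
    return importlib.util.find_spec(name) is not None


def importable_in_child(name):
    """Return True if a freshly started interpreter can import `name`."""
    result = subprocess.run(
        [sys.executable, "-c", "import " + name],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Replace "json" with "SirIsaac" to test the actual installation.
    print(importable_here("json"), importable_in_child("json"))
```

If the first check passes but the second fails, the package is probably only visible because of the current working directory or a path set up inside the running session, which an installation with pip install -e would be expected to resolve.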

sidambhire commented 4 years ago

OK. I got the error in simplePickle.py sorted out. I switched to the develop branch, did a clean build by removing SirIsaac and SloppyCell, and reinstalled the virtualenv.

I have also made the change in SIRISAACDIR. The error remains the same. I also noticed that the mpirun command runs for a while without any errors (I stopped it after more than 15 minutes), but when I run simpleExample.py with numprocs=2 I get the same error as in the beginning, and it appears very quickly. So I think it fails to run the mpirun command properly in generateEnsembleParallel.py. Do you have any suggestions regarding this?

Also, although the command mpirun -np 2 python SirIsaac/generateEnsembleParallel.py temporary_XXX_generateEnsemble_pypar_inputDict.data spawns 2 processes (note -np 2), there is only one line saying that it is generating ensembles. Is that the correct behavior?

bcdaniels commented 4 years ago

Thanks for the info. Just to confirm: you now have SirIsaac in the Python path so that running mpirun directly does not immediately give an error, but you still get a quick error in generateEnsembleParallel.py when running p.fitAll() in simpleExample.py? Strange...

It will take me some more time to debug this fully. (I think I will need to figure out how to correctly pass errors from the mpi processes back to the python process.)

In the meantime, could you tell me what system you're running on and what version of mpi that you're using?
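On passing errors from the MPI processes back to the Python process: one possible approach, sketched here with the standard subprocess module, is to capture the child's stderr and include it in the exception that gets raised. This is an assumption about how it could be done, not a description of what SirIsaac does; run_and_report is a hypothetical helper name.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run `cmd`; on failure, raise an error that includes the child's stderr."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        raise RuntimeError(
            "command failed with exit code %d:\n%s"
            % (result.returncode, result.stderr)
        )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for the mpirun command line: a child process that fails.
    try:
        run_and_report([sys.executable, "-c", "raise ValueError('boom')"])
    except RuntimeError as err:
        print(err)
```

With something like this, the ImportError seen when running mpirun by hand would appear in the exception message instead of being replaced by a generic error.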

sidambhire commented 4 years ago

I checked whether the path variable was set. It was blank. I was able to import SirIsaac when running mpirun -np 2 python test.py, though.

My system details are as follows:

$ cat /proc/version
Linux version 5.4.24-1-MANJARO (builder@ba48f5931f62) (gcc version 9.2.1 20200130 (Arch Linux 9.2.1+20200130-2)) #1 SMP PREEMPT Thu Mar 5 20:29:25 UTC 2020
$ python --version
Python 2.7.17
$ mpirun --version
mpirun (Open MPI) 4.0.2

sidambhire commented 4 years ago

@bcdaniels Sorry for commenting again, but I think this may be helpful. I tried running the code in the fix-parallel-error-msgs branch. It seems to have fixed the issue of mpirun errors not being displayed. I think pip install -e [path to SirIsaac folder] should be added to the installation instructions. Usually setup.py should take care of it, shouldn't it?

But now the code has been running for more than half an hour and is still doing ensemble generation:

SloppyCellFittingModel.fitToData: generating ensemble for these parameters: ['log_beta_0', 'g_0_0']
generateEnsemble_pypar: Generating parameter ensemble with 1000.0 total members, using 2 processors.

It is not showing any information about what it is doing (warning messages, etc.), unlike in the numprocs=1 case. I think this is to be expected, because in your implementation the stdout of mpirun is only displayed after the mpirun process completes. I think information on this website may help. Also, it should have generated that ensemble within half an hour of running, shouldn't it? With one processor it is done in much less time.

bcdaniels commented 4 years ago

Thanks for the message! This is very useful. I'll go ahead and merge in the fix-parallel-error-msgs branch and update the installation instructions.

As for why the parallel case seems to be hanging, I'm not sure. I'll run some tests on my computer to make sure it's working for me. If you know of a way to get messages to be displayed in real time from the parallel processes, that would be great. I'm no expert on Unix pipes, etc.
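A rough sketch of one way to display messages in real time, assuming the parallel job is launched as a child process from Python: read the child's stdout line by line through a pipe instead of waiting for it to finish. The function name stream_output and the "child: " prefix are invented for this example.

```python
import subprocess
import sys


def stream_output(cmd, prefix="child: "):
    """Run `cmd`, echoing each line of its output as soon as it arrives."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        universal_newlines=True,
        bufsize=1,                 # line-buffered on the reading side
    )
    lines = []
    for line in proc.stdout:
        line = line.rstrip("\n")
        print(prefix + line)
        sys.stdout.flush()
        lines.append(line)
    proc.stdout.close()
    returncode = proc.wait()
    return returncode, lines


if __name__ == "__main__":
    # "-u" makes the child Python write its output unbuffered.
    code = "import time\nfor i in range(3):\n    print(i)\n    time.sleep(0.5)"
    stream_output([sys.executable, "-u", "-c", code])
```

Note that a Python child buffers its own stdout when it writes to a pipe, so the worker script would need to be started with python -u (or flush explicitly) for lines to arrive as they are printed.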

bcdaniels commented 4 years ago

Testing this on my own machine (Mac OS 10.14.6, Python 2.7.17, mpirun (Open MPI) 4.0.1), I seem to be having the same issue. Running the example in simpleExample.py hangs when trying to run in parallel, but not in serial. My only guess for the moment is that newer versions of MPI may be causing an issue. I did have problems a while back using Open MPI 3, so I reverted back to 2.1.1, which seemed to work. Perhaps we are seeing the same issue with Open MPI 4.

I will look more into this in the coming days. Note that this may be an issue with pypar, which we eventually want to move away from anyway (see issue #12).

sidambhire commented 4 years ago

I wrote a small program to get real-time output from the MPI process. See the attached files: MPI realtime output example.zip. Run the test.py file. Note: I've imported SirIsaac in the MPI process just to see if it works.

Unfortunately, I cannot downgrade Open MPI on my computer. In the meantime, I'll try other MPI software.

sidambhire commented 4 years ago

Hi @bcdaniels, I know the code is not complete yet, but I modified the code in the pypar-to-mpi4py branch to display the stdout of the mpirun process. It worked: I could get output from the mpirun process, but there seems to be no output coming from the MPI process after it starts generating the ensemble. It just hangs after printing:

MPI   : generateEnsemble: Generating parameter ensemble with 1000.0 total members, using 1 processor.

Any idea what's happening here?

Also, I can share the code for showing the mpirun process output in real time. Tell me if you want it, because I'm not sure whether it will be useful after you switch to mpi4py. I think it will also be useful if you run the mpirun command inside your Python script for mpi4py.

bcdaniels commented 4 years ago

Thanks! Yes, I'm still debugging to figure out the issue with hanging while generating the ensemble in parallel. I'll post any progress here soon.

Sure, your code might come in handy—best might be for you to fork the repository, include your code in a new branch, and then open a pull request here and we can merge it in once everything is debugged? (Then you'd get proper credit for your contribution.)

bcdaniels commented 4 years ago

Just a note to say that I'm still working on this. My current hypothesis is that this is an issue with SloppyCell's parallel code (see https://github.com/GutenkunstLab/SloppyCell/issues/2).

bcdaniels commented 4 years ago

This has been fixed with pull #24. Please let me know if you still have issues with running the example code.

sidambhire commented 4 years ago

Thanks @bcdaniels, I cloned the development branch and installed using setup.py. I'm getting the following error when I run simpleExample.py with numprocs=2: Error log. The error has changed, so it is not hanging anymore, but there still seems to be a bug.

bcdaniels commented 2 years ago

Sorry this took so long, but we now have a major update in that SirIsaac (and SloppyCell) are running completely in Python 3 (#29). My hunch is that this will fix these remaining issues with running in parallel. If you're still interested, let me know if you have issues with running the example code using the new version under Python 3. Otherwise I'll close this with the assumption that it is working. Thanks!