Open benjisimone opened 7 months ago
Response from Adiv: Hi Benji,
Sorry to hear you've been having issues! It looks like the problem is that the pyfft module failed to compile; the model itself appears to be running fine (and you should notice files like "MOST.00000" and "MOST_SNAP.00000" in the crash/working folder). pyfft is a Fortran module that gets compiled into something Python can import using the numpy module f2py. This is unfortunately probably the least stable module in numpy, and yet it is the most reliable way to interface Python and Fortran (and trust me, you do not want the routines in pyfft.f90 to be run in native Python; processing the raw outputs into the appropriate derived variables can be computationally intensive, and you want that done by compiled routines in a fast language).
The pyfft compilation step happens when ExoPlaSim is configured/used for the first time, via the configure.sh script in the installation directory, and it ought to generate pyfft.so. The error you're getting indicates that pyfft.so is missing, so there was probably an error during that initial configuration. Sometimes this can be solved by just re-running the configuration (and that reminds me I was going to add a package function for refreshing the configuration; I'll do that sometime in the next few days): go to the directory returned by python3 -c "import exoplasim; print(exoplasim.__path__)" and run ./configure.sh. If that spits out error messages related to f2py and pyfft, they might yield clues as to what's going on.
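One way to confirm whether the configuration step actually produced the compiled module is to look for pyfft.so next to the package's Python files. The snippet below is only a sketch: the helper names (exoplasim_dir, check_compiled) are made up for illustration and are not part of ExoPlaSim, and it assumes the compiled files sit directly in the package directory, as described above.

```python
import importlib.util
from pathlib import Path

# Module names taken from this thread; everything else here is illustrative.
MODULES = ("pyfft", "pyfft991")


def exoplasim_dir():
    """Locate the installed exoplasim package without importing it."""
    spec = importlib.util.find_spec("exoplasim")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


def check_compiled(package_dir, modules=MODULES):
    """Report, per module, whether <name>.so exists and which
    un-renamed f2py outputs (<name>.cpython*.so) are still present."""
    package_dir = Path(package_dir)
    report = {}
    for name in modules:
        leftovers = sorted(p.name for p in package_dir.glob(f"{name}.cpython*.so"))
        report[name] = {
            "ready": (package_dir / f"{name}.so").is_file(),
            "unrenamed": leftovers,
        }
    return report


if __name__ == "__main__":
    pkg = exoplasim_dir()
    if pkg is None:
        print("exoplasim is not installed in this Python environment")
    else:
        for name, status in check_compiled(pkg).items():
            print(pkg, name, status)
```

If "ready" is False but "unrenamed" lists a file, the compilation probably worked and only the rename step failed; if both are empty, the f2py step itself is the place to look.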
Key things for pyfft to compile correctly:
You can also try compiling pyfft directly with f2py by going to the installation directory (the directory returned by exoplasim.__path__) and running
f2py3 -c -m --f90exec=gfortran --f77exec=gfortran --f90flags="-O3" pyfft pyfft.f90 && mv pyfft.cpython*.so pyfft.so
There is a related command that should also be run,
f2py3 -c -m --f90exec=gfortran --f77exec=gfortran --f90flags="-O3" pyfft991 pyfft991.f90 && mv pyfft991.cpython*.so pyfft991.so
which compiles the pyfft991 module. That module is used for certain resolutions like T63, for which the spherical Legendre transformation has to be computed differently; if you're just using T21 and T42 resolutions, you'll never encounter any errors even if it's broken.
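The rename at the end of each command matters because f2py writes its output with an interpreter and platform tag in the file name (for example pyfft.cpython-311-x86_64-linux-gnu.so on Linux with Python 3.11). A rough Python equivalent of that mv step is sketched below; rename_f2py_output is a hypothetical helper, not something ExoPlaSim provides, and it deliberately refuses to guess when it finds zero or several candidate files.

```python
from pathlib import Path


def rename_f2py_output(directory, module):
    """Rename <module>.cpython*.so to <module>.so, like the `mv` step."""
    directory = Path(directory)
    built = sorted(directory.glob(f"{module}.cpython*.so"))
    if len(built) != 1:
        # Zero matches: the compile step produced nothing.
        # Several matches: stale builds that should be cleaned up by hand.
        raise FileNotFoundError(
            f"expected exactly one {module}.cpython*.so in {directory}, "
            f"found {len(built)}"
        )
    target = directory / f"{module}.so"
    built[0].replace(target)  # overwrites an older <module>.so if present
    return target
```

Raising on zero matches makes a failed compilation visible immediately, rather than leaving the problem to surface later as a missing-module error during postprocessing.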
It's possible that just rerunning those commands won't solve the problem and will spit out error messages, but those error messages might help you troubleshoot which libraries aren't being found, if any. Let me know if it turns out not to be a library problem and instead some other thornier problem; I'll see if I can come up with useful suggestions.
Cheers, Adiv
Another issue: Dear Adiv,
I trust this message finds you well. I am writing to seek your guidance on a new challenge I've encountered while running ExoPlaSim on a different cluster. Previously, your support proved instrumental in resolving compatibility and module errors, and for that, I am sincerely grateful.
In the current scenario, I have successfully installed ExoPlaSim on the new cluster, navigating through the compilation of the pyfft module. However, when attempting a test simulation following the provided tutorial, a series of "Permission Denied" errors surfaced, leading to a subsequent runtime crash. The error messages, which I have attached for your reference, include:
touch: cannot touch 'firstrun': Permission denied
./configure.sh: line 73: most_precision_options: Permission denied
./configure.sh: line 74: most_precision_optionsx: Permission denied
./configure.sh: line 92: most_compiler: Permission denied
./configure.sh: line 93: most_compiler: Permission denied
./configure.sh: line 94: most_compiler: Permission denied
...
cp: cannot stat '/opt/Python/Python-3.11.4/lib/python3.11/site-packages/exoplasim/plasim/run/most_plasim_t21_l10_p10': Permission denied
rm: cannot remove '/home/bgonzalez/data_exoplasim/toi700d_run/plasim_restart': No such file or directory
--------------------------------------------------------------------------
mpiexec was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.
...
mkdir: cannot create directory ‘/home/bgonzalez/data_exoplasim/TOI-700d_crashed’: File exists
Traceback (most recent call last):
...
File ".../exoplasim_test.py", line 25, in <module>
toi700d.run(years=10, crashifbroken=True)
...
RuntimeError: ExoPlaSim has crashed or begun producing garbage. All working files have been moved to /home/bgonzalez/TOI-700d_crashed
It's worth mentioning that this "Permission Denied" issue is occurring for the first time and is specific to this cluster. The cluster employs SLURM (Simple Linux Utility for Resource Management), an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters.
As I'm not an administrator on this cluster, and the permissions on the directories seem correct, I am seeking your expertise to understand the potential reasons behind these errors and any possible solutions. Do you think the SLURM system might be contributing to these issues? It's important to note that this problem has not arisen on other clusters.
Following up on this, I have a few more questions to deepen my understanding of the situation:
Familiarity with the Error: Have you encountered similar "Permission Denied" errors in ExoPlaSim simulations before, specifically related to the SLURM system or similar? If so, could you share any insights or common causes you might be aware of?
Execution Process and pyfft: Could you provide a brief overview of the typical execution process of ExoPlaSim, especially concerning the interaction with the pyfft module? Understanding how ExoPlaSim handles permissions during execution might shed light on the current issue.
Correlation with pyfft: Given that the cluster installation was performed remotely by another individual, is there a possibility that an incorrect configuration or compilation of the pyfft module could contribute to the "Permission Denied" errors observed in the simulation? If so, are there specific aspects of pyfft that I should verify or troubleshoot?
Process for Correct pyfft Compilation: If the pyfft module is indeed a potential source of the problem, could you outline the correct process for compiling pyfft within the ExoPlaSim framework? This information would be beneficial for ensuring that the compilation was done accurately. Previously, as you recommended, we used the following steps:
a) going to the installation directory (the directory returned by exoplasim.__path__),
b) running: f2py3 -c -m --f90exec=gfortran --f77exec=gfortran --f90flags="-O3" pyfft pyfft.f90 && mv pyfft.cpython*.so pyfft.so
c) running: f2py3 -c -m --f90exec=gfortran --f77exec=gfortran --f90flags="-O3" pyfft991 pyfft991.f90 && mv pyfft991.cpython*.so pyfft991.so
I appreciate your time and willingness to assist. I would also like to express my gratitude for your previous support, which allowed me to successfully simulate planets. However, given the change in clusters and the transition to the final atmospheric model for my research, I am reaching out for your guidance once again.
Thank you for your ongoing support.
Best regards,
His response: Hi Benji,
Your compilation process looks correct; the problem is likely where exoplasim is getting installed in your system and the permissions you as a user have. That appears to be the only problem; I don't see anything more complicated going on than that. The compilation failed because the process attempting to compile and then run it didn't have permission to actually run the compilation. I haven't run into this with ExoPlaSim, but I have with other packages. You may need to ask your cluster sysadmin for help. In the meantime, one thing to try is to use the --user flag with pip install. This tells pip to install the package not in the system python tree (which often requires root permissions to modify), but in the user's package tree within the home folder. That ought to already be the default behaviour, though, at least I believe it is in most recent versions of Python.
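Before contacting the sysadmin, it may help to confirm that the installation directory really is read-only for your account. The check below is a minimal sketch using only the standard library; access_report is a hypothetical helper name, and the directory you pass in is whatever exoplasim.__path__ reports on your system.

```python
import os
import sys
from pathlib import Path


def access_report(directory):
    """Summarise what the current user may do with `directory`."""
    directory = Path(directory)
    return {
        "exists": directory.is_dir(),
        # Write permission is needed to create files such as 'firstrun'.
        "writable": os.access(directory, os.W_OK),
        # Search (execute) permission is needed to reach files inside it.
        "searchable": os.access(directory, os.X_OK),
    }


if __name__ == "__main__":
    for arg in sys.argv[1:]:
        print(arg, access_report(arg))
```

For the log above, that would mean passing /opt/Python/Python-3.11.4/lib/python3.11/site-packages/exoplasim as the argument. If "writable" comes back False, the touch and redirect failures in configure.sh are what you would expect, and installing into a location you own is probably the way forward.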
Here's your typical ExoPlaSim workflow:
1. Compile the model if necessary.
2. Take user input and assemble the full set of simulation parameters, and write them to namelists for the Fortran core.
3. Run the Fortran core, simulating 1 year (unless otherwise set in the simulation parameters).
4. Run pyburn, which reads the raw binary output from the Fortran core, and uses pyfft to transform variables given in spherical harmonics to grid space, computes various derived variables, and then writes the results into the chosen data format.
5. Return to Step 3 unless the stop condition has been reached.
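The control flow of that workflow can be sketched as a small driver loop. This is not ExoPlaSim's actual code: run_workflow, the keys of the steps dictionary, and the stop callback are all invented for illustration, and the real stages are stood in for by caller-supplied functions.

```python
def run_workflow(max_years, steps, needs_compile=True,
                 stop=lambda years_done: False):
    """Drive the five-step loop with caller-supplied callables.

    `steps` maps 'compile', 'namelists', 'core' and 'postprocess'
    to functions. Returns the number of simulated years completed.
    """
    if needs_compile:
        steps["compile"]()                # step 1: build the model
    steps["namelists"]()                  # step 2: write parameters
    years_done = 0
    while years_done < max_years:
        raw = steps["core"](years_done)   # step 3: one model year
        steps["postprocess"](raw)         # step 4: pyburn + pyfft stage
        years_done += 1
        if stop(years_done):              # step 5: stop condition
            break
    return years_done


if __name__ == "__main__":
    trace = []
    demo_steps = {
        "compile": lambda: trace.append("compile"),
        "namelists": lambda: trace.append("namelists"),
        # append() returns None, so `or` hands back the fake raw-file name
        "core": lambda year: trace.append(f"core {year}") or f"raw{year}",
        "postprocess": lambda raw: trace.append(f"postprocess {raw}"),
    }
    print(run_workflow(2, demo_steps), trace)
```

The point of the sketch is that pyfft only comes into play in step 4, after the Fortran core has already run for a year, which is why a missing pyfft.so shows up as a postprocessing error rather than as a failure to start the model.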
The pyfft module is literally a Fortran module containing the FFTs that PlaSim uses to do the spherical transforms between spectral space and grid space, with additional decorators that can be understood by numpy's f2py utility. At configuration time, we use f2py to compile the Fortran code into an importable python library. This shouldn't be contributing to or causing permissions errors; those errors are caused by something upstream. Hopefully your sysadmin can help you figure out what the issue is and where best to install ExoPlaSim.
Hope that helps, Adiv
Re: pyfft not compiling - Try moving the module name in the script so that it sits right behind -m; that fully resolved the compilation failure for me (#16), like so:
python$pyversion -m numpy.f2py -c -m pyfft --f90exec=gfortran --f77exec=gfortran --f90flags="-O3" pyfft.f90 && mv pyfft.cpython*.so pyfft.so
python$pyversion -m numpy.f2py -c -m pyfft991 --f90exec=gfortran --f77exec=gfortran --f90flags="-O3" pyfft991.f90 && mv pyfft991.cpython*.so pyfft991.so
Otherwise running f2py off the python commandline complains about there not being a module name passed to it, and this seems to make it happy....
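For anyone scripting this, building the command as an argument list keeps the module name pinned directly after the f2py -m flag. The sketch below is an illustration only: f2py_command is a hypothetical helper, and the flags are simply the ones quoted in this thread.

```python
import shlex
import sys


def f2py_command(module, python=sys.executable, compiler="gfortran",
                 flags="-O3"):
    """Build the f2py call with the module name right after '-m'."""
    return [
        python, "-m", "numpy.f2py",   # run f2py through the interpreter
        "-c", "-m", module,           # module name directly follows -m
        f"--f90exec={compiler}",
        f"--f77exec={compiler}",
        f"--f90flags={flags}",
        f"{module}.f90",
    ]


if __name__ == "__main__":
    for name in ("pyfft", "pyfft991"):
        print(shlex.join(f2py_command(name)))
```

The resulting list can be handed to subprocess.run(cmd, cwd=install_dir, check=True), where install_dir is the ExoPlaSim installation directory; the rename to pyfft.so would still need to follow as a separate step.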
Here is an email thread I had with Adiv that helped me get ExoPlaSim running, in case it is useful for someone. I understand that ExoPlaSim has since been updated to version 3.3. I finally managed to make it work on Python 3.11.6.
Here is the thread: Dear Adiv,
I hope this email finds you well. I am writing to you as a master's thesis student in astronomy at the University of Concepción, Chile, working on a study of the spectroscopic effect of atmospheric dynamics on aquaplanets. I have encountered a problem while using ExoPlaSim and would greatly appreciate your assistance in resolving it.
I have been experiencing compatibility issues and missing module errors during the execution of ExoPlaSim. Here is an extract of the error message I encountered:
At line 799 of file plasim.f90
Fortran runtime warning: An array temporary was created for argument '_formal_183' of procedure 'mpputgp'
No module named 'exoplasim.pyfft'
No module named 'exoplasim.pyfft'
Error writing output to MOST.00000.nc; log written to postprocess.log
Going to stop here just in case......
ExoPlaSim has crashed or begun producing garbage. All working files have been moved to /home/gema/ExoPlaSim-master/mymodel_aquaplanet_crashed/
mkdir: cannot create directory ‘/home/gema/ExoPlaSim-master/mymodel_aquaplanet_crashed’: File exists
mv: cannot stat '/home/gema/ExoPlaSim-master/mymodel_aquaplanet_testrun/*': No such file or directory
Traceback (most recent call last):
  File "/home/gema/ExoPlaSim-master/exoplasim/__init__.py", line 1112, in postprocess
    pyburn.postprocess(inputfile,inputfile+self.extension,logfile=log,namelist=namelist,
  File "/home/gema/ExoPlaSim-master/exoplasim/pyburn.py", line 3685, in postprocess
    data = dataset(rawfile, variables, mode=mode,radius=radius,gravity=gravity,gascon=gascon,
  File "/home/gema/ExoPlaSim-master/exoplasim/pyburn.py", line 1389, in dataset
    rawdata = readfile(filename)
  File "/home/gema/ExoPlaSim-master/exoplasim/pyburn.py", line 594, in readfile
    import exoplasim.pyfft as pyfft
ModuleNotFoundError: No module named 'exoplasim.pyfft'
I have made several attempts to address this issue based on suggestions I found. Initially, I noticed a compatibility problem with numpy version 1.14, so I tried to uninstall both numpy and exoplasim and then installed numpy version 1.22. I also attempted to use virtual environments to ensure compatibility. However, these efforts did not resolve the issue.
Additionally, I came across a similar error in a forum discussion (https://worldbuildingpasta.blogspot.com/2021/11/an-apple-pie-from-scratch-part-vi.html), where I followed the recommendation to add some files to the directory "home/gema/.local/bin" to address compatibility problems. Unfortunately, this solution did not work for me either. I have included a zip file with the mentioned files and my two versions of the code (one for "gema" and another for "benji").
In some cases, the compatibility error did not appear when executing the exo.run() command. However, at that point, the console did not display any output. I observed CPU activity using htop, which showed that calculations were being performed.
To provide you with more context, I ran the program on two different computers. "gema" has 32 cores with a clock speed of 4.2GHz, while "benji" has 8 cores with a clock speed of 3.2GHz. I used the code configured to use 12 cores on "gema" and 4 cores on "benji".
Given the unsuccessful attempts to resolve the issue using available resources, including reading the PlaSim and ExoPlaSim manuals, I am reaching out to you for assistance. I kindly request your guidance in resolving this compatibility error and missing module issue.
Thank you very much for your attention and support. I appreciate your time and expertise in helping me overcome this obstacle.
Kindly,