conradtchan / starfit

Fit stellar abundance measurements to stellar models

Genetic Algorithm runs forever #13

Closed conradtchan closed 1 year ago

conradtchan commented 1 year ago

starfit.Ga sometimes never finishes running, despite time_limit being set.

ThomasNordlander commented 1 year ago
$ python3.10
Python 3.10.5 (main, Jun  7 2022, 08:39:11) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin

Test script, taken from the README:

thomasn@rsaa-068080:~$ cat test_GA.py 
import starfit
s = starfit.Ga(
    filename = 'HE1327-2326.dat',
    db = 'znuc2012.S4.star.el.y.stardb.gz',
    combine = [[6, 7, 8]],
    z_max = 30,
    z_exclude = [3, 24, 30],
    z_lolim = [21, 29],
    upper_lim = True,
    cdf = True,
    time_limit=20,
    sol_size=3,
)

This works roughly 1 in 3 times, but otherwise runs forever. The output is exactly the same apart from the timing lines, and the result is the same regardless of what I set time_limit to. This is what it looks like for me:

thomasn@rsaa-068080:~$ time python3.10 test_GA.py 
 [BBNAbu] Loading /Users/thomasn/Library/Python/3.10/lib/python/site-packages/starfit/data/ref/bbnf02.dat (1.5 kiB)
 [BBNAbu]  16 isotopes loaded in 1.2 ms.
 [Star] Loading /Users/thomasn/Library/Python/3.10/lib/python/site-packages/starfit/data/stars/HE1327-2326.dat (3.0 kiB)
 [Star]  Star loaded and converted in 2.8 ms.
 [TrimDB] Loading /Users/thomasn/Library/Python/3.10/lib/python/site-packages/starfit/data/db/znuc2012.S4.star.el.y.stardb.gz
 [TrimDB] File size: 12 MiB (compressed).
 [TrimDB] Not swapping endian.
 [TrimDB] file integrity seems OK
 [TrimDB] Data version:  10100
 [TrimDB] data set name: znucS4.mixlib.el
 [TrimDB] ==========================================================
 [TrimDB] COMMENT: znuc S=4 star data set
 [TrimDB] ==========================================================
 [TrimDB] data sets:       17640
 [TrimDB] abundance sets:     83
 [TrimDB] ----------------------------------------------------------
 [TrimDB] abundance type:  2 - element
 [TrimDB] abundance class: 2 - dec (stable subset of radiso)
 [TrimDB] abundance unit:  4 - mol fraction (YPS)
 [TrimDB] abundance total: 1 - ejecta
 [TrimDB] abundance norm:      (NONE)
 [TrimDB] abundance data:  0 - all ejecta (SN ejecta + wind)
 [TrimDB] abundance sum:   1 - number fraction
 [TrimDB] ----------------------------------------------------------
 [TrimDB] 4 data fields: 
 [TrimDB] mass    [    solar masses] (DOUBLE) <parameter>
 [TrimDB] energy  [               B] (DOUBLE) <parameter>
 [TrimDB] mixing  [He core fraction] (DOUBLE) <parameter>
 [TrimDB] remnant [           M_sun] (DOUBLE) <property>
 [TrimDB] ----------------------------------------------------------
 [TrimDB] ABUNDANCES:
 [TrimDB] H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca
 [TrimDB] Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr Rb
 [TrimDB] Sr Y Zr Nb Mo Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs
 [TrimDB] Ba La Ce Pr Nd Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta
 [TrimDB] W Re Os Ir Pt Au Hg Tl Pb Bi Th U
 [TrimDB] ----------------------------------------------------------
 [TrimDB] PARAMETER RANGES:
 [TrimDB] mass:         9.6 ...   100.0 (126 values)
 [TrimDB] energy:       0.3 ...    10.0 ( 10 values)
 [TrimDB] mixing:   0.00000 ... 0.25119 ( 14 values)
 [TrimDB] ----------------------------------------------------------
 [TrimDB] PROPERTY RANGES:
 [TrimDB] remnant:    1.195 ...  52.949 (941 values)
 [TrimDB] ----------------------------------------------------------
 [TrimDB] PARAMETER VALUES:
 [TrimDB] mass:
 [TrimDB]    9.6   9.7   9.8   9.9  10.0  10.1  10.2  10.3  10.4
 [TrimDB]   10.5  10.6  10.7  10.8  10.9  11.0  11.1  11.2  11.3
 [TrimDB]   11.4  11.5  11.6  11.7  11.8  11.9  12.0  12.2  12.4
 [TrimDB]   12.6  12.8  13.0  13.2  13.4  13.6  13.8  14.0  14.2
 [TrimDB]   14.4  14.6  14.8  15.0  15.2  15.4  15.6  15.8  16.0
 [TrimDB]   16.2  16.4  16.6  16.8  17.0  17.1  17.2  17.3  17.4
 [TrimDB]   17.5  17.6  17.7  17.8  17.9  18.0  18.1  18.2  18.3
 [TrimDB]   18.4  18.5  18.6  18.7  18.8  18.9  19.0  19.2  19.4
 [TrimDB]   19.6  19.8  20.0  20.5  21.0  21.5  22.0  22.5  23.0
 [TrimDB]   23.5  24.0  24.5  25.0  25.5  26.0  26.5  27.0  27.5
 [TrimDB]   28.0  28.5  29.0  29.5  30.0  30.5  31.0  31.5  32.0
 [TrimDB]   32.5  33.0  33.5  34.0  34.5  35.0  36.0  37.0  38.0
 [TrimDB]   39.0  40.0  41.0  42.0  43.0  44.0  45.0  50.0  55.0
 [TrimDB]   60.0  65.0  70.0  75.0  80.0  85.0  90.0  95.0 100.0
 [TrimDB] energy:
 [TrimDB]   0.3  0.6  0.9  1.2  1.5  1.8  2.4  3.0  5.0 10.0
 [TrimDB] mixing:
 [TrimDB]  0.00000 0.00100 0.00158 0.00251 0.00398 0.00631 0.01000
 [TrimDB]  0.01585 0.02512 0.03981 0.06310 0.10000 0.15849 0.25119
 [TrimDB] ----------------------------------------------------------
 [TrimDB] PROPERTY VALUES:
 [TrimDB] remnant:
 [TrimDB] (more than 100 values)
 [TrimDB] ----------------------------------------------------------
 [TrimDB] SHA1: 40e59b41fdc4c68dd236d3a1855aaa2ff1489a17
 [TrimDB] ----------------------------------------------------------
 [TrimDB] Data loaded in 59 ms.
 [SolAbu] Loading /Users/thomasn/Library/Python/3.10/lib/python/site-packages/starfit/data/ref/solas09_sol_surf_present.dat (6.7 kiB)
 [SolAbu] 287 isotopes loaded in 11 ms.
 [Ga] Combining elements:
 [Ga]     C+N+O
 [Ga] Matching 15 data points up to Z=30:
 [Ga]     H He Li C Na Mg Al Ca Ti Cr Mn Fe Co Ni Zn
 [Ga] with 8 upper limits in the data:
 [Ga]     H He Li Cr Mn Co Ni Zn
 [Ga] and 0 lower limits in the models:
 [Ga]     
 [Ga] Time limit: 20 s
^Z
[1]+  Stopped                 python3.10 test_GA.py

real    0m48.608s
user    0m0.000s
sys 0m0.002s

Pay no heed to the user timer being 0; it really is using one CPU at 100%.

2sn commented 1 year ago

@conradtchan has this been fixed? I thought so.

conradtchan commented 1 year ago

@2sn no, not fixed yet. I am still unable to reproduce the bug.

@ThomasNordlander what type of mac do you have? Is it Intel, or Apple Silicon?

ThomasNordlander commented 1 year ago

I have a 2021 Macbook pro M1 Pro, so Apple silicon.

I don't know if it helps but I tried also making a nice and fresh install on python 3.9 instead. Same issue.

2sn commented 1 year ago

I just ran it on my M1 Mac mini (2021) at home [current OS and Homebrew + pip]; it finished in 20 s the first time but got stuck the second time. This is the current distro, 0.6.0, installed using pip3. It does not terminate with ^C, so I suppose it is stuck within the Fortran section.

I was using ipython as shell.

2sn commented 1 year ago

@ThomasNordlander maybe user time is not reported correctly when it is spent in the Fortran module? (re: the earlier message about timing)

2sn commented 1 year ago

I actually have the same issue on my Linux workstation: the GA gets stuck. Installed from pip3. In my test it got stuck the first time I ran it; the second time it ran through. So I can confirm both that it is a recurrent but non-deterministic issue and that it is not related to the platform.

conradtchan commented 1 year ago

@ThomasNordlander @2sn which Fortran compiler do you have installed? I just tried running it on the starfit server (Fedora) and still could not reproduce it.

To see which compiler f2py is using:

f2py -c --help-fcompiler

In my case, it's the default compiler, so:

[fedora@starfit ~]$ cat /etc/fedora-release
Fedora release 36 (Thirty Six)
[fedora@starfit ~]$ gfortran --version
GNU Fortran (GCC) 12.2.1 20220819 (Red Hat 12.2.1-2)

On my mac:

❯ gfortran --version
GNU Fortran (Homebrew GCC 12.2.0) 12.2.0

ThomasNordlander commented 1 year ago

I'm lagging slightly behind you in GCC version:

$ /opt/local/bin/gfortran --version
GNU Fortran (MacPorts gcc12 12.1.0_6+stdlib_flag) 12.1.0

I also tried selecting gcc11

$ gfortran --version
GNU Fortran (MacPorts gcc11 11.3.0_4+stdlib_flag) 11.3.0

and reinstalled starfit with the same issue.

This is on macOS 12.6 Monterey, by the way.

$ uname -a
Darwin rsaa-068080 21.6.0 Darwin Kernel Version 21.6.0: Mon Aug 22 20:19:52 PDT 2022; root:xnu-8020.140.49~2/RELEASE_ARM64_T6000 arm64

@2sn I think user time is only reported after the process exits for some reason, but at least real time is correct. The process really is chugging through CPU time so I think this is just a reporting issue.

The issue is so stochastic that I have to run the code repeatedly to see whether it still fails. If I set time_limit=2, this works reasonably well:

parallel -j 1 --line-buffer --timeout=10 python3.10 test_GA.py ::: {1..10}

conradtchan commented 1 year ago

@ThomasNordlander Thanks for these details. I doubt it's an issue with the compiler then. The Fortran code is relatively simple, and doesn't handle any of the timing/loop control. All that is done in Python. I'll continue trying to reproduce the issue.

2sn commented 1 year ago

@conradtchan does not work on Linux:

~>f2py -c --help-fcompiler
Gnu95FCompiler instance properties:
  archiver        = ['/usr/bin/gfortran', '-cr']
  compile_switch  = '-c'
  compiler_f77    = ['/usr/bin/gfortran', '-Wall', '-g', '-ffixed-form', '-
                    fno-second-underscore', '-fPIC', '-O3', '-funroll-loops']
  compiler_f90    = ['/usr/bin/gfortran', '-Wall', '-g', '-fno-second-
                    underscore', '-fPIC', '-O3', '-funroll-loops']
  compiler_fix    = ['/usr/bin/gfortran', '-Wall', '-g', '-ffixed-form', '-
                    fno-second-underscore', '-Wall', '-g', '-fno-second-
                    underscore', '-fPIC', '-O3', '-funroll-loops']
  libraries       = ['gfortran']
  library_dirs    = ['/usr/lib/gcc/x86_64-redhat-linux/12',
                    '/usr/lib/gcc/x86_64-redhat-linux/12']
  linker_exe      = ['/usr/bin/gfortran', '-Wall', '-Wall']
  linker_so       = ['/usr/bin/gfortran', '-Wall', '-g', '-Wall', '-g', '-
                    shared']
  object_switch   = '-o '
  ranlib          = ['/usr/bin/gfortran']
  version         = LooseVersion ('12')
  version_cmd     = ['/usr/bin/gfortran', '-dumpversion']
Fortran compilers found:
  --fcompiler=gnu95  GNU Fortran 95 compiler (12)
Compilers available for this platform, but not found:
  --fcompiler=absoft   Absoft Corp Fortran Compiler
  --fcompiler=arm      Arm Compiler
  --fcompiler=compaq   Compaq Fortran Compiler
  --fcompiler=fujitsu  Fujitsu Fortran Compiler
  --fcompiler=g95      G95 Fortran Compiler
  --fcompiler=gnu      GNU Fortran 77 compiler
  --fcompiler=intel    Intel Fortran Compiler for 32-bit apps
  --fcompiler=intele   Intel Fortran Compiler for Itanium apps
  --fcompiler=intelem  Intel Fortran Compiler for 64-bit apps
  --fcompiler=lahey    Lahey/Fujitsu Fortran 95 Compiler
  --fcompiler=nag      NAGWare Fortran 95 Compiler
  --fcompiler=nagfor   NAG Fortran Compiler
  --fcompiler=nv       NVIDIA HPC SDK
  --fcompiler=pathf95  PathScale Fortran Compiler
  --fcompiler=pg       Portland Group Fortran Compiler
  --fcompiler=vast     Pacific-Sierra Research Fortran 90 Compiler
Compilers not available on this platform:
  --fcompiler=flang     Portland Group Fortran LLVM Compiler
  --fcompiler=hpux      HP Fortran 90 Compiler
  --fcompiler=ibm       IBM XL Fortran Compiler
  --fcompiler=intelev   Intel Visual Fortran Compiler for Itanium apps
  --fcompiler=intelv    Intel Visual Fortran Compiler for 32-bit apps
  --fcompiler=intelvem  Intel Visual Fortran Compiler for 64-bit apps
  --fcompiler=mips      MIPSpro Fortran Compiler
  --fcompiler=none      Fake Fortran compiler
  --fcompiler=sun       Sun or Forte Fortran 95 Compiler
For compiler details, run 'config_fc --verbose' setup command.
Removing build directory /tmp/tmpy9ynmfk5

maybe this suffices:

gcc (GCC) 12.2.1 20220819 (Red Hat 12.2.1-2)
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

2sn commented 1 year ago

and

~>gcc --version
Apple clang version 14.0.0 (clang-1400.0.29.102)
Target: arm64-apple-darwin21.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

2sn commented 1 year ago

~>gfortran --version
GNU Fortran (Homebrew GCC 12.2.0) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

2sn commented 1 year ago

@conradtchan Since I can reproduce the issue, I (and maybe Thomas) can try to debug this ourselves outside the ADACS project, since we are already over time.

ThomasNordlander commented 1 year ago

Yes, I'm sure we will figure this out :)

Thanks for all your help, @conradtchan !

2sn commented 1 year ago

@ThomasNordlander I made a new version, 0.6.3, that was intended to fix a different issue, but I did not get any hang-ups with it for your GA example.

2sn commented 1 year ago

Hmm, after a few tries I now actually did.

2sn commented 1 year ago

I think it may happen more frequently on the M1 because it is faster.

ThomasNordlander commented 1 year ago

@2sn I upgraded to 0.6.3, and still find that roughly every second run the timeout does not stop the calculation. Just in case, I also tried with Python 3.9 and with 11.3, and found the same results.

2sn commented 1 year ago

@ThomasNordlander Yes, we will have to debug this using print statements in the code. Likely the hang-up is in the Fortran module, since the process stops reacting to ^C, which it would respond to if it were running Python code.

2sn commented 1 year ago

This will be a bit tedious, as we'd have to re-build and re-install the Fortran module after each edit.
I did see it break on my M1 as well, just not the first two times.

ThomasNordlander commented 1 year ago

I hope that running it once with a million print statements will be enough to understand where it gets stuck. Or is there a way of doing this with a traceback or a debugger, or something that activates on a kill signal? (I also use the million-print-statement strategy; I don't know anything about real coding.)

2sn commented 1 year ago

@ThomasNordlander print is a good debugger. Since it hangs, a debugger may not be an option. Does ^C give you a traceback? Probably not much, if it is not compiled in debug mode.

ThomasNordlander commented 1 year ago

No, ^C does nothing for me either; I have to kill the process, and that naturally does not produce a traceback.
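
One trick that might help here (my suggestion, not something starfit ships): the stdlib faulthandler module can dump the interpreter's stack on a signal, so a hung run can be inspected from another shell without killing it.

```python
# Add near the top of test_GA.py; then, while the run is hung, run
#   kill -USR1 <pid>
# from another shell to print a Python traceback to stderr without
# terminating the process. Frames inside the compiled Fortran extension
# will not appear, but the blocking Python call will.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1)
```

If the traceback always shows the process sitting inside the f2py call, that would at least confirm the hang is on the Fortran side.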

2sn commented 1 year ago

@ThomasNordlander one needs to check whether #54 fixes it.

ThomasNordlander commented 1 year ago

@2sn I cloned the latest version and verified your fix is implemented. Unfortunately it ran forever already during pytest:

$ python -m pytest
================================================================ test session starts ================================================================
platform darwin -- Python 3.10.8, pytest-7.1.3, pluggy-1.0.0
rootdir: /Users/thomasn/starfit
collected 9 items                                                                                                                                   

tests/test_01_single.py .......                                                                                                               [ 77%]
tests/test_02_double.py .                                                                                                                     [ 88%]
tests/test_03_ga.py 

The bug is still stochastic though, so running a few more times it sometimes works fine:

$ python -m pytest
================================================================ test session starts ================================================================
platform darwin -- Python 3.10.8, pytest-7.1.3, pluggy-1.0.0
rootdir: /Users/thomasn/starfit
collected 9 items                                                                                                                                   

tests/test_01_single.py .......                                                                                                               [ 77%]
tests/test_02_double.py .                                                                                                                     [ 88%]
tests/test_03_ga.py .                                                                                                                         [100%]

================================================================ 9 passed in 22.92s =================================================================

2sn commented 1 year ago

I had been hoping some of the fixes (eliminating duplicate stars) would fix it. The failure seems rare per run, and hence only occurs frequently on fast machines. Thomas, could you please check whether it occurs when you set cdf=False?

2sn commented 1 year ago

@ThomasNordlander

ThomasNordlander commented 1 year ago
import starfit
s = starfit.Ga(
    filename = 'HE1327-2326.dat',
    db = 'znuc2012.S4.star.el.y.stardb.gz',
    combine = [[6, 7, 8]],
    z_max = 30,
    z_exclude = [3, 24, 30],
    z_lolim = [21, 29],
    upper_lim = True,
    cdf = False,
    time_limit=2,
    sol_size=3,
)

This still runs forever. Same if I set sol_size=2.

I looked a little bit in the source code, and it is indeed the call to psolve in fitness() that causes the freeze. I noticed this usually happens when at least one value of c(i,:) > 1. In psolve, x = atanh(c * 2.d0 - 1.d0) is calculated and then fed into uobyqa. I didn't read enough to understand what exactly c is, but I imagine it is the dilution factor? If so, a special case for unphysical dilutions is necessary to avoid NaNs.
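
The NaN mechanism is easy to verify in isolation. A NumPy sketch (not starfit code): arctanh is finite only for arguments in (-1, 1), so x = atanh(c * 2 - 1) requires 0 < c < 1, and any c >= 1 hands the solver an inf or NaN starting point:

```python
import numpy as np

# atanh(2c - 1) is finite only for 0 < c < 1; c = 1 gives inf, c > 1 gives NaN.
c = np.array([0.5, 0.999, 1.0, 4.2])
with np.errstate(divide="ignore", invalid="ignore"):
    x = np.arctanh(c * 2.0 - 1.0)
print(x)  # [finite, finite, inf, nan]
```

If uobyqa is handed such a starting point, it is plausible that its trust-region loop never terminates.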

2sn commented 1 year ago

Hmm, so I assume one could just limit c or the expression in atanh to 1-eps.
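
Expressed in Python for clarity (a sketch of the idea only; the real guard would live in the Fortran psolve):

```python
import numpy as np

def to_solver_space(c, eps=1e-12):
    """Clip c into [eps, 1 - eps] so atanh(2c - 1) stays finite."""
    c = np.clip(np.asarray(c, dtype=float), eps, 1.0 - eps)
    return np.arctanh(c * 2.0 - 1.0)

print(to_solver_space([0.0, 0.5, 1.0, 4.2]))  # all entries finite
```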

2sn commented 1 year ago

@ThomasNordlander very useful. I explored with this debugging code in psolve:

    write(6, "(A1)", advance='no') '.'

    if (any(c >= 1.d0)) then
       print*, '[psolve] DEBUG IN: c = ', c
       error stop 'c >= 1'
    endif

    !Convert offsets to solver space
    x = atanh(c * 2.d0 - 1.d0)

    !Call solver
    call uobyqa(nstar, x, rhobeg, rhoend, iprint, calls)
    !Convert solver space to offsets
    c = 0.5d0 * (1.d0 + tanh(x))

    if (any(c >= 1.d0)) then
       print*, '[psolve] DEBUG OUT: c = ', c
       error stop 'c >= 1'
    endif

end subroutine psolve

and get

 [Ga] Time limit: 20 s
......................................................................................... [psolve] DEBUG IN: c =    1.8722132635470222E-005   112208.66368774253        660413.62256249995     
ERROR STOP c >= 1

or

 [Ga] Time limit: 20 s
..... [psolve] DEBUG IN: c =   0.78321720745061418        4.2175812632876211        5.9357056476588794E-005
ERROR STOP c >= 1

which implies the bad values are not created by the solver but are already present in the initial generation made by the GA setup.

2sn commented 1 year ago

The bug is rather subtle. In ga.py when the initial weights are computed,

        # Generate initial fitness values
        self.f = _fitness(
            self.trimmed_db,
            self.eval_data,
            self.exclude_index,
            self.s,
            fixed_offsets=fixed_offsets,
            ejecta=self.ejecta,
            cdf=cdf,
            ls=self.local_search,
        )

the last line, which sets ls, was missing.
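
A toy illustration of why that bites (my own sketch, not the starfit code; names and defaults are made up): omitting a keyword argument raises no error, the callee just silently uses its default, so the initial population ends up evaluated with different settings than later generations.

```python
# Toy version of the bug (hypothetical names/defaults): forgetting to pass
# ls= is silent -- the default is used, so the "initial generation" call
# behaves differently from the main loop that passes ls explicitly.
def _fitness(pop, ls=False):
    return [(x, "local-search" if ls else "raw") for x in pop]

initial = _fitness([1, 2])         # ls forgotten -> default False
later = _fitness([1, 2], ls=True)  # main loop passes ls explicitly
print(initial[0][1], later[0][1])  # prints: raw local-search
```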

2sn commented 1 year ago

fixed by #59

2sn commented 1 year ago

@ThomasNordlander this should be fixed in starfit release 0.8.0.

ThomasNordlander commented 1 year ago

Brilliant, all good now in 0.8.0! Thanks!

2sn commented 1 year ago

I updated the documentation in version 0.8.1. Let me know if you spot any typos (or just make a pull request to fix them).

A next task will be to get useful data sets from colleagues. Anything you can suggest?