POSYDON-code / POSYDON

POSYDON is a next-generation single and binary-star population synthesis code incorporating full stellar structure and evolution modeling with the use of MESA.
BSD 3-Clause "New" or "Revised" License

Population Analysis Framework Refactor #267

Closed · maxbriel closed this 3 months ago

maxbriel commented 7 months ago

This addresses a few issues with the SyntheticPopulation class that I ran into while selecting and analysing a different (specifically larger) population than the BBH population. (This PR replaces PR #200.)

  1. Being unable to select just GRB systems
  2. Being unable to process the population in memory

This PR drastically changes the workflow of the population analysis to provide much greater flexibility for the user, i.e. being able to look at populations other than BBHs and GRBs. Some of the old code is reused, but much of it has been rewritten. I summarise the changes in the following sections.

1. Running populations

Setting up multi-metallicity job arrays

Population runs can be set up using a new command, setup-popsyn "file.ini". This will create a set of scripts and a slurm_submit.sh file based on the provided ini file. For each metallicity, a job array with the population run will be submitted, together with a dependent job that merges the output of the job array.

Additional metadata

In the merge step, the total simulated mass and underlying mass are now calculated and stored in the combined file. The metallicity and the number of systems are also stored.

Running populations locally

This can still be done, but it has been moved to a dedicated class called PopulationRunner.
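
For reference, a minimal local-run sketch (the import path and constructor argument are assumptions on my part; the updated tutorial shows the definitive usage):

from posydon.popsyn.synthetic_population import PopulationRunner

poprun = PopulationRunner('population_params.ini')  # placeholder ini file name
poprun.evolve()                                     # evolve the binaries defined in the ini file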

2. Population analysis

Each class provides an interface to components in a population file. Only when the user requests specific data does it get loaded into memory.

The Population class

The SyntheticPopulation class has been replaced with a Population class. This class can interface with the history and oneline dataframes in the file without loading all the data. Moreover, you can export a selection of binaries using your own selection criteria and use this to combine multiple metallicities together.

Furthermore, you can export any TransientPopulation using your own transient selection function. I've included ones for BBHs and GRBs.

Note: If a population file comes from an old population run, you can make it work in the new analysis framework by calling it once with the ini file and its metallicity, i.e. Population(pop_file, metallicity=met_in_solar_units, ini_file=ini_file). This calculates the metadata; afterwards you can simply open the population with Population(pop_file).
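
A short usage sketch of the above (the import path is an assumption and the file names are placeholders):

from posydon.popsyn.synthetic_population import Population

# One-time call for a file from an old run: computes and stores the metadata.
pop = Population('old_population.h5', metallicity=1.0, ini_file='population_params.ini')  # 1.0 = solar, placeholder
# Afterwards the file can be opened directly; data is only loaded when requested.
pop = Population('old_population.h5')
print(pop.mass_per_metallicity)                  # metadata written during the merge step
pop.export_selection([0, 5, 42], 'selected.h5')  # export a (placeholder) selection of binary indices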

To-Do:

TransientPopulation class

The df_synthetic dataframe has been replaced by a class that interfaces with the transient population in the file. The underlying dataframe is similar to before, but it requires time and metallicity columns to be present. The transient population is created using a selection function, which can be customised. This class inherits from the Population class. Plotting of delay-time distributions and of efficiencies over metallicity is possible.
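
For illustration, a rough sketch of what such a custom selection function could look like (the call signature and column names are assumptions for this sketch, not the definitive API; see transient_select_funcs.py below for the shipped defaults):

import pandas as pd

def my_transient_selection(history_chunk, oneline_chunk):
    # Illustrative cut: keep binaries whose final state is a BH-BH system.
    dco = oneline_chunk[(oneline_chunk['S1_state'] == 'BH')
                        & (oneline_chunk['S2_state'] == 'BH')]
    out = pd.DataFrame(index=dco.index)
    out['time'] = dco['time']                # required column
    out['metallicity'] = dco['metallicity']  # required column
    out['S1_mass'] = dco['S1_mass']          # any extra quantities of interest
    return out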

Rates

This class is an interface to the intrinsic and observable rates. It allows for the creation of specific observability selection functions and for plotting histograms weighted by the SFH rates. It inherits from the TransientPopulation class.

transient_select_funcs.py

A new submodule which contains the default transient selection functions and the DCO detection probability calculation.

3. Tutorials

Due to the drastic change in the analysis framework, the tutorials required a rewrite too. This means that I reran some example populations, which will need to be replaced on either cluster.

The tutorials have all been updated. Additionally, I can add some more tutorials/notebooks for specific use cases, like those in the tutorial session.

4. Documentation

I've added documentation for the new classes and some for SimulationProperties and BinaryPopulation.

astroJeff commented 7 months ago

Update (Feb 7): @maxbriel to present a tutorial/training session after spring break on this new functionality. There were a few comments on variable/function naming. More testing is needed. Unit tests are needed, too. We will wait to accept the PR until it has been widely tested/used.

astroJeff commented 5 months ago

Update (Apr 11): Pop synth tutorial to be led by @maxbriel today. We want others to test this and use it to generate populations, looking for bugs/problems.

mkruckow commented 5 months ago

I'd suggest turning on the verbose output in the first tutorial, which evolves 10 binaries. That way a user who does not yet know POSYDON can see more of what is happening, instead of getting no output at all from poprun.evolve().

mkruckow commented 5 months ago

When you call pop.history_lengths, the name of the second column should be "length" instead of "index". At the moment both columns are called "index", which is confusing.

mkruckow commented 5 months ago

When you call pop.formation_channels, you get both channel_debug and channel. What is the difference between them? And shouldn't we get channel_debug only if we are in a debug mode?

mkruckow commented 5 months ago

Is it possible to get more options for pop.export_selection(indices, 'selected.h5')? Usually I'd like to select not on index but on other things, as with the select function. Hence, wouldn't it be easy to add to export_selection a way to pass arguments through the select function, get the indices, and use them for creating the selection file? Then one could use all the options of the select function in export_selection as well.
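
As a sketch of what I mean (the wrapper and the exact location of the select function are hypothetical):

def export_with_select(pop, filename, **select_kwargs):
    selected = pop.select(**select_kwargs)      # reuse all the select() options
    indices = selected.index.unique().tolist()  # binary indices of the matching rows
    pop.export_selection(indices, filename)     # write only those binaries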

mkruckow commented 5 months ago

When calling selected.mass_per_metallicity on a selected population, it might be nice to get an additional column containing the total mass of the selection.

mkruckow commented 5 months ago

In rates.plot_hist_properties, we'd need to generate axis labels based on the chosen quantity to plot.

mkruckow commented 5 months ago

Just another small remark: we may consider renaming setup-popsyn to posydon-setup-popsyn, as we have for running the grids and the pipeline.

mpgalleg commented 5 months ago

I tried to follow the tutorial and ran into an error when trying to calculate the observable population of BBH mergers using a population run with NN. I did not run into this problem when using a population run with IF.

ValueError: Input X contains NaN.
KNeighborsRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Seems like it is breaking when trying to predict points in between the sensitivity grid points? @ZepeiX also had a similar error, but I do not know if it was with an NN or IF population.

Update: the problem is happening because some S1_spin and S2_spin values are NaN, which causes the chi_eff values to be NaN. These NaNs are then passed to the function predict_pdet in selection_effects.py.

It is happening to 142/41882 merging binary black holes. These BBH mergers are not specific to a particular formation channel or metallicity. They come from the formation channels 'ZAMS_oRLO1_CC1_oRLO2_CC2_END', 'ZAMS_oRLO1-contact_CC1_oRLO2_CC2_END', 'ZAMS_oRLO1_CC1_oRLO2_oCE2_CC2_END', and 'ZAMS_CC1_oRLO2_oCE2_CC2_END', with metallicities 0.1, 0.01, 0.001, and 0.2.
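
For now, a small hedged workaround sketch that skips these systems before the detection-probability call (the bbh dataframe and its column name are placeholders; this is not necessarily the proper fix):

import numpy as np

finite = np.isfinite(bbh['chi_eff'])                            # mask out NaN effective spins
print(f"dropping {(~finite).sum()} of {len(bbh)} BBH mergers")  # 142/41882 in my case
bbh = bbh[finite]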

mkruckow commented 5 months ago

I have seen an error in the logs when running a population:

Traceback (most recent call last):
  File "/srv/astro/projects/posydon/popsyn_runs/software/POSYDON_shared_240419_max_synpop_population/posydon/popsyn/binarypopulation.py", line 492, in combine_saved_files
    store.append('oneline', oneline,
  File "/srv/astro/projects/posydon/popsyn_runs/software/conda_env_shared_240419/lib/python3.11/site-packages/pandas/io/pytables.py", line 1317, in append
    self._write_to_group(
  File "/srv/astro/projects/posydon/popsyn_runs/software/conda_env_shared_240419/lib/python3.11/site-packages/pandas/io/pytables.py", line 1858, in _write_to_group
    s.write(
  File "/srv/astro/projects/posydon/popsyn_runs/software/conda_env_shared_240419/lib/python3.11/site-packages/pandas/io/pytables.py", line 4397, in write
    table = self._create_axes(
            ^^^^^^^^^^^^^^^^^^
  File "/srv/astro/projects/posydon/popsyn_runs/software/conda_env_shared_240419/lib/python3.11/site-packages/pandas/io/pytables.py", line 4113, in _create_axes
    new_table.validate(self)
  File "/srv/astro/projects/posydon/popsyn_runs/software/conda_env_shared_240419/lib/python3.11/site-packages/pandas/io/pytables.py", line 3446, in validate
    raise ValueError(
ValueError: invalid combination of [values_axes] on appending data [name->S2_m_disk_accreted,cname->S2_m_disk_accreted,dtype->float64,kind->float,shape->(1, 1000)] vs current table [name->S2_m_disk_accreted,cname->S2_m_disk_accreted,dtype->bytes24,kind->string,shape->None]

I think we simply missed defining the dtype of the m_disk_accreted columns, so we get different automatically inferred ones. There might be more columns that have no well-defined dtype.
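
This matches the pandas/pytables behaviour when the same column is inferred as a string in one batch and as a float in another. A minimal reproduction sketch (not the actual POSYDON code; the all-string first batch is just my assumption of how it happens):

import pandas as pd

a = pd.DataFrame({'S2_m_disk_accreted': ['nan']})  # first batch: column inferred as strings
b = pd.DataFrame({'S2_m_disk_accreted': [0.1]})    # second batch: proper float64 column
with pd.HDFStore('repro.h5', mode='w') as store:
    store.append('oneline', a)
    store.append('oneline', b)  # raises the same "invalid combination of [values_axes]" ValueError
# Casting the column to a well-defined dtype before the first write avoids the mismatch:
# a['S2_m_disk_accreted'] = a['S2_m_disk_accreted'].astype('float64')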

astroJeff commented 5 months ago

Update (Apr 25): We discussed this PR. @mpgalleg to look through this PR. @dimsour94 and @ZepeiX to continue looking through PR.

One point brought up by @ZepeiX - it takes a long time to combine the pandas data frames at the end of the simulation (~1 hr per metallicity, but run in parallel). @ZepeiX and @mpgalleg to benchmark a population run of 10^6.

maxbriel commented 4 months ago

> One point brought up by @ZepeiX - it takes a long time to combine the pandas data frames at the end of the simulation (~1 hr per metallicity, but run in parallel). @ZepeiX and @mpgalleg to benchmark a population run of 10^6.

This is something I mentioned before. The running of the population is optimised, but the merging is not. Essentially, the more files you split your run into, the longer the merging takes. However, the increase is disproportionate to the change in the number of parallel jobs. It should be fixed, but it is not essential for this PR.

The main issue arises from pandas' implementation of appending to a DataFrame in an HDF5 file with format='table' and data_columns=True. This format is required to enable querying and to append dataframes to the file. However, appending drastically increases the duration, especially for wide dataframes like POSYDON's. data_columns=True is not used in the development branch but was intended to be, since there were discussions about selecting only specific binaries from the file.

A baseline example:

import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0,100,size=(10000, 80)), columns=[f'col_{i}' for i in range(80)])
mode='a'

Single write

df.to_hdf('data.h5', key='df', mode=mode, format='table', data_columns=True)

Duration: 685 ms

Append write

df.to_hdf('data_append.h5', key='df', format='table', data_columns=True)
df.to_hdf('data_append.h5', append=True, key='df', format='table', data_columns=True)

Duration: 5.49 s

This is roughly a 10x increase with just a single append, and it becomes worse with every append: 2 appends take ~19 s and 3 appends take ~29 s, so the cost keeps growing faster than linearly.

There are two directions to consider:

  1. Rework how we store these dataframes. This is quite a large time commitment and might not be worth it at this time; it is well outside the scope of this PR.
  2. Give the merge process a larger memory allowance, concatenate all dataframes in memory, and perform a single write.

Here's an example for the concat + write:

Concat + write

df_out = pd.concat([df,df], axis=0, ignore_index=True, sort=False)
df_out.to_hdf('data_append.h5', key='df', format='table', data_columns=True)

Duration: 882 ms

This increase is less than with pandas' hdf append, but requires a larger memory footprint.

| number of dfs | time concat | time to_hdf() |
| --- | --- | --- |
| 1 | 0.7 s | 0.7 s |
| 2 | 0.8 s | 5 s |
| 3 | 1.4 s | 19 s |
| 4 | 1.7 s | 29 s |

Memory usage

Writing 3 dataframes of different sizes to file; the to_hdf() duration remains roughly constant except for the largest size.

| df size | mem concat peak | mem to_hdf() peak | time concat | time to_hdf() |
| --- | --- | --- | --- | --- |
| (100x80) | 149.11 MiB | 151 MiB | 0.6 s | 17.5 s |
| (1000x80) | 158.81 MiB | 160.62 MiB | 0.6 s | 17.5 s |
| (10000x80) | 309.72 MiB | 339.59 MiB | 1.37 s | 18.9 s |
| (100000x80) | 901.38 MiB | 605.34 MiB | 19 s | 32.5 s |

celiotine commented 4 months ago

Maybe I am missing something, but what exactly about the merging is different in this PR compared to the current development branch? I am finding that the merging completes in a reasonable time using the current development branch.

maxbriel commented 4 months ago

I've gone through the comments on this PR. They are either implemented or I've left a comment explaining why not. If an issue did not arise from this PR, I created an issue describing the problem as best as I could.

The major issue that came up was the duration of merging files after a run and of exporting a selection.

Context: The issue arises from pandas when data_columns=True is enabled. When one appends to an HDF5 file with this setting, the append takes a very long time compared to the initial write, probably due to the creation of a hash table. data_columns is required to query specific entries in the HDF5 file without having to read in all the data. This was implemented in BinaryPopulation.save(), but not properly propagated through the code in the development branch. On this branch, it was correctly propagated to the file merging, which gave rise to this issue. (@celiotine)

Solution(s): Several solutions that might work:

  1. Maximise the read-ins and minimise the write-outs. This means loading multiple files into memory until a limit is reached and then writing to file. This requires much more code overhead and becomes difficult when using export_selection.
  2. Restructure how we store the data. This would be a major overhaul, given how embedded the dataframes are.
  3. Disable data_columns. This means no querying of the data on disk is possible, except for the string columns. This removes some functionality from the select function, but drastically increases the IO performance.

I've decided to go with option 3, as sketched below. This removes some querying options, but others are still available. If we would like to improve upon this, or allow custom data_columns as an input, this can be noted as an improvement for a later date.
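
For clarity, a small sketch of what option 3 means at the pandas level (illustrative only; the actual POSYDON write path differs):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 3), columns=['S1_mass', 'S2_mass', 'time'])
# With data_columns=True every column is indexed: appends are slow,
# but any column can be used in an on-disk query.
df.to_hdf('indexed.h5', key='df', format='table', data_columns=True)
heavy = pd.read_hdf('indexed.h5', key='df', where='S1_mass > 0.5')
# Without data_columns (the choice made here): appends stay fast,
# but on-disk queries are limited to the index.
df.to_hdf('plain.h5', key='df', format='table')
subset = pd.read_hdf('plain.h5', key='df', where='index < 100')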

maxbriel commented 4 months ago

This PR is ready to be merged.

I've added docstrings and documentation, and changed the tutorials to work with the new framework.

Additionally, this PR resolves several issues/enhancements:

  1. #179: All analysis is done in a single file.

  2. #184: A warning for V1 reprocessed data is now present.

  3. #198: Processing can now happen in chunks.

  4. #282: Corrected the normalisation.

  5. #283: spin_orbit_tilt_first_SN and spin_orbit_tilt_second_SN are now part of the default output file.

  6. #285: The simulated mass is calculated using the oneline masses.

  7. #321: Fixed by using np.asarray().

  8. #323: Fixed with an extra check to make sure the initial_sample file is not required for a sampled run.