Closed maxbriel closed 3 months ago
Update (Feb 7): @maxbriel to present a tutorial/training session after spring break on this new functionality. There were a few comments on variable/function naming. More testing is needed. Unit tests are needed, too. We will wait to accept the PR until it has been widely tested/used.
Update (Apr 11): Pop synth tutorial to be led by @maxbriel today. We want others to test this and use it to generate populations, looking for bugs/problems.
I'd suggest turning on the verbose output in the first tutorial of evolving 10 binaries. That way, a user who doesn't know POSYDON yet can see what is happening instead of getting no output at all from `poprun.evolve()`.
When you call `pop.history_lengths`, the name of the second column should be "length" instead of "index". At the moment both columns are called "index", which is confusing.
When you call `pop.formation_channels` you get both `channel_debug` and `channel`; what is the difference between them? And shouldn't we get `channel_debug` only when we are in a debug mode?
Is it possible to get more options for `pop.export_selection(indices, 'selected.h5')`? Usually I'd like to select not on index but on other criteria, as with the `select` function. Hence, wouldn't it be easy to let `export_selection` pass its arguments through the `select` function to obtain the indices and use those for creating the selection file? That way, one could use all the options of the `select` function in `export_selection` as well.
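For what it's worth, the suggested pass-through could look something like this sketch; the class and method signatures here are simplified, hypothetical stand-ins for the real `Population` interface, not the actual implementation:

```python
import pandas as pd

class Population:
    """Simplified stand-in for the real class; names are hypothetical."""
    def __init__(self, df):
        self.df = df

    def select(self, where=None, columns=None):
        # In-memory stand-in for the HDF5-backed select()
        out = self.df.query(where) if where else self.df
        return out[columns] if columns is not None else out

    def export_selection(self, filename, indices=None, **select_kwargs):
        # If no explicit indices are given, forward the remaining keyword
        # arguments to select() and use the resulting index instead.
        if indices is None:
            indices = self.select(**select_kwargs).index
        selection = self.df.loc[indices]
        # The real method would write e.g.
        # selection.to_hdf(filename, key='history', format='table');
        # returning it here keeps the sketch runnable without PyTables.
        return selection

pop = Population(pd.DataFrame({'S1_mass': [10.0, 35.0, 60.0]}))
print(len(pop.export_selection('selected.h5', where='S1_mass > 30')))  # 2
```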
It might be nice to get another column for a selected population, when calling `selected.mass_per_metallicity`, which contains the total mass of the selection.
In `rates.plot_hist_properties`, we'd need to generate axis labels based on the chosen value being plotted.
Just another small remark: consider renaming `setup-popsyn` to `posydon-setup-popsyn`, like we have it for running the grids and the pipeline.
I tried to follow the tutorial and ran into an error when trying to calculate the observable population of BBH mergers using a population run with NN. I did not run into this problem when using a population run with IF.
```
ValueError: Input X contains NaN.
KNeighborsRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
```
It seems to break when trying to predict points between the grid points of the sensitivity? @ZepeiX also had a similar error, but I do not know whether it was with an NN or IF population.
Update: the problem happens because some `S1_spin` and `S2_spin` values are NaN, which makes the `chi_eff` values NaN. These NaNs are then passed to the function `predict_pdet` in `selection_effects.py`.
It is happening for 142 of 41882 merging binary black holes. These BBH mergers are not specific to a particular formation channel or metallicity. They come from the formation channels `ZAMS_oRLO1_CC1_oRLO2_CC2_END`, `ZAMS_oRLO1-contact_CC1_oRLO2_CC2_END`, `ZAMS_oRLO1_CC1_oRLO2_oCE2_CC2_END`, and `ZAMS_CC1_oRLO2_oCE2_CC2_END`, with metallicities 0.1, 0.01, 0.001, 0.2.
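A possible guard, sketched with hypothetical column names: mask the NaN `chi_eff` rows and only pass the valid subset to the detection-probability estimator, instead of letting the NaNs reach `KNeighborsRegressor`:

```python
import numpy as np
import pandas as pd

# Toy transient population; column names mirror the ones mentioned above.
bbh = pd.DataFrame({
    'S1_spin': [0.1, np.nan, 0.4],
    'S2_spin': [0.2, 0.3, np.nan],
    'chi_eff': [0.15, np.nan, np.nan],
})

# Predict only on rows with finite chi_eff; give the NaN binaries zero
# detection probability (or log them) rather than crashing the regressor.
valid = bbh['chi_eff'].notna().to_numpy()
pdet = np.zeros(len(bbh))
pdet[valid] = 0.5  # stand-in for predict_pdet(bbh[valid], ...) in selection_effects.py
print(valid.sum(), (~valid).sum())  # 1 2
```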
I have seen an error in the logs when running a population:

```
Traceback (most recent call last):
  File "/srv/astro/projects/posydon/popsyn_runs/software/POSYDON_shared_240419_max_synpop_population/posydon/popsyn/binarypopulation.py", line 492, in combine_saved_files
    store.append('oneline', oneline,
  File "/srv/astro/projects/posydon/popsyn_runs/software/conda_env_shared_240419/lib/python3.11/site-packages/pandas/io/pytables.py", line 1317, in append
    self._write_to_group(
  File "/srv/astro/projects/posydon/popsyn_runs/software/conda_env_shared_240419/lib/python3.11/site-packages/pandas/io/pytables.py", line 1858, in _write_to_group
    s.write(
  File "/srv/astro/projects/posydon/popsyn_runs/software/conda_env_shared_240419/lib/python3.11/site-packages/pandas/io/pytables.py", line 4397, in write
    table = self._create_axes(
            ^^^^^^^^^^^^^^^^^^
  File "/srv/astro/projects/posydon/popsyn_runs/software/conda_env_shared_240419/lib/python3.11/site-packages/pandas/io/pytables.py", line 4113, in _create_axes
    new_table.validate(self)
  File "/srv/astro/projects/posydon/popsyn_runs/software/conda_env_shared_240419/lib/python3.11/site-packages/pandas/io/pytables.py", line 3446, in validate
    raise ValueError(
ValueError: invalid combination of [values_axes] on appending data [name->S2_m_disk_accreted,cname->S2_m_disk_accreted,dtype->float64,kind->float,shape->(1, 1000)] vs current table [name->S2_m_disk_accreted,cname->S2_m_disk_accreted,dtype->bytes24,kind->string,shape->None]
```
I think we simply missed defining the dtype of the `m_disk_accreted` columns, so we get different automatically inferred ones. There might be more columns that have no well-defined dtype.
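One way to guard against this, sketched with an assumed dtype map (the real map would have to cover all history/oneline columns): cast each batch to a declared dtype before it is written, so every chunk produces the same column kind regardless of what pandas infers:

```python
import numpy as np
import pandas as pd

# Two batches where the same column gets different inferred dtypes:
# an all-missing batch and a numeric one. Appended to one HDF5 table,
# these trigger the "invalid combination of [values_axes]" ValueError.
batch1 = pd.DataFrame({'S2_m_disk_accreted': [None, None]})
batch2 = pd.DataFrame({'S2_m_disk_accreted': [0.1, 0.2]})

# Declare the dtype explicitly before writing/appending.
# DTYPE_MAP is an assumption for illustration, not a POSYDON constant.
DTYPE_MAP = {'S2_m_disk_accreted': np.float64}
batch1 = batch1.astype(DTYPE_MAP)
batch2 = batch2.astype(DTYPE_MAP)
print(batch1.dtypes.iloc[0] == batch2.dtypes.iloc[0])  # True
```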
Update (Apr 25): We discussed this PR. @mpgalleg to look through this PR. @dimsour94 and @ZepeiX to continue looking through PR.
One point brought up by @ZepeiX - it takes a long time to combine the pandas data frames at the end of the simulation (~1 hr per metallicity, but run in parallel). @ZepeiX and @mpgalleg to benchmark a population run of 10^6.
This is something I mentioned before. The running of the population is optimised, but the merging is not. Essentially, the more files you split your run into, the longer the merging takes, and the increase is disproportionate to the change in parallel jobs. It should be fixed, but it is not essential for this PR.
The main issue arises from pandas' implementation of appending to a DataFrame in an HDF5 file with `format='table'` and `data_columns=True`. This format is required to enable querying and to append dataframes to the file. However, appending drastically increases the duration, especially for wide dataframes like POSYDON's. `data_columns=True` is not implemented in the development branch but was intended, since there were discussions about selecting only specific binaries from file.
A baseline example:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.randint(0, 100, size=(10000, 80)),
                  columns=[f'col_{i}' for i in range(80)])
mode = 'a'
```

Single write:

```python
df.to_hdf('data.h5', key='df', mode=mode, format='table', data_columns=True)
```

Duration: 685 ms

Append write:

```python
df.to_hdf('data_append.h5', key='df', format='table', data_columns=True)
df.to_hdf('data_append.h5', append=True, key='df', format='table', data_columns=True)
```

Duration: 5.49 s
This is a 10x increase with just a single append, and it becomes worse with every append: 2 appends take ~19 s, and while that is not another 10x increase, the duration keeps growing faster than linearly (3 appends take ~29 s).
There are two directions to consider:
Here's an example for the concat + write:

```python
df_out = pd.concat([df, df], axis=0, ignore_index=True, sort=False)
df_out.to_hdf('data_append.h5', key='df', format='table', data_columns=True)
```

Duration: 882 ms
This increase is less than with pandas' hdf append, but requires a larger memory footprint.
| number of dfs | time concat | time to_hdf() |
|---|---|---|
| 1 | 0.7 s | 0.7 s |
| 2 | 0.8 s | 5 s |
| 3 | 1.4 s | 19 s |
| 4 | 1.7 s | 29 s |
Writing 3 dataframes of different sizes to file; the `to_hdf()` duration remains roughly constant until the dataframes become large:
| df size | mem concat peak | mem to_hdf() peak | time concat | time to_hdf() |
|---|---|---|---|---|
| (100x80) | 149.11 MiB | 151 MiB | 0.6 s | 17.5 s |
| (1000x80) | 158.81 MiB | 160.62 MiB | 0.6 s | 17.5 s |
| (10000x80) | 309.72 MiB | 339.59 MiB | 1.37 s | 18.9 s |
| (100000x80) | 901.38 MiB | 605.34 MiB | 19 s | 32.5 s |
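For reference, here is a sketch of how peak-memory numbers like those above can be measured with `tracemalloc`; this is an assumed setup, not the exact one used for the table, and absolute numbers will differ by machine:

```python
import tracemalloc

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 80)),
                  columns=[f'col_{i}' for i in range(80)])

# Peak memory of the concat step only; NumPy registers its buffers with
# tracemalloc, so the dataframe copies are included in the peak.
tracemalloc.start()
out = pd.concat([df, df, df], axis=0, ignore_index=True, sort=False)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(out.shape, f'{peak / 2**20:.1f} MiB peak')
```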
Maybe I am missing something, but what exactly about the merging is different in this PR compared to the current development branch? I am finding that the merging completes in a reasonable time using the current development branch.
I've gone through the comments on this PR. They are either implemented, or I've left a comment explaining why not. If the issue did not arise from this PR, I created an issue describing the problem as best as I could.
The major issue that came forward was the duration of merging files after a run and during exporting a selection.
Context
The issue arises from pandas when `data_columns=True` is enabled. When one appends to an HDF5 file with this setting, the append takes a very long time compared to the initial write, probably due to the creation of a hash table. `data_columns` is required to query specific entries in the HDF5 file without having to read in all the data. This was implemented in `BinaryPopulation.save()`, but not properly propagated through the code in the `development` branch.

On this branch, it was correctly propagated to the file merging, which gave rise to this issue. (@celiotine)
Solution(s)

Several solutions that might work:

1. `export_selection` …
2. `data_columns` …
3. Only set `data_columns` for the string columns. This means no querying of the data on disk is allowed, except on the string columns. This removes functionality of the `select` function, but drastically increases the IO performance.

I've decided to go with 3. This removes some querying, but some is still available. If we would like to improve upon this, or allow custom `data_columns` as an input, this can be noted as an improvement for a later date.
This PR is ready to be merged.
I've added docstrings, documentation, and changed the tutorials to work with the new framework.
Additionally, this solves several issues/enhancements:

- `np.asarray()`

This addresses a few issues with the `SyntheticPopulation` class I ran into while selecting and analysing a different (specifically larger) population than the BBH population. (This PR replaces PR #200.)

This PR drastically changes the workflow of the population analysis to provide large flexibility for the user, i.e. being able to look at populations other than the BBHs and GRBs. Some of the old code is reused, but much is rewritten. I will summarise the changes in the following sections.
1. Running populations
Setting up multi-metallicity job arrays
Population runs can be set up using a new included command, `setup-popsyn "file.ini"`. This will create a set of scripts and a `slurm_submit.sh` file based on the provided ini file. For each metallicity, a job array with the population run will be submitted, plus a dependent job for merging the output of the job array.

Additional metadata

In the merge step, the total simulated mass and underlying mass are now calculated and stored in the combined file. The metallicity and number of systems are also stored.
Running populations locally

This can still be done, but has been moved to a specific class called `PopulationRunner`.

2. Population analysis
Each class provides an interface to components in a population file. Only when the user requests specific data, does it get loaded into memory.
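The lazy pattern described here can be sketched as follows; `LazyTable` and its loader are stand-ins for illustration, not the actual implementation:

```python
import pandas as pd

class LazyTable:
    """Reads the underlying data only on first access."""
    def __init__(self, loader):
        self._loader = loader   # e.g. lambda: pd.read_hdf(path, 'history')
        self._df = None

    @property
    def df(self):
        if self._df is None:
            self._df = self._loader()
        return self._df

reads = []
table = LazyTable(lambda: reads.append(1) or pd.DataFrame({'a': [1, 2]}))
print(len(reads))  # 0 -- nothing read yet
_ = table.df
_ = table.df
print(len(reads))  # 1 -- the data was read exactly once
```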
The Population class
The `SyntheticPopulation` class has been replaced with a `Population` class. This class can interface with the history and oneline dataframes in the file without loading all the data. Moreover, you can export a selection of binaries using your own selection criteria and use this to combine multiple metallicities together.
Furthermore, you can export any `TransientPopulation` using your own transient selection function. I've included ones for BBHs and GRBs.

Note: If a file comes from an old population run, you can make it work in the new analysis framework by calling it once with the ini file and its metallicity, i.e. `Population(pop_file, metallicity=met_in_solar_units, ini_file)`. This calculates the metadata, and afterwards you can just call the population like `Population(pop_file)`.

To-Do:

- Add `overwrite=True` to the `export_selection` function to make sure that the user has to choose to add to a file, instead of appending.

TransientPopulation class
The `df_synthetic` has been replaced by a class, which is an interface to the transient population in the file. This is a similar dataframe as before, but it requires `time` and `metallicity` columns to be present. The creation of the transient population is done using a function, which can be customised. This class inherits from the `Population` class. Plotting of delay time distributions and efficiencies over metallicities is possible.

Rates
This class is an interface to the intrinsic and observable rates. It allows for the creation of specific observability selection functions, and for plotting histograms weighted by the SFH rates. It inherits from the `TransientPopulation` class.
`transient_select_funcs.py`

A new submodule which contains the default transient selection functions and the DCO detection probability calculation.
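A custom transient selection function might look like the following sketch; the column names and the calling convention are assumptions for illustration, not the actual API:

```python
import pandas as pd

def bbh_selection(oneline):
    """Select BBH transients and return the required 'time' and
    'metallicity' columns plus any extra properties of interest."""
    bbh = oneline[(oneline['S1_state'] == 'BH') & (oneline['S2_state'] == 'BH')]
    return bbh[['time', 'metallicity', 'S1_mass']].reset_index(drop=True)

# Toy oneline dataframe standing in for a real population file.
oneline = pd.DataFrame({
    'S1_state': ['BH', 'NS', 'BH'],
    'S2_state': ['BH', 'BH', 'NS'],
    'time': [9.1, 5.0, 7.7],
    'metallicity': [0.001, 0.01, 0.02],
    'S1_mass': [30.0, 1.4, 25.0],
})
print(len(bbh_selection(oneline)))  # 1
```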
3. Tutorials
Due to the drastic change in the analysis framework, the tutorials required a rewrite too. This means that I reran some example populations, which will need to be replaced on either cluster.
The tutorials have all been updated. Additionally, I can add some more tutorials/notebooks for specific use cases, like those in the tutorial session.
4. Documentation
I've added documentation for the new classes and some for `SimulationProperties` + `BinaryPopulation`.