Feature: Implement more flexible control over data dimensionality

flo-schu commented 9 months ago

Currently pymob only supports datasets in which all variables have the same dimensionality. For the case study reversible_damage, this no longer holds when several substances are included simultaneously. It becomes even less feasible when datasets with potentially 100s to 10,000s of gene expression signals are included. More granular control is needed over which data variables have which dimensionality.

For now, this should be solved with a workaround implemented in the solver that is specific to the reversible_damage problem, because a general solution involves major breaking changes to the current API. In general, though, it would be desirable to have this capability.

The implementation should be rebased on #2, because it can make good use of the new config API and the other refactorings that have been implemented.

[ ] Refactor scaling. A new scaler should be implemented for each variable and potentially for each subdimension. Proposed solution: use common dimensions for scaling, e.g. always scale over id and time, and create a scaler for each coordinate in the remaining dimensions.
[x] Provide new settings dataset.dimensions and dataset.dimension.coordinates. These also give fine control over the data points to be included in the analysis.
[x] Implement tracking of hidden variables
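
A minimal sketch of the proposed scaling scheme, in plain Python rather than pymob's actual data structures: values are pooled over the common dimensions (id and time) and one min-max scaler is fitted per coordinate of the remaining dimensions. The function names, the record layout and the choice of min-max scaling are assumptions made for illustration only.

```python
from collections import defaultdict


def fit_minmax_scalers(records, dims, scale_over=("id", "time")):
    """Fit one (min, max) pair per coordinate of the dimensions not in scale_over.

    records: iterable of (coords, value), where coords maps dimension -> coordinate.
    Hypothetical helper, not part of pymob.
    """
    remaining = [d for d in dims if d not in scale_over]
    groups = defaultdict(list)
    for coords, value in records:
        # Values that differ only in id/time end up in the same group.
        key = tuple(coords[d] for d in remaining)
        groups[key].append(value)
    return {key: (min(vals), max(vals)) for key, vals in groups.items()}


def scale(value, scaler):
    """Map a value to [0, 1] with a fitted (min, max) scaler."""
    lo, hi = scaler
    if hi == lo:
        return 0.0  # constant data: avoid division by zero
    return (value - lo) / (hi - lo)


# Made-up example values for a variable with dimensions id, substance, time.
records = [
    ({"id": 0, "substance": "diuron", "time": 0}, 0.0),
    ({"id": 0, "substance": "diuron", "time": 24}, 10.0),
    ({"id": 1, "substance": "naproxen", "time": 0}, 100.0),
    ({"id": 1, "substance": "naproxen", "time": 24}, 300.0),
]
scalers = fit_minmax_scalers(records, dims=("id", "substance", "time"))
```

For a variable that only has the dimensions id and time, no dimensions remain, so the same function yields a single scaler under the empty key `()`.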

flo-schu commented 9 months ago

Suggestion for cfg files

[simulation]
input_files = params.json
dimensions = id substance time
modeltype = stochastic
# exclude 2nd Aulhorn experiment
substance_range = 0 inf
apical_effect = lethal
hpf = 24
data_variables = cext cint nrf2 lethality
data_variables_max = nan nan nan 1
data_variables_min = 0 0 0 0
seed = 1

[dataset.dimensions]
# describe the dimensionality of the dataset. This setting is essential to 
# automatically assemble, scale and compare simulation and observation datasets.
cext = id substance time
cint = id substance time
nrf2 = id time
lethality = id time

[dataset.dimension.coordinates]
# optionally give the coordinates of the dimensions. This setting also 
# modifies which datapoints of the dataset will be used for comparison
# e.g. time=24 will only include observations after 24 h in the dataset.
substance = diuron diclofenac naproxen
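
As a rough illustration of how such a file could be read, here is a sketch using Python's standard configparser. It is not pymob's config backend. The function name read_dimensions is hypothetical, and falling back to the [simulation] dimensions for variables without their own entry is an assumption.

```python
import configparser

# Shortened copy of the suggested cfg file.
CFG = """
[simulation]
dimensions = id substance time
data_variables = cext cint nrf2 lethality

[dataset.dimensions]
cext = id substance time
cint = id substance time
nrf2 = id time
lethality = id time

[dataset.dimension.coordinates]
substance = diuron diclofenac naproxen
"""


def read_dimensions(text):
    """Return per-variable dimensions and optional coordinates (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    default_dims = parser.get("simulation", "dimensions")
    variables = parser.get("simulation", "data_variables").split()

    # Assumption: variables without an entry use the [simulation] dimensions.
    dims = {
        var: parser.get("dataset.dimensions", var, fallback=default_dims).split()
        for var in variables
    }
    coords = {
        dim: values.split()
        for dim, values in parser.items("dataset.dimension.coordinates")
    }
    return dims, coords


dims, coords = read_dimensions(CFG)
```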

flo-schu commented 7 months ago

This also addresses the problem that the Simulation class currently can no longer be used for a vanilla simulation without data!

The problem is that too many methods used in __init__ were developed under the reversible-damage project branch.

Consider using only a minimal __init__ and instead defining methods that deliver the desired functionality, such as

derive_dimensionality_from_data
derive_coordinates_from_data

instead of having to specify it manually.
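
A sketch of what this could look like, assuming a stand-in Simulation class and a plain dict layout for the data. Neither reflects pymob's real class or its data structures. Only the two method names come from the suggestion above.

```python
class Simulation:
    """Hypothetical stand-in with a minimal __init__; works without data."""

    def __init__(self, data=None):
        # Assumed layout: variable -> {"dims": tuple, "coords": {dim: list}}
        self.data = data or {}

    def derive_dimensionality_from_data(self):
        """Return the dimensions of each data variable."""
        return {var: tuple(entry["dims"]) for var, entry in self.data.items()}

    def derive_coordinates_from_data(self):
        """Return the union of coordinates per dimension, in first-seen order."""
        coords = {}
        for entry in self.data.values():
            for dim, values in entry["coords"].items():
                seen = coords.setdefault(dim, [])
                for value in values:
                    if value not in seen:
                        seen.append(value)
        return coords


# Vanilla simulation without data: nothing has to be specified.
empty = Simulation()

# Simulation with data: dimensionality and coordinates are derived on request.
sim = Simulation(
    data={
        "cext": {
            "dims": ("id", "substance", "time"),
            "coords": {"id": [0, 1], "substance": ["diuron", "naproxen"], "time": [0, 24]},
        },
        "nrf2": {
            "dims": ("id", "time"),
            "coords": {"id": [0, 1, 2], "time": [0, 24, 48]},
        },
    }
)
```

Because __init__ does no data-dependent work, the derivation methods are only called when a dataset is actually present.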

REMEMBER! It should always be easy to use the tool.

flo-schu commented 2 months ago

Coordinates are currently not specified via the config backend, but are extracted from the data.

flo-schu / pymob

Feature: Implement more flexible control over data dimensionality #6