Reorganize the code structure to match the distinct functionalities

mauriciogtec commented 1 year ago

This is a summary of desired codebase functionality:

Benchmark dataset (user side): It contains utils to generate new evaluation benchmark datasets. This task requires downloading the core datasets from Dataverse, including their metadata, and combining them with a "sampling model". Thus, its key components are:
- Dataverse API: It contains the code to download the core datasets from Dataverse and their metadata. Datasets contain covariates $X$, exposure $A$, and counterfactual predictions. The latter don't include the sampling variability or "error" in $Y$. The API should know what to download based on user specifications and a metadata master file.
- Sampling models: It defines a model for the sampling variability/error in $Y$. For example, the repo currently supports a Gaussian Process error.
- Dataset generation: It combines the core datasets with the sampling model to generate new benchmark datasets.
- A master file should be created to keep track of the core datasets and their metadata. This file should be updated as new datasets are added to the repo.
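
As a rough sketch of the dataset generation step described above: a core dataset carries noise-free counterfactual predictions, and a sampling model supplies the error that is added to the prediction at each unit's observed exposure. All names here (`CoreDataset`, `iid_gaussian_error`, `generate_outcome`) are hypothetical, and the i.i.d. Gaussian error is only a simple stand-in for the Gaussian Process error mentioned above.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CoreDataset:
    """Hypothetical container for one core dataset downloaded from Dataverse."""
    covariates: List[List[float]]       # X, one row per unit
    exposure: List[int]                 # A, index of the observed exposure level
    counterfactuals: List[List[float]]  # counterfactuals[i][a]: noise-free outcome of unit i under level a


def iid_gaussian_error(n: int, scale: float, rng: random.Random) -> List[float]:
    """Assumed stand-in sampling model: independent Gaussian errors (not a GP)."""
    return [rng.gauss(0.0, scale) for _ in range(n)]


def generate_outcome(core: CoreDataset, errors: Sequence[float]) -> List[float]:
    """Observed Y = counterfactual prediction at the observed exposure + sampled error."""
    if len(errors) != len(core.exposure):
        raise ValueError("need exactly one error term per unit")
    return [
        core.counterfactuals[i][a] + e
        for i, (a, e) in enumerate(zip(core.exposure, errors))
    ]


core = CoreDataset(
    covariates=[[0.1], [0.2], [0.3]],
    exposure=[0, 1, 1],
    counterfactuals=[[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
)
rng = random.Random(0)
y = generate_outcome(core, iid_gaussian_error(len(core.exposure), 0.5, rng))
```

Keeping the sampling model separate from the core dataset means a new error model can be swapped in without downloading anything again.
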
Evaluation Metrics and Tools: Tools for the user to evaluate the performance of a spatial confounding algorithm in a standardized way. It needs to define evaluation metrics. Some examples are:
- For the binary case: Bias, MSE, PEHE, Variance
- For the continuous case: similar metrics but integrated along the curve.
- Standard error computations
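
A minimal sketch of the binary-case metrics listed above, assuming the effect estimates come from repeated benchmark draws and that PEHE is reported as a root mean squared error over units (one common convention; the thread does not fix a definition). The function names are hypothetical.

```python
import math
from statistics import mean, pvariance
from typing import Sequence


def bias(estimates: Sequence[float], truth: float) -> float:
    """Average estimate minus the true effect."""
    return mean(estimates) - truth


def variance(estimates: Sequence[float]) -> float:
    """Population variance of the estimates across replicates."""
    return pvariance(estimates)


def mse(estimates: Sequence[float], truth: float) -> float:
    """Mean squared error of the estimates around the true effect."""
    return mean([(e - truth) ** 2 for e in estimates])


def pehe(est_effects: Sequence[float], true_effects: Sequence[float]) -> float:
    """Root mean squared error between estimated and true unit-level effects."""
    if len(est_effects) != len(true_effects):
        raise ValueError("effect vectors must have the same length")
    return math.sqrt(mean([(e - t) ** 2 for e, t in zip(est_effects, true_effects)]))


estimates = [1.0, 2.0, 3.0]  # made-up estimates from three replicates
truth = 2.0
# With the population variance, MSE decomposes as bias^2 + variance.
decomposition_gap = mse(estimates, truth) - (bias(estimates, truth) ** 2 + variance(estimates))
```
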
Training counterfactual/synthetic models: As detailed in the paper, we plan to support various methods to generate synthetic datasets (various types of NNs, forests, and ensembles). This code is not required from the user side, but it must be available for reproducibility purposes. The installation of the required dependencies must be optional. It would also be nice to have an API that users can use to train their own synthetic models.
Spatial confounding algorithms: Implementations of various algorithms for resolving spatial confounding, exposed through a common interface.
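
One way to keep the training dependencies optional is to import them lazily and fail with an actionable message. This is only a sketch: the helper name `require_optional` and the extra name `training` are assumptions, and `package_name` is the placeholder used elsewhere in this thread.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or explain how to install it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is needed for this feature. "
            f"Install it with: pip install 'package_name[{extra}]'"
        ) from exc


# A standard-library module stands in for a heavy training dependency here.
json_module = require_optional("json", extra="training")
```

Calling the helper inside the training functions, rather than at import time, lets users who only need the benchmark datasets import the package without the training stack.
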
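
A common interface could look roughly like the abstract base class below. The class and method names are hypothetical, and the difference-in-means baseline is included only to show how a concrete algorithm would plug in; it does not adjust for spatial confounding.

```python
from abc import ABC, abstractmethod
from statistics import mean
from typing import Sequence


class SpatialConfoundingAlgorithm(ABC):
    """Hypothetical shared interface for all algorithms in the package."""

    @abstractmethod
    def fit(
        self,
        covariates: Sequence[Sequence[float]],
        exposure: Sequence[int],
        outcome: Sequence[float],
    ) -> "SpatialConfoundingAlgorithm":
        """Fit the method on one benchmark dataset and return self."""

    @abstractmethod
    def estimate_effect(self) -> float:
        """Return the estimated average effect of the binary exposure."""


class DifferenceInMeans(SpatialConfoundingAlgorithm):
    """Naive baseline: ignores the covariates and any spatial structure."""

    def fit(self, covariates, exposure, outcome):
        treated = [y for a, y in zip(exposure, outcome) if a == 1]
        control = [y for a, y in zip(exposure, outcome) if a == 0]
        if not treated or not control:
            raise ValueError("need at least one treated and one control unit")
        self._effect = mean(treated) - mean(control)
        return self

    def estimate_effect(self) -> float:
        return self._effect


baseline = DifferenceInMeans().fit(
    covariates=[[0.0], [0.0], [0.0], [0.0]],
    exposure=[1, 1, 0, 0],
    outcome=[3.0, 5.0, 1.0, 2.0],
)
effect = baseline.estimate_effect()
```

With every algorithm exposing the same two methods, the evaluation tools can loop over algorithms without special cases.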

In accordance with the above, here's an example of how the code could be reorganized:

package_name/
├── __init__.py
│
├── datasets/
│   ├── __init__.py
│   ├── dataverse_api.py
│   ├── sampling_models.py
│   ├── user_api.py  (e.g., to be imported in __init__.py)
│   └── metadata_master.yaml
│
├── evaluation/
│   ├── __init__.py
│   ├── binary_metrics.py
│   └── continuous_metrics.py
│
├── synthetic_generation/
│   ├── __init__.py
│   ├── models.py 
│   └── training_api.py
│
├── algorithms/
│   ├── __init__.py
│   ├── algorithm1.py
│   ├── algorithm2.py
│   └── algorithm3.py
│
├── setup.py
├── LICENSE
└── README.md

Naeemkh commented 1 year ago

@mauriciogtec, this is great. I just want to add one more point: all source code folders should be nested under another folder with the same name as the package.

Here is the modified version:

.
└── packagename
    ├── LICENSE
    ├── README.md
    ├── packagename
    │   ├── __init__.py
    │   ├── algorithms
    │   │   ├── __init__.py
    │   │   ├── algorithm1.py
    │   │   ├── algorithm2.py
    │   │   └── algorithm3.py
    │   ├── datasets
    │   │   ├── __init__.py
    │   │   ├── dataverse_api.py
    │   │   ├── metadata_master.yaml
    │   │   ├── sampling_models.py
    │   │   └── user_api.py
    │   ├── evaluation
    │   │   ├── __init__.py
    │   │   ├── binary_metrics.py
    │   │   └── continuous_metrics.py
    │   └── synthetic_generation
    │       ├── __init__.py
    │       ├── models.py
    │       └── training_api.py
    └── setup.py

Distribution should include one module at a time.

katehu commented 1 year ago

Maybe have a folder called tools that includes, for example, visualization tools?

katehu commented 1 year ago

We should probably add a requirements.txt file too.

NSAPH-Projects / space

Reorganize the code structure to match the distinct functionalities #29