Valentyn1997 / CausalTransformer

Code for the paper "Causal Transformer for Estimating Counterfactual Outcomes"
MIT License

Help with data structures in a dictionary produced by SyntheticCancerDataset #9

Open linaske opened 1 month ago

linaske commented 1 month ago

Good day @Valentyn1997,

I am privileged to explore your excellent paper and its implementation for my thesis work!

My current aim is to transform the Tumor Growth dataset into a tabular format so that I can use it to train another model. However, I struggle to understand the data structures produced by an instance of SyntheticCancerDataset.

For example, when I run a simple snippet like this:

import pandas as pd
import numpy as np
from src.data.cancer_sim.dataset import SyntheticCancerDatasetCollection
from src.data.cancer_sim.dataset import SyntheticCancerDataset

# Define the parameters
chemo_coeff = 0.5
radio_coeff = 0.5
num_patients = 10
seed = 5
window_size = 15
seq_length = 10
subset_name = 'train'
mode = 'factual'
projection_horizon = 10
lag = 0
cf_seq_mode = 'sliding_treatment'
treatment_mode = 'multiclass'

# Create an instance of the class
df = SyntheticCancerDataset(
    chemo_coeff,
    radio_coeff,
    num_patients,
    window_size,
    seq_length,
    subset_name,
    mode,
    projection_horizon,
    seed,
    lag,
    cf_seq_mode,
    treatment_mode
)

scaling_params = df.get_scaling_params()
df.process_data(scaling_params)

# Get the data for the first patient
first_patient_data = df[0]
print(first_patient_data)

I get a dictionary with multiple arrays of different lengths.

Could you help me understand why some arrays have 10 items, whereas others have only 9? Similarly, could you give me pointers on how to transform this simple dictionary with data for one patient into a tabular format? I am mainly interested in the one-hot encoded covariates for historical radio/chemo application and the historical tumour volume.

Thank you very much in advance!

angeruzzi commented 4 weeks ago

Hello, I also had this question about the difference between the lengths of the series. Another point: with the default parameters, some series are generated with more dimensions. For example, prev_treatments and current_treatments have 59 x 4 dimensions, unlike the others, which are single series.
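
For anyone with the same question, here is a minimal sketch of one way to flatten such a per-patient dictionary into a table. It does not call the repository code. The toy dictionary, its key names and shapes, and the helper patient_dict_to_frame are assumptions made for illustration; they only mimic the situation described above (some arrays with 10 steps, some with 9, some with 4 columns). The idea is to pick one time-axis length, keep only the arrays that match it, and spread multi-column arrays over several table columns.

```python
import numpy as np
import pandas as pd


def patient_dict_to_frame(patient, n_steps):
    """Collect every array whose first axis has n_steps rows into one table.

    Hypothetical helper, not part of the CausalTransformer repository.
    """
    columns = {}
    for key, value in patient.items():
        arr = np.asarray(value)
        # Skip scalars and arrays that live on a different time grid.
        if arr.ndim == 0 or arr.shape[0] != n_steps:
            continue
        # (T,) becomes (T, 1); (T, k) stays (T, k).
        arr = arr.reshape(n_steps, -1)
        if arr.shape[1] == 1:
            columns[key] = arr[:, 0]
        else:
            # One table column per feature dimension, e.g. per one-hot class.
            for j in range(arr.shape[1]):
                columns[f"{key}_{j}"] = arr[:, j]
    frame = pd.DataFrame(columns)
    frame.index.name = "time_step"
    return frame


# Toy stand-in for one patient; key names and shapes are assumptions.
toy_patient = {
    "cancer_volume": np.linspace(1.0, 2.0, 10),                  # 10 steps
    "prev_treatments": np.eye(4)[[0, 1, 2, 3, 0, 1, 2, 3, 0]],   # 9 x 4 one-hot
    "prev_outputs": np.linspace(0.1, 0.9, 9).reshape(9, 1),      # 9 x 1
    "sequence_lengths": np.array(9),                             # scalar
}

table = patient_dict_to_frame(toy_patient, n_steps=9)
print(table)
```

With the real dataset, the same helper could be called on df[0] with n_steps set to whichever length the arrays of interest share. Arrays of the other length would need a separate call, since they may be shifted by one time step relative to the rest.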