Closed by dgcnz 1 month ago
@Nesta-gitU I think it would be useful to brief @MeneerTS on this dataset generation at some point, so that at least one of us understands what the data really represents (ref: https://github.com/dgcnz/dl2/issues/36#issuecomment-2105706045)
Maybe this helps in the meantime (https://www.youtube.com/watch?v=KMfcF9XvVio)
I think it's done
Description
The current dataset contains 40 timesteps: 30 for training and 10 for validation, with the validation set reused as the test set. This is problematic because early stopping relies on validation metrics, so test leakage is occurring.
Although this task is important for an accurate reproduction, we can use the originally produced data in the meantime, so this does not need to be a P0 task.
Tasks
- `total_timesteps` (currently 40)
- `train_split` (currently 0.75)
- `val_split` (currently 0.25)
- `test_split` (currently 0)
- `equivariance_level`
- see the original data_gen notebook for other possible parameters

You can base your script on `jhtdb.generate`, but making a huggingface loading script is not required and possibly not advisable given the time constraints.
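As a minimal sketch of how the split parameters above could produce disjoint train/val/test sets (avoiding the current val-as-test reuse): the function below is hypothetical, not part of `jhtdb.generate`; the parameter names mirror the task list, and the 0.6/0.2/0.2 values are placeholders, not the final choice.

```python
def split_timesteps(total_timesteps: int,
                    train_split: float,
                    val_split: float,
                    test_split: float) -> dict:
    """Return contiguous, non-overlapping timestep ranges per split."""
    # Splits must cover all timesteps exactly once.
    assert abs(train_split + val_split + test_split - 1.0) < 1e-9
    n_train = int(total_timesteps * train_split)
    n_val = int(total_timesteps * val_split)
    return {
        "train": range(0, n_train),
        "val": range(n_train, n_train + n_val),
        "test": range(n_train + n_val, total_timesteps),
    }

splits = split_timesteps(total_timesteps=40, train_split=0.6,
                         val_split=0.2, test_split=0.2)
# train: 0..23, val: 24..31, test: 32..39 — val and test no longer overlap
```

Keeping the ranges contiguous in time also avoids leaking temporally adjacent frames between splits.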
Resources