Closed by dgcnz 1 month ago
@Nesta-gitU I think it would be useful to brief @MeneerTS on this dataset generation at some point, so that at least one of us understands what the data really represents (ref: https://github.com/dgcnz/dl2/issues/36#issuecomment-2105706045)
Maybe this helps in the meantime (https://www.youtube.com/watch?v=KMfcF9XvVio)
I think it's done
Description
The current dataset contains 40 timesteps: 30 for training and 10 for validation, with the validation set reused as the test set. This is problematic because early stopping relies on validation metrics, so test leakage is occurring.
Although this task is important for an accurate reproduction, we can use the originally produced data in the meantime, so this does not need to be a P0 task.
Tasks
- `total_timesteps` (currently 40)
- `train_split` (currently 0.75)
- `val_split` (currently 0.25)
- `test_split` (currently 0)
- `equivariance_level`
- see the original data_gen notebook for other possible parameters

You can base your script on `jhtdb.generate`, but making a huggingface loading script is not required and possibly not advisable given the time constraints.
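As a minimal sketch of how the split parameters above could produce disjoint train/val/test sets (avoiding the current val-as-test reuse): the function below is hypothetical, not part of `jhtdb.generate`; the parameter names mirror the task list, and the 0.6/0.2/0.2 values are placeholders, not the final choice.

```python
def split_timesteps(total_timesteps: int,
                    train_split: float,
                    val_split: float,
                    test_split: float) -> dict:
    """Return contiguous, non-overlapping timestep ranges per split."""
    # Splits must cover all timesteps exactly once.
    assert abs(train_split + val_split + test_split - 1.0) < 1e-9
    n_train = int(total_timesteps * train_split)
    n_val = int(total_timesteps * val_split)
    return {
        "train": range(0, n_train),
        "val": range(n_train, n_train + n_val),
        "test": range(n_train + n_val, total_timesteps),
    }

splits = split_timesteps(total_timesteps=40, train_split=0.6,
                         val_split=0.2, test_split=0.2)
# train: 0..23, val: 24..31, test: 32..39 — val and test no longer overlap
```

Keeping the ranges contiguous in time also avoids leaking temporally adjacent frames between splits.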
Resources