carbonscott / maxie

Masked Autoencoder for X-ray Image Encoding (MAXIE)

Dataset loading for eval #7

Open b-crouch opened 1 month ago

b-crouch commented 1 month ago

Addresses the issue of data organization during model eval (discussed in Slack) via two changes:

  1. Stratifies the train-test split when generating input JSONs (generate_dataset_in_json.py) so that the training and validation sets have the same distribution of detector types. This avoids the case where the two sets differ fundamentally in structure (e.g. significantly more Rayonix samples in one set than the other).
  2. Randomizes the order in which events across all experiments are yielded by the IPCDistributedSegmentedDataset entry generator (i.e. events from a single experiment run no longer appear consecutively). This avoids the situation where a sampler only draws events from the same experiment/run, and it also adds diversity in how many events from the early/middle/late stages of a run appear in a dataset sample.
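
For illustration only, the two changes could be sketched roughly as below. This is not the code in the PR: the entry layout (a dict with a `"detector"` key), the `(exp, run) -> number of events` mapping, and the function names `stratified_split` and `shuffled_events` are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(entries, frac_train=0.8, seed=0):
    """Split entries so every detector type keeps the same train/validation ratio.

    `entries` is assumed to be a list of dicts with a "detector" key
    (hypothetical layout, not necessarily the JSON schema used in the repo).
    """
    rng = random.Random(seed)

    # Group entries by detector type.
    by_detector = defaultdict(list)
    for entry in entries:
        by_detector[entry["detector"]].append(entry)

    train, val = [], []
    for detector in sorted(by_detector):
        group = by_detector[detector]
        rng.shuffle(group)
        # Apply the same split fraction inside every detector group.
        cut = round(frac_train * len(group))
        train.extend(group[:cut])
        val.extend(group[cut:])
    return train, val


def shuffled_events(events_per_run, seed=0):
    """Return every (exp, run, event_idx) triple in one globally shuffled list.

    `events_per_run` is assumed to map (exp, run) -> number of events.
    The full list is built in memory before shuffling.
    """
    all_events = [
        (exp, run, idx)
        for (exp, run), n in events_per_run.items()
        for idx in range(n)
    ]
    random.Random(seed).shuffle(all_events)
    return all_events
```

With 10 Rayonix entries and 20 entries from another detector, `stratified_split(entries, frac_train=0.8)` would put 8 and 16 of them in the training set respectively, so both sets keep the same 1:2 detector ratio.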

carbonscott commented 1 month ago

Thanks for the PR.

Shuffling the entire dataset in memory should work, but it might get trickier as the dataset grows, since the full list of events has to be held in memory before it can be shuffled.

Perhaps round-robin scheduling (https://en.wikipedia.org/wiki/Round-robin_scheduling) would help us in this case. Basically, we go through up to a certain number of examples for each experiment and then move on to the next experiment. We repeat this cycle until all generators have been exhausted.
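
A minimal sketch of that round-robin idea, assuming each experiment is exposed as an independent Python iterable (the name `round_robin` and the `chunk_size` parameter are made up for this example and are not part of the existing dataset class):

```python
from itertools import islice


def round_robin(generators, chunk_size=2):
    """Yield up to `chunk_size` items from each generator in turn.

    The cycle repeats until every generator is exhausted. Only one chunk
    per generator is held in memory at a time.
    """
    active = [iter(g) for g in generators]
    while active:
        still_active = []
        for it in active:
            chunk = list(islice(it, chunk_size))
            yield from chunk
            # A short (or empty) chunk means this generator has run dry.
            if len(chunk) == chunk_size:
                still_active.append(it)
        active = still_active


exp_a = ["a0", "a1", "a2"]
exp_b = ["b0"]
exp_c = ["c0", "c1", "c2", "c3"]
order = list(round_robin([exp_a, exp_b, exp_c], chunk_size=2))
# order == ["a0", "a1", "b0", "c0", "c1", "a2", "c2", "c3"]
```

Unlike a full shuffle, this keeps events within a chunk in their original order, so a smaller `chunk_size` gives more interleaving at the cost of switching between experiments more often.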