AstroPile / FlatironMeeting2024

AstroPile meet-up at the Flatiron Institute
https://astropile.github.io/FlatironMeeting2024/
MIT License

[Infrastructure] ClusterSampler meta class for data loaders #25

Open MilesCranmer opened 3 months ago

MilesCranmer commented 3 months ago

Your Hack Title

A sampler class for sampling multiple sources according to some user-defined metric.

Contacts: @MilesCranmer @henrysky @maja-jablonska

Participants: @MilesCranmer @henrysky @maja-jablonska

Goals and deliverable

Create a type of data loader that can be used for sampling AstroPile datasets in clusters of sources. This would be particularly useful for Gaia-like datasets where an individual source (like a single star) is not as interesting as a cluster of stars (like a stellar stream).

Resources needed

Detailed description

Related:

MilesCranmer commented 3 months ago

Use case: stellar streams.

We want to be able to sample Gaia with some phase-space locality (i.e., closeness in position × velocity), as this would give us windows over the stars of a stellar stream.
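One way to picture "phase-space locality" is as a distance function over 6D coordinates. The sketch below is only an illustration: it assumes each source is already expressed as Cartesian `(x, y, z, vx, vy, vz)`, and the name `phase_space_distance` and the `pos_scale` / `vel_scale` parameters are hypothetical knobs for balancing position against velocity, not anything defined by AstroPile or Gaia.

```python
import math


def phase_space_distance(a, b, pos_scale=1.0, vel_scale=1.0):
    """Distance between two sources given as (x, y, z, vx, vy, vz) tuples.

    pos_scale and vel_scale are hypothetical normalisation constants
    (e.g. a typical stream width and velocity dispersion) that make the
    position and velocity terms comparable before combining them.
    """
    d_pos = math.dist(a[:3], b[:3]) / pos_scale  # scaled spatial separation
    d_vel = math.dist(a[3:], b[3:]) / vel_scale  # scaled velocity separation
    return math.hypot(d_pos, d_vel)              # combine in quadrature


star_a = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
star_b = (3.0, 0.0, 0.0, 0.0, 4.0, 0.0)
separation = phase_space_distance(star_a, star_b)
```

The choice of scales matters: with both set to 1, the result depends entirely on the units of the input columns.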

MilesCranmer commented 3 months ago

Use case: population-level analysis.

As suggested by @henrysky, we want to be able to sample all sources in Gaia of a particular class for population-level inference.

MilesCranmer commented 3 months ago

Use case: sub-population analysis.

We want to be able to sample all sources of a particular subpopulation in the Milky Way, based on proximity in inferred stellar age + metallicity + orbital actions.

maja-jablonska commented 3 months ago

Use case: classification (depending on the dataset), for example when a spectral dataset includes spectral classifications.

MilesCranmer commented 3 months ago

Idea for how to structure the sampler:

  1. The user would define a distance (either a metric function or a distance matrix) and pass it to a ClusterSampler class.
  2. They would pass this ClusterSampler instance to the DataLoader (@lhparker1).
  3. The ClusterSampler would build an approximate nearest-neighbor tree over the elements of the dataset using the distance metric. This tree would be used for quick nearest-neighbor lookups.
  4. For each mini-batch, the sampler would first sample a single source from Gaia, then collect the $k$ nearest sources to it from the dataset, using the cached approximate nearest neighbors. $k$ would be drawn from a Poisson distribution with a user-specified mean.
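The steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a proposed implementation: it swaps the approximate nearest-neighbor tree for an exact brute-force neighbor cache (fine only for tiny datasets), takes a metric function rather than a matrix, and the parameters `max_neighbors`, `num_batches`, and `seed` are hypothetical names introduced here.

```python
import math
import random


class ClusterSampler:
    """Yields mini-batches of dataset indices that are close under a user metric."""

    def __init__(self, points, distance, mean_k, max_neighbors, num_batches, seed=0):
        self.mean_k = mean_k
        self.num_batches = num_batches
        self.rng = random.Random(seed)
        # Stand-in for the approximate nearest-neighbor tree: an exact
        # brute-force cache of each point's nearest neighbors (O(n^2) calls).
        self.neighbors = []
        for i, p in enumerate(points):
            others = [j for j in range(len(points)) if j != i]
            others.sort(key=lambda j: distance(p, points[j]))
            self.neighbors.append(others[:max_neighbors])

    def _poisson(self):
        # Knuth's multiplication method; adequate for small means.
        limit = math.exp(-self.mean_k)
        k, prod = 0, self.rng.random()
        while prod > limit:
            k += 1
            prod *= self.rng.random()
        return k

    def __iter__(self):
        for _ in range(self.num_batches):
            # Pick one seed source uniformly at random.
            seed_idx = self.rng.randrange(len(self.neighbors))
            # Draw k ~ Poisson(mean_k), capped by the size of the cache.
            k = min(self._poisson(), len(self.neighbors[seed_idx]))
            # The batch is the seed plus its k nearest cached neighbors.
            yield [seed_idx] + self.neighbors[seed_idx][:k]

    def __len__(self):
        return self.num_batches


# Two well-separated toy "clusters" in 2D, with Euclidean distance.
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
sampler = ClusterSampler(toy_points, math.dist, mean_k=2.0,
                         max_neighbors=2, num_batches=5)
batches = list(sampler)
```

Because each batch is a list of indices, an object like this could plausibly be handed to a PyTorch-style `DataLoader` through its `batch_sampler=` argument; whether the AstroPile loader exposes the same hook is an assumption.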

Downsides of this: