MilesCranmer commented 3 months ago

Your Hack Title

A sampler class for sampling multiple sources according to some user-defined metric.

Contacts: @MilesCranmer @henrysky @maja-jablonska
Participants: @MilesCranmer @henrysky @maja-jablonska

Goals and deliverable

Create a type of data loader that can be used for sampling AstroPile datasets in clusters of sources. This would be particularly useful for Gaia-like datasets where an individual source (like a single star) is not as interesting as a cluster of stars (like a stellar stream).

Resources needed

Small subset of Gaia for testing our cluster sampler.
List of potential use-cases (both galactic science and extragalactic).

Detailed description

PyTorch Geometric samplers: https://pytorch-geometric.readthedocs.io/en/latest/modules/sampler.html

MilesCranmer commented 3 months ago

Use case: stellar streams.

We want to be able to sample Gaia with some phase-space locality (i.e., position × velocity), as this would let us draw windows over the stars of a stellar stream.
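One possible way to express phase-space locality is a distance over 6D phase-space coordinates. The sketch below is only an illustration: the function name `phase_space_distance`, the `(x, y, z, vx, vy, vz)` layout, and the `velocity_scale` parameter (which converts velocity differences into position units) are assumptions, not something defined in this issue.

```python
import math


def phase_space_distance(a, b, velocity_scale=1.0):
    """Euclidean distance between two sources in 6D phase space.

    a, b: sequences laid out as (x, y, z, vx, vy, vz) (assumed layout).
    velocity_scale: hypothetical factor that weights velocity differences
    relative to position differences.
    """
    d_pos = sum((p - q) ** 2 for p, q in zip(a[:3], b[:3]))
    d_vel = sum((p - q) ** 2 for p, q in zip(a[3:], b[3:]))
    return math.sqrt(d_pos + velocity_scale ** 2 * d_vel)


# Two sources at the same position but with different velocities.
star_a = (8.0, 0.0, 0.0, 0.0, 220.0, 0.0)
star_b = (8.0, 0.0, 0.0, 0.0, 230.0, 0.0)
print(phase_space_distance(star_a, star_b, velocity_scale=0.1))
```

The choice of `velocity_scale` decides how much a velocity offset counts compared with a spatial offset, so it would probably need tuning per use case.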

MilesCranmer commented 3 months ago

Use case: population-level analysis.

As suggested by @henrysky, we want to be able to sample all sources in Gaia of a particular class for population-level inference.
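For this use case the "distance" degenerates into class membership, so a sampler could simply group indices by label. A minimal sketch, assuming the dataset exposes one class label per source (the helper name `indices_by_class` and the example labels are hypothetical):

```python
from collections import defaultdict


def indices_by_class(labels):
    """Map each class label to the list of dataset indices carrying it."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    return dict(groups)


# Hypothetical labels for five sources.
labels = ["giant", "dwarf", "giant", "dwarf", "giant"]
groups = indices_by_class(labels)
print(groups["giant"])  # all indices labelled "giant"
```

A batch for population-level inference would then be one of these index lists, or a random subset of it when the class is too large for a single batch.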

MilesCranmer commented 3 months ago

Use case: sub-population analysis.

We want to be able to sample all sources of a particular subpopulation in the Milky Way, based on proximity in inferred stellar age + metallicity + orbital actions.
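Age, metallicity, and orbital actions have different units and ranges, so a plain Euclidean distance over them would be dominated by whichever feature has the largest numbers. One common workaround is to standardize each feature before measuring proximity. The sketch below assumes each source is a tuple of numeric features; the helper name `standardize` and the example values are made up for illustration.

```python
import statistics


def standardize(rows):
    """Rescale each feature column to zero mean and unit standard deviation.

    rows: list of equal-length tuples, one per source.
    Constant columns are left centred at zero instead of dividing by zero.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


# Hypothetical (age, metallicity, action) values for three sources.
features = [(4.5, -0.1, 1200.0), (10.0, -1.5, 300.0), (5.0, 0.0, 1100.0)]
print(standardize(features))
```

The standardized tuples can then be fed to whatever distance the sampler uses.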

maja-jablonska commented 3 months ago

Use case: classification (depending on the dataset), for example, if a spectral dataset includes spectral classification labels.

MilesCranmer commented 3 months ago

Idea for how to structure the sampler:

The user would define a distance (either a metric function or a precomputed distance matrix) and pass it to a ClusterSampler class.
They would then pass this ClusterSampler instance to the DataLoader (@lhparker1).
The ClusterSampler would build an approximate nearest-neighbor tree over the elements of the dataset using the distance metric. This tree would be used for quick nearest-neighbor lookups.
- Prefer https://github.com/spotify/annoy, a fast and easy-to-use approximate nearest-neighbor library.
  - See https://github.com/erikbern/ann-benchmarks for a list of other approximate nearest-neighbor tools.
For a single mini-batch, the sampler would first sample a single source from Gaia, then collect the $k$ nearest sources from the dataset using the cached approximate nearest neighbors. $k$ would be drawn from a Poisson distribution with a user-specified mean.
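The steps above can be sketched roughly as follows. This is a standard-library illustration, not a proposed implementation: it replaces the approximate nearest-neighbor tree with an exact O(n²) neighbor cache so that an arbitrary user metric can be used (Annoy itself only offers a fixed set of built-in metrics), and the constructor parameters `mean_k`, `num_batches`, `max_neighbors`, and `seed` are assumptions.

```python
import math
import random


class ClusterSampler:
    """Yield mini-batches of indices: a random seed source plus its nearest neighbors."""

    def __init__(self, points, distance, mean_k, num_batches,
                 max_neighbors=None, seed=None):
        self.rng = random.Random(seed)
        self.mean_k = mean_k
        self.num_batches = num_batches
        n = len(points)
        cap = n - 1 if max_neighbors is None else min(max_neighbors, n - 1)
        # Exact brute-force neighbor cache; a stand-in for the approximate
        # nearest-neighbor index described above.
        self.neighbors = []
        for i in range(n):
            others = sorted(
                (j for j in range(n) if j != i),
                key=lambda j: distance(points[i], points[j]),
            )
            self.neighbors.append(others[:cap])

    def _poisson(self, lam):
        # Knuth's multiplication method; adequate for small means.
        threshold = math.exp(-lam)
        k, p = 0, 1.0
        while True:
            p *= self.rng.random()
            if p <= threshold:
                return k
            k += 1

    def __iter__(self):
        for _ in range(self.num_batches):
            center = self.rng.randrange(len(self.neighbors))
            k = self._poisson(self.mean_k)
            # Slicing truncates k to the number of cached neighbors.
            yield [center] + self.neighbors[center][:k]

    def __len__(self):
        return self.num_batches


# Two well-separated 1D "clusters"; math.dist is the Euclidean distance.
points = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,), (10.2,)]
sampler = ClusterSampler(points, math.dist, mean_k=2, num_batches=4, seed=0)
for batch in sampler:
    print(batch)
```

Each batch is a list of dataset indices. If the DataLoader here turns out to be PyTorch's `torch.utils.data.DataLoader` (an assumption), an object like this could be passed as its `batch_sampler` argument, which accepts any iterable that yields lists of indices.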

Downsides of this:

I think you might want to allow for a slightly randomized distance metric each time. I'm not sure how this would work. We can figure that out later.
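One possible (untested) direction, sketched here only to make the idea concrete: perturb per-feature weights of a weighted Euclidean metric each time a metric is requested. The function name `make_jittered_metric` and its parameters are hypothetical, and this does not address how a cached neighbor index would cope with a changing metric.

```python
import math
import random


def make_jittered_metric(base_weights, jitter=0.1, rng=None):
    """Return a weighted Euclidean metric with randomly perturbed weights.

    Each weight is multiplied by exp(N(0, jitter)), so it stays positive.
    """
    rng = rng or random.Random()
    weights = [w * math.exp(rng.gauss(0.0, jitter)) for w in base_weights]

    def metric(a, b):
        return math.sqrt(
            sum(w * (p - q) ** 2 for w, p, q in zip(weights, a, b))
        )

    return metric


# A fresh, slightly different metric could be drawn for each mini-batch.
rng = random.Random(0)
metric = make_jittered_metric([1.0, 1.0], jitter=0.1, rng=rng)
print(metric((0.0, 0.0), (3.0, 4.0)))
```

Because the weights change, neighbors cached under the unperturbed metric would only be approximately right, which is part of what remains to be figured out.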

AstroPile / FlatironMeeting2024

[Infrastructure] ClusterSampler meta class for data loaders #25
