HotspotStoplight / Climate


Two levels of stratified sampling? #51

Closed by nlebovits 1 week ago

nlebovits commented 7 months ago

See if it's possible to add a second level of stratified sampling so that we take an equal number of samples from 1) flooded and unflooded pixels and 2) each land cover class.

nlebovits commented 7 months ago

Yep--it's called hierarchical stratified sampling. See this from ChatGPT:

Yes, you can add multiple levels to the sampling step of your machine learning workflow on Google Earth Engine (GEE) so that you take equal samples of flooded and unflooded pixels from each land cover type within the bounding box. This approach is often called stratified sampling with multiple variables, or hierarchical stratified sampling. Rather than stratifying by flood condition alone, you stratify by every combination of flood condition and land cover type, so each combination is equally represented in your training dataset.
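As a quick illustration of what the crossed strata look like, here is a minimal plain-Python sketch (not GEE code). The label lists and the count of 100 reuse the example values from this comment; the variable names `strata` and `total_samples` are assumptions made for the sketch.

```python
from itertools import product

# Example labels and per-stratum sample count (illustrative values)
flood_conditions = ["flooded", "unflooded"]
land_cover_types = ["forest", "urban", "agricultural"]
samples_per_stratum = 100

# Every combination of flood condition and land cover type is one stratum
strata = [
    f"{flood_condition} {land_cover_type}"
    for flood_condition, land_cover_type in product(flood_conditions, land_cover_types)
]

# A fully balanced sample has the same number of points in every stratum
total_samples = len(strata) * samples_per_stratum

print(strata)
print(total_samples)  # 600
```

With two flood conditions and three land cover types there are six strata, so the balanced training set in this example would hold 600 points if every stratum has enough pixels.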

Here's an outline of how you might implement this:

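Prepare Your Data: Ensure your dataset includes labels for both flood condition (e.g., flooded, unflooded) and land cover type (e.g., forest, urban, agricultural). In GEE, stratifiedSample() is an ee.Image method and requires an integer-valued class band, so both labels should be available as integer-coded image bands.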

Define Your Strata: You will need to define strata that are combinations of flood condition and land cover type. For instance, your strata might be "flooded forest," "unflooded forest," "flooded urban," and so on.

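Sampling:
    Use the ee.Image.stratifiedSample() method in GEE to select an equal number of pixels from each stratum. numPoints applies to every distinct value of classBand, so each stratum gets the same representation in your sample; a stratum with fewer pixels than numPoints contributes all of its pixels.
    It's crucial to define a unique identifier for each stratum, which could be a combination of the flood condition and land cover type codes. stratifiedSample() requires classBand to contain integer values, so use an integer code rather than a text label.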

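Merge Samples: If you sample each stratum in a separate call, merge the results back into a single FeatureCollection that you'll use for training your Random Forest (RF) model. A single stratifiedSample() call over a combined stratum band already returns one collection, so no merge is needed in that case.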

Training: Use the stratified and balanced sample to train your RF model.
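The stratum definition and sampling steps above can be tried out locally before touching GEE. The following is a minimal standard-library sketch, not GEE code: the integer codes, the synthetic stratum sizes, and the helper names `stratum_id` and `sample_per_stratum` are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

# Hypothetical integer codes, chosen for this sketch only:
# flood: 0 = unflooded, 1 = flooded
# land cover: 0 = forest, 1 = urban, 2 = agricultural
N_LAND_COVER_CLASSES = 3


def stratum_id(flood_code, land_cover_code):
    """Combine a flood code and a land cover code into one integer stratum."""
    return flood_code * N_LAND_COVER_CLASSES + land_cover_code


def sample_per_stratum(pixels, n_per_stratum, seed=0):
    """Draw up to n_per_stratum pixels from every stratum present in `pixels`.

    `pixels` is a list of (flood_code, land_cover_code) tuples. A stratum with
    fewer than n_per_stratum pixels contributes all of its pixels.
    Returns a list of (pixel_index, stratum) tuples.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for index, (flood_code, land_cover_code) in enumerate(pixels):
        by_stratum[stratum_id(flood_code, land_cover_code)].append(index)

    sampled = []
    for stratum in sorted(by_stratum):
        indices = by_stratum[stratum]
        k = min(n_per_stratum, len(indices))
        sampled.extend((i, stratum) for i in rng.sample(indices, k))
    return sampled


# Synthetic, deliberately imbalanced pixel counts per (flood, land cover) pair
stratum_sizes = {
    (0, 0): 3000, (0, 1): 1500, (0, 2): 400,
    (1, 0): 60, (1, 1): 30, (1, 2): 10,
}
pixels = [pair for pair, size in stratum_sizes.items() for _ in range(size)]

sampled = sample_per_stratum(pixels, n_per_stratum=20)
counts = Counter(stratum for _, stratum in sampled)
print(dict(counts))  # {0: 20, 1: 20, 2: 20, 3: 20, 4: 20, 5: 10}
```

Stratum 5 (flooded agricultural) has only 10 pixels, so it falls short of the target of 20. Rare strata like this are worth watching for when sampling real flood data.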

Here's a simplified code snippet to illustrate the concept:

import ee

ee.Initialize()

# stratifiedSample() is a method of ee.Image, not ee.FeatureCollection, so this
# assumes flood status and land cover are single-band, integer-coded images.
# The asset paths are placeholders.
flood = ee.Image('your_flood_image_path').rename('flood_status')  # 0 = unflooded, 1 = flooded
land_cover = ee.Image('your_land_cover_image_path').rename('land_cover')  # 0 = forest, 1 = urban, 2 = agricultural

# Number of land cover classes (assumes codes run from 0 to n - 1)
n_land_cover_classes = 3

# Number of samples per stratum
samples_per_stratum = 100

# Define the strata: one integer per (flood status, land cover) combination,
# e.g. 0 = unflooded forest, 4 = flooded urban
stratum = flood.multiply(n_land_cover_classes).add(land_cover).toInt().rename('stratum')

# Stack the bands to sample; add your predictor bands here as well
image = flood.addBands(land_cover).addBands(stratum)

# Stratified sampling: numPoints is drawn for each distinct value of classBand,
# so a single call covers every stratum and returns one FeatureCollection
sampled_points = image.stratifiedSample(
    numPoints=samples_per_stratum,
    classBand='stratum',
    region=your_bounding_box,  # Define your bounding box (ee.Geometry)
    scale=your_scale,  # Set your scale in meters
    seed=your_seed,  # Optional seed for reproducibility
    geometries=True  # Keep point geometries on the samples
)

# Proceed with your training using 'sampled_points'
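
Before training, it may be worth checking that every stratum actually reached its target count, since a rare stratum can come back with fewer samples than requested. The sketch below is plain Python over a dictionary of counts; the helper name `find_shortfalls`, the example counts, and the assumed dictionary shape (string keys, as a histogram fetched from GEE would likely have) are hypothetical.

```python
def find_shortfalls(histogram, expected_strata, target):
    """Return {stratum: count} for strata with fewer than `target` samples.

    `histogram` maps a stratum identifier to its sample count; strata missing
    from it are treated as having zero samples.
    """
    shortfalls = {}
    for stratum in expected_strata:
        count = int(histogram.get(stratum, 0))
        if count < target:
            shortfalls[stratum] = count
    return shortfalls


# Hypothetical counts. In GEE, a dictionary like this could come from
# sampled_points.aggregate_histogram('<stratum property>').getInfo()
example_histogram = {"0": 100, "1": 100, "2": 100, "3": 100, "4": 37}
expected_strata = [str(i) for i in range(6)]

print(find_shortfalls(example_histogram, expected_strata, target=100))
# {'4': 37, '5': 0}
```

If any stratum falls short, options include lowering the per-stratum target, enlarging the bounding box, or accepting the imbalance and noting it when evaluating the model.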