CIGLR-ai-lab / GreatLakes-TempSensors

Collaborative repository for optimizing the placement of temperature sensors in the Great Lakes using the DeepSensor machine learning framework. Aiming to enhance the quantitative understanding of surface temperature variability for better environmental monitoring and decision-making.
MIT License

Bug Report: Handling Latitude-Longitude Pairs in `DataProcessor` #27

Closed DaniJonesOcean closed 2 months ago

DaniJonesOcean commented 2 months ago


Description

DataProcessor does not handle a dataset that consists only of latitude and longitude (lat-lon) pairs. It expects at least one data column, and setting the index on the two coordinate columns leaves a DataFrame with no columns, which pandas reports as empty. This is confusing for users who only want to manage spatial locations without additional variables.

Steps to Reproduce

  1. Create a DataFrame of Lat-Lon Pairs:

    import pandas as pd
    
    # Sample data
    lat = [48.061, 45.344, 45.351, 47.585, 41.677]
    lon = [-87.793, -86.411, -82.840, -86.585, -82.398]
    
    # Creating DataFrame
    buoy_df = pd.DataFrame(data={'lat': lat, 'lon': lon})
    print("Initial DataFrame:\n", buoy_df)

  2. Set Index and Attempt Processing:

    # Setting index
    buoy_df.set_index(['lat', 'lon'], inplace=True)
    print("DataFrame with set index:\n", buoy_df)
    
    # Initialize DataProcessor
    from deepsensor.data import DataProcessor
    data_processor = DataProcessor(x1_name="lat", x2_name="lon")
    
    # Process DataFrame
    buoy_ds = data_processor(buoy_df)
    print("Processed DataFrame:\n", buoy_ds)

Expected Behavior

The DataProcessor should handle simple lat-lon pairs gracefully when only spatial coordinates are provided, without requiring additional "dummy" variables and without producing an empty DataFrame.

Actual Behavior

Processing fails unless a data column is introduced. Setting the index on the lat-lon pairs with no additional data leaves a DataFrame with no columns, which pandas reports as empty. The result is confusion and unnecessary complexity.
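The empty result can be confirmed with pandas alone, without involving DeepSensor. The sketch below reuses the sample coordinates from the reproduction steps. The `presence` column is a hypothetical placeholder name chosen for illustration; it is not something DeepSensor requires or recognizes.

```python
import pandas as pd

lat = [48.061, 45.344, 45.351, 47.585, 41.677]
lon = [-87.793, -86.411, -82.840, -86.585, -82.398]

# Moving every column into the index leaves no data columns behind.
coords_only = pd.DataFrame({"lat": lat, "lon": lon}).set_index(["lat", "lon"])
print(coords_only.empty)          # True: a frame with zero columns counts as empty
print(len(coords_only.index))     # 5: the coordinates survive in the MultiIndex
print(list(coords_only.columns))  # []

# Hypothetical workaround: add a placeholder column before setting the index.
with_placeholder = (
    pd.DataFrame({"lat": lat, "lon": lon})
    .assign(presence=1.0)  # "presence" is an illustrative name only
    .set_index(["lat", "lon"])
)
print(with_placeholder.empty)     # False
```

The coordinates are not lost when the index is set; they simply move out of the columns, which is why the frame prints as empty.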

Suggested Solution

  1. Option to Handle Lat-Lon Pairs Directly:

    • Implement a method or mode within DataProcessor for handling datasets consisting solely of spatial coordinate data.
  2. Improved Error Messaging:

    • Raise clearer error messages that tell users what is wrong when a data-structure problem is detected, for example a DataFrame with no data columns.
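As a rough illustration of the second suggestion, the check below raises a descriptive error when a DataFrame has index levels but no data columns. `check_has_data_columns` is a hypothetical helper written for this report; it is not part of the DeepSensor API, and where such a check would live inside DataProcessor is left open.

```python
import pandas as pd


def check_has_data_columns(df: pd.DataFrame) -> None:
    """Hypothetical pre-check; not part of the DeepSensor API."""
    if df.shape[1] == 0:
        index_names = [name for name in df.index.names if name is not None]
        raise ValueError(
            "DataFrame has no data columns; only index levels "
            f"{index_names} were found. Add at least one variable column "
            "before processing."
        )


coords_only = pd.DataFrame(
    {"lat": [48.061, 45.344], "lon": [-87.793, -86.411]}
).set_index(["lat", "lon"])

try:
    check_has_data_columns(coords_only)
except ValueError as err:
    message = str(err)
    print(message)
```

Naming the index levels in the message tells the user that the coordinates were detected and that only a variable column is missing.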

Example Fix Using xarray (Optional Normalization)

    import pandas as pd
    import xarray as xr
    from deepsensor.data import DataProcessor

    # Sample lat-lon data
    lat = [48.061, 45.344, 45.351, 47.585, 41.677]
    lon = [-87.793, -86.411, -82.840, -86.585, -82.398]

    # Create DataFrame
    buoy_df = pd.DataFrame(data={'lat': lat, 'lon': lon})
    print("Initial DataFrame:\n", buoy_df)

    # Convert to xarray Dataset.
    # With no data columns, this Dataset is expected to hold only the
    # lat/lon coordinates and no data variables.
    buoy_ds = buoy_df.set_index(['lat', 'lon']).to_xarray()
    print("xarray Dataset:\n", buoy_ds)

    # Initialize DataProcessor
    data_processor = DataProcessor(x1_name="lat", x2_name="lon")

    # Process using DataProcessor (optional normalization)
    buoy_normalized = data_processor(buoy_ds, method="mean_std")
    print("Processed Data:\n", buoy_normalized)

Additional Context

This flexibility matters for applications where users only need to manage spatial locations without additional measured variables. Supporting it would make the processing interface more streamlined and user-friendly.

Requested Action

Please review this behavior so that DataProcessor can handle pure lat-lon spatial data gracefully. Consider improving the error messages and the documentation so that users understand which data structures are required.

DaniJonesOcean commented 2 months ago

@eredding02 As discussed earlier, I think this one is related to our new mask approach. I think it actually makes sense that DeepSensor can't do much with a list of lat/lon points - it probably needs to be formulated as a density channel!

Closing and redirecting to this issue, which I think is the logical next step:

https://github.com/CIGLR-ai-lab/GreatLakes-TempSensors/issues/35
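One way to read the "density channel" idea from the comment above is to bin the lat/lon points onto a regular grid so that each cell holds a count (or fraction) of points. The sketch below does this with `numpy.histogram2d`. The bounding box and the 1-degree resolution are assumptions chosen for illustration, and whether this matches the mask approach discussed in the linked issue is not confirmed here.

```python
import numpy as np

# Sample buoy locations from the bug report.
lat = np.array([48.061, 45.344, 45.351, 47.585, 41.677])
lon = np.array([-87.793, -86.411, -82.840, -86.585, -82.398])

# Assumed bounding box and resolution (illustrative values only).
lat_edges = np.linspace(41.0, 49.0, 9)     # 8 cells, 1 degree each
lon_edges = np.linspace(-93.0, -76.0, 18)  # 17 cells, 1 degree each

# Count how many points fall in each (lat, lon) grid cell.
counts, _, _ = np.histogram2d(lat, lon, bins=[lat_edges, lon_edges])

# Normalize so the grid sums to 1, giving a simple point-density field.
density = counts / counts.sum()

print(counts.shape)       # (8, 17)
print(int(counts.sum()))  # 5: every sample point lies inside the assumed box
```

A gridded field like this has a value in every cell, unlike a bare list of points, so it could be wrapped in an `xarray.DataArray` with `lat` and `lon` coordinates and treated like any other gridded variable.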