Labic-ICMC-USP / CLIMADA-BR

Python (3.8+) version of CLIMADA
GNU General Public License v3.0

LLM-based Hazard Regularization using GNNs #11

Open rmarcacini opened 12 hours ago

Objective: In the previous dataset, severity levels were directly estimated using a Large Language Model (LLM). While the LLM performs well in distinguishing cases with extreme severity values, it produces less accurate results in situations where semantic overlap occurs between events.

To improve accuracy, we propose using only the events with extreme severity values (where the LLM is more reliable) and leveraging the relationships between events to create a more regularized dataset. After this process, we will filter for high-severity events only and use them as input for CLIMADA-BR. This approach aims to enhance data quality and ensure that only the most critical events are considered in climate impact analysis.

Use the dengue dataset to obtain a regularized event dataset by constructing a graph and applying a Graph Convolutional Network (GCN). This process will use SBERT for feature extraction, NetworkX and Scikit-Learn for graph construction, and a GCN to classify the events with intermediate severity levels.


Steps

  1. Feature Extraction with SBERT:

    • Load the dengue CSV dataset from the previous step.
    • For each event, generate a feature vector using Sentence-BERT (SBERT).
    • Example code:

      from sentence_transformers import SentenceTransformer
      import pandas as pd
      
      # Load dengue data
      df = pd.read_csv('dengue_data.csv')
      
      # Initialize SBERT model
      model = SentenceTransformer('all-MiniLM-L6-v2')
      
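      # Generate feature vectors for each event in one batched call
      # (faster than encoding row by row), then store one vector per row
      embeddings = model.encode(df['event_description'].tolist())
      df['features'] = list(embeddings)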
  2. Graph Construction:

    • Create a graph using NetworkX, where each node is an event with an SBERT-generated feature vector.
    • Connect nodes based on feature similarity using Scikit-Learn’s NearestNeighbors.
    • Example code:

      import networkx as nx
      from sklearn.neighbors import NearestNeighbors
      import numpy as np
      
      # Convert feature vectors to numpy array
      X = np.vstack(df['features'].values)
      
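      # Use Nearest Neighbors to find connections. Each point is normally returned
      # as its own nearest neighbor, so n_neighbors=5 links an event to its 4
      # closest other events once self-matches are skipped.
      nn = NearestNeighbors(n_neighbors=5, metric='cosine').fit(X)
      distances, indices = nn.kneighbors(X)
      
      # Build the graph; add nodes first so node order matches the row order of df
      G = nx.Graph()
      G.add_nodes_from(range(len(df)))
      for idx, neighbor_ids in enumerate(indices):
          for neighbor in neighbor_ids:
              if idx != neighbor:  # avoid self-loops
                  G.add_edge(idx, int(neighbor))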
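
For reference, the `metric='cosine'` setting ranks neighbors by cosine distance, that is, one minus the cosine similarity of the two feature vectors. The dependency-free sketch below only illustrates that quantity; the helper name `cosine_distance` and the toy vectors are hypothetical and not part of the pipeline.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors standing in for SBERT embeddings (hypothetical values)
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])   # close to 0: very similar
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])       # 1: unrelated
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])        # 2: opposite direction
print(same_direction, orthogonal, opposite)
```

Because the distance depends only on direction, events whose descriptions differ in length but not in meaning can still end up as neighbors.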
  3. Label Assignment:

    • Label nodes based on the severity of each event:
      • Positive Label: High severity values.
      • Negative Label: Low severity values.
      • No Label: Intermediate severity values.
    • Example code:

      # Define thresholds for positive and negative labels (example, adjust intervals)
      high_severity_threshold = df['severity'].quantile(0.9)
      low_severity_threshold = df['severity'].quantile(0.1)
      
      labels = {}
      for idx, severity in enumerate(df['severity']):
          if severity >= high_severity_threshold:
              labels[idx] = 1  # Positive
          elif severity <= low_severity_threshold:
              labels[idx] = -1  # Negative
          else:
              labels[idx] = 0  # No label
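
Because the GCN spreads label information along edges, it may be worth checking before training that every connected component of the graph contains at least one labeled node. Intermediate events in a component without seeds get no guidance from the graph structure. The following is a minimal, dependency-free sketch of such a check. The function name `components_without_seeds` and the toy inputs are hypothetical; in the pipeline the edges would come from `G.edges()` and the labels from the `labels` dictionary above.

```python
from collections import defaultdict, deque

def components_without_seeds(n_nodes, edges, labels):
    """Return connected components (as sorted lists) that contain no labeled node.

    labels uses the convention above: 1 = positive, -1 = negative, 0 = no label.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    seen = set()
    unseeded = []
    for start in range(n_nodes):
        if start in seen:
            continue
        # Breadth-first search collects one connected component
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if all(labels.get(node, 0) == 0 for node in component):
            unseeded.append(sorted(component))
    return unseeded

# Toy example: nodes 0-2 form one component, nodes 3-4 another
toy_edges = [(0, 1), (1, 2), (3, 4)]
toy_labels = {0: 1, 1: 0, 2: -1, 3: 0, 4: 0}
print(components_without_seeds(5, toy_edges, toy_labels))  # [[3, 4]]
```

If the returned list is not empty, one option is to raise `n_neighbors` or relax the quantile thresholds so that more nodes receive a seed label.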
  4. GCN Classification:

    • Apply a GCN classifier to predict labels for intermediate nodes.
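    • Example code using Scikit-Network's GNN classifier with graph convolution layers (API as in the linked tutorial; check it against the installed version):

      import numpy as np
      from scipy import sparse
      from sknetwork.gnn import GNNClassifier
      
      # Prepare adjacency matrix and features for the GCN (rows follow the order of df)
      adjacency = sparse.csr_matrix(nx.adjacency_matrix(G, nodelist=list(range(len(df)))))
      features = np.vstack(df['features'].values)
      
      # Known labels (use only labeled nodes for training). Scikit-Network ignores
      # negative labels, so map: high -> 1, low -> 0, intermediate -> -1 (unknown)
      train_labels = np.array([{1: 1, -1: 0, 0: -1}[labels[i]] for i in range(len(df))])
      
      # Initialize and train GCN https://scikit-network.readthedocs.io/en/master/tutorials/gnn/gnn_classifier.html
      # (hidden size 16 and n_epochs=200 are placeholder values to tune)
      gcn = GNNClassifier(dims=[16, 2], layer_types='Conv')
      predicted = gcn.fit_predict(adjacency, features, train_labels, n_epochs=200, random_state=42)
      df['gcn_label'] = np.where(predicted == 1, 1, -1)  # back to the 1 / -1 convention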
  5. Event Filtering:

    • Remove the events that the GCN classified with a negative label (low severity), so that only high-severity events are passed to CLIMADA-BR.
    • Example code:

      # Filter data based on GCN classification
      filtered_df = df[df['gcn_label'] != -1]
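
Before handing the filtered events to CLIMADA-BR, a quick consistency check can show how the intermediate events were split and whether the classifier changed any seed label. This is a minimal sketch, assuming the seed labels use the 1 / -1 / 0 convention from step 3 and the GCN output uses 1 / -1. The function name `summarize_regularization` and the toy lists are hypothetical.

```python
from collections import Counter

def summarize_regularization(seed_labels, gcn_labels):
    """Compare seed labels (1, -1, 0 = unlabeled) with GCN output (1 or -1)."""
    if len(seed_labels) != len(gcn_labels):
        raise ValueError("seed_labels and gcn_labels must have the same length")
    summary = Counter()
    for seed, predicted in zip(seed_labels, gcn_labels):
        if seed == 0:
            # Intermediate event: record which side the GCN placed it on
            key = "intermediate_to_high" if predicted == 1 else "intermediate_to_low"
        elif seed == predicted:
            key = "seed_kept"
        else:
            key = "seed_flipped"
        summary[key] += 1
    return dict(summary)

# Toy data (hypothetical): six events, three of them intermediate
seed = [1, 0, 0, -1, 0, 1]
gcn = [1, 1, -1, -1, 1, -1]
print(summarize_regularization(seed, gcn))
```

With the DataFrame from the steps above, the inputs would be `[labels[i] for i in range(len(df))]` and `df['gcn_label'].tolist()`. A large `seed_flipped` count would suggest revisiting the thresholds or the graph construction.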

Expected Outcomes